Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University


Page 1: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Interconnects (Part I)

Prof. Onur Mutlu, Carnegie Mellon University

Page 2: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

2

Memory Interference and QoS Lectures: These slides are from a lecture delivered at Bogazici University (June 10, 2013)

Videos: https://www.youtube.com/playlist?list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj

https://www.youtube.com/watch?v=rKRFlpavGs8&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=3

https://www.youtube.com/watch?v=go34f7v9KIs&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=4

https://www.youtube.com/watch?v=nWDoUqoZY4s&list=PLVngZ7BemHHV6N0ejHhwOfLwTr8Q-UKXj&index=5

Page 3: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multi-Core Architectures and Shared Resource Management

Lecture 3: Interconnects

Prof. Onur Mutlu
http://www.ece.cmu.edu/~omutlu
[email protected]
Carnegie Mellon University

June 10, 2013

Page 4: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Last Lecture
Wrap up Asymmetric Multi-Core Systems: Handling Private Data Locality, Asymmetry Everywhere
Resource Sharing vs. Partitioning
Cache Design for Multi-core Architectures: MLP-aware Cache Replacement, The Evicted-Address Filter Cache, Base-Delta-Immediate Compression, Linearly Compressed Pages, Utility-Based Cache Partitioning, Fair Shared Cache Partitioning, Page Coloring Based Cache Partitioning

4

Page 5: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Agenda for Today Interconnect design for multi-core systems

(Prefetcher design for multi-core systems)

(Data Parallelism and GPUs)

5

Page 6: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Readings for Lecture June 6 (Lecture 1.1) Required – Symmetric and Asymmetric Multi-Core Systems

Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003, IEEE Micro 2003.

Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009, IEEE Micro 2010.

Suleman et al., “Data Marshaling for Multi-Core Architectures,” ISCA 2010, IEEE Micro 2011.

Joao et al., “Bottleneck Identification and Scheduling for Multithreaded Applications,” ASPLOS 2012.

Joao et al., “Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs,” ISCA 2013.

Recommended
Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967.
Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996.
Mutlu et al., “Techniques for Efficient Processing in Runahead Execution Engines,” ISCA 2005, IEEE Micro 2006.

6

Page 8: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Readings for Lecture June 7 (Lecture 1.2) Required – Caches in Multi-Core

Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2005.

Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012.

Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.

Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” SAFARI Technical Report 2013.

Recommended
Qureshi et al., “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.

8

Page 9: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Videos for Lecture June 7 (Lecture 1.2) Cache basics:

http://www.youtube.com/watch?v=TpMdBrM1hVc&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=23

Advanced caches: http://www.youtube.com/watch?v=TboaFbjTd-E&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=24

9

Page 10: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Readings for Lecture June 10 (Lecture 1.3) Required – Interconnects in Multi-Core

Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.

Fallin et al., “CHIPPER: A Low-Complexity Bufferless Deflection Router,” HPCA 2011.

Fallin et al., “MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect,” NOCS 2012.

Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.

Das et al., “Aergia: Exploiting Packet Latency Slack in On-Chip Networks,” ISCA 2010, IEEE Micro 2011.

Recommended
Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.

Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011, IEEE Micro 2012.

10

Page 11: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

11

More Readings for Lecture 1.3
Studies of congestion and congestion control in on-chip vs. internet-like networks:

George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects," Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx)

George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?" Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010. Slides (ppt) (key)

Page 12: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Videos for Lecture June 10 (Lecture 1.3) Interconnects

http://www.youtube.com/watch?v=6xEpbFVgnf8&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=33

GPUs and SIMD processing Vector/array processing basics:

http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15

GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19

GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20

12

Page 13: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Readings for Prefetching Prefetching

Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007.

Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” MICRO 2009.

Ebrahimi et al., “Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems,” HPCA 2009.

Ebrahimi et al., “Prefetch-Aware Shared Resource Management for Multi-Core Systems,” ISCA 2011.

Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008.

Recommended
Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of Prefetching,” MICRO 2009.

13

Page 15: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Readings for GPUs and SIMD GPUs and SIMD processing

Narasiman et al., “Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,” MICRO 2011.

Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013.

Jog et al., “Orchestrated Scheduling and Prefetching for GPGPUs,” ISCA 2013.

Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro 2008.

Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” MICRO 2007.

15

Page 16: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Videos for GPUs and SIMD GPUs and SIMD processing

Vector/array processing basics: http://www.youtube.com/watch?v=f-XL4BNRoBA&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=15

GPUs versus other execution models: http://www.youtube.com/watch?v=dl5TZ4-oao0&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=19

GPUs in more detail: http://www.youtube.com/watch?v=vr5hbSkb1Eg&list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ&index=20

16

Page 17: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Online Lectures and More Information Online Computer Architecture Lectures

http://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59REog9jDnPDTG6IJ

Online Computer Architecture Courses
Intro: http://www.ece.cmu.edu/~ece447/s13/doku.php
Advanced: http://www.ece.cmu.edu/~ece740/f11/doku.php
Advanced: http://www.ece.cmu.edu/~ece742/doku.php

Recent Research Papers
http://users.ece.cmu.edu/~omutlu/projects.htm
http://scholar.google.com/citations?user=7XyGUGkAAAAJ&hl=en

17

Page 18: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Interconnect Basics

18

Page 19: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Interconnect in a Multi-Core System

19

[Figure: multi-core chip in which an interconnect links the cores to shared storage]

Page 20: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Where Is Interconnect Used? To connect components

Many examples: processors and processors, processors and memories (banks), processors and caches (banks), caches and caches, I/O devices

20

[Figure: components connected through an interconnection network]

Page 21: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Why Is It Important? Affects the scalability of the system

How large of a system can you build? How easily can you add more processors?

Affects performance and energy efficiency: How fast can processors, caches, and memory communicate? How long are the latencies to memory? How much energy is spent on communication?

21

Page 22: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Interconnection Network Basics Topology

Specifies the way switches are wired; affects routing, reliability, throughput, latency, ease of building

Routing (algorithm) How does a message get from source to destination Static or adaptive

Buffering and Flow Control What do we store within the network?

Entire packets, parts of packets, etc? How do we throttle during oversubscription? Tightly coupled with routing strategy

22

Page 23: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Topology: Bus (simplest), Point-to-point connections (ideal and most costly), Crossbar (less costly), Ring, Tree, Omega, Hypercube, Mesh, Torus, Butterfly, …

23

Page 24: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Metrics to Evaluate Interconnect Topology: Cost; Latency (in hops, in nanoseconds); Contention

Many others exist that you should think about: Energy, Bandwidth, Overall system performance

24

Page 25: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bus
+ Simple
+ Cost effective for a small number of nodes
+ Easy to implement coherence (snooping and serialization)
- Not scalable to a large number of nodes (limited bandwidth; electrical loading → reduced frequency)
- High contention → fast saturation

25

[Figure: processors with private caches sharing a single bus to memory; eight nodes (0-7) on one bus]

Page 26: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Point-to-Point: Every node connected to every other
+ Lowest contention
+ Potentially lowest latency
+ Ideal, if cost is not an issue
-- Highest cost: O(N) connections/ports per node, O(N^2) links
-- Not scalable
-- How to lay out on chip?

26

[Figure: fully connected point-to-point topology over eight nodes (0-7)]

Page 27: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Crossbar: Every node connected to every other (non-blocking), except only one node can be using a given connection at any time. Enables concurrent sends to non-conflicting destinations. Good for a small number of nodes.
+ Low latency and high throughput
- Expensive
- Not scalable: O(N^2) cost
- Difficult to arbitrate as N increases

Used in core-to-cache-bank networks in IBM POWER5 and Sun Niagara I/II

27

[Figure: 8x8 crossbar connecting input nodes 0-7 to output nodes 0-7]

Page 28: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Another Crossbar Design

28

[Figure: alternative 8x8 crossbar organization]

Page 29: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Sun UltraSPARC T2 Core-to-Cache Crossbar: High-bandwidth interface between 8 cores and 8 L2 banks & NCU

4-stage pipeline: request, arbitration, selection, transmission

2-deep queue for each src/dest pair to hold data transfer requests

29

Page 30: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Buffered Crossbar

30

[Figure: buffered crossbar with per-crosspoint buffers, output arbiters, and flow control back to the network interfaces (NIs) of nodes 0-3, contrasted with a bufferless crossbar]

+ Simpler arbitration/scheduling

+ Efficient support for variable-size packets

- Requires N^2 buffers

Page 31: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Can We Get Lower Cost than A Crossbar? Yet still have low contention?

Idea: Multistage networks

31

Page 32: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multistage Logarithmic Networks
Idea: Indirect networks with multiple layers of switches between terminals/nodes
Cost: O(N log N), Latency: O(log N)
Many variations (Omega, Butterfly, Benes, Banyan, …)
Omega Network:

32

[Figure: 8x8 Omega network routing inputs 000-111 to outputs 000-111 through stages of 2x2 switches; two highlighted routes conflict on a shared link]
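To make the destination-tag routing of an Omega network concrete, here is a small illustrative Python sketch (not from the slides; the function name and encoding are assumptions). It models the log2(N) switching stages, each preceded by a perfect shuffle, with bit i of the destination address selecting the upper (0) or lower (1) output of the 2x2 switch at stage i:

# Destination-tag routing in an N x N Omega network (N a power of two).
# Returns, per stage, the switch visited and the output bit taken.
def omega_route(src: int, dst: int, n: int):
    stages = n.bit_length() - 1                  # log2(n) stages of 2x2 switches
    path, addr = [], src
    for stage in range(stages):
        addr = ((addr << 1) | (addr >> (stages - 1))) & (n - 1)  # perfect shuffle
        out_bit = (dst >> (stages - 1 - stage)) & 1              # destination tag bit
        addr = (addr & ~1) | out_bit             # 2x2 switch: pick upper/lower output
        path.append((stage, addr >> 1, out_bit)) # (stage, switch id, output port)
    assert addr == dst
    return path

print(omega_route(2, 6, 8))   # route input 010 to output 110 in an 8x8 network

Because each source-destination pair has exactly one such path, two messages whose paths need the same switch output conflict, as the figure above illustrates.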

Page 33: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multistage Circuit Switched

More restrictions on feasible concurrent Tx-Rx pairs, but more scalable than a crossbar in cost, e.g., O(N log N) for Butterfly

33

[Figure: 8x8 multistage circuit-switched network built from 2-by-2 crossbars]

Page 34: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multistage Packet Switched

Packets “hop” from router to router, pending availability of the next-required switch and buffer

34

[Figure: 8x8 multistage packet-switched network built from 2-by-2 routers]

Page 35: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Aside: Circuit vs. Packet Switching
Circuit switching sets up the full path: establish the route, then send data (no one else can use those links)
+ faster arbitration
-- setting up and bringing down links takes time
Packet switching routes per packet: route each packet individually (possibly via different paths); if a link is free, any packet can use it
-- potentially slower: must dynamically switch
+ no setup/teardown time
+ more flexible, does not underutilize links

35

Page 36: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Switching vs. Topology
The circuit/packet switching choice is independent of topology: it is a higher-level protocol for how a message gets sent to a destination.

However, some topologies are more amenable to circuit vs. packet switching

36

Page 37: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Another Example: Delta Network
Single path from source to destination
Does not support all possible permutations
Proposed to replace costly crossbars as the processor-memory interconnect
Janak H. Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979.

37

[Figure: 8x8 Delta network]

Page 38: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Another Example: Omega Network
Single path from source to destination
All stages are the same
Used in the NYU Ultracomputer
Gottlieb et al., “The NYU Ultracomputer – designing a MIMD, shared-memory parallel machine,” ISCA 1982.

38

Page 39: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Ring
+ Cheap: O(N) cost
- High latency: O(N)
- Not easy to scale
- Bisection bandwidth remains constant

Used in Intel Haswell, Intel Larrabee, IBM Cell, many commercial systems today

39

[Figure: ring of router stops (S), each attaching a processor (P) and memory (M)]

Page 40: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Unidirectional Ring

Simple topology and implementation: reasonable performance if N and the performance needs (bandwidth & latency) are still moderately low. O(N) cost. N/2 average hops; latency depends on utilization.

40

[Figure: unidirectional ring of N nodes (0, 1, …, N-2, N-1), each attached through a 2x2 router]
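As a quick sanity check of the N/2 average-hop figure (and of N/4 for the bidirectional ring on the next slide), here is an illustrative Python sketch, not from the slides, that enumerates all source/destination pairs:

# Average hop count on a ring of n nodes, uniform random src/dst pairs.
def avg_hops(n: int, bidirectional: bool = False) -> float:
    total = pairs = 0
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            fwd = (dst - src) % n                      # hops going one way around
            total += min(fwd, n - fwd) if bidirectional else fwd
            pairs += 1
    return total / pairs

print(avg_hops(16))         # 8.0  = N/2 for a unidirectional ring
print(avg_hops(16, True))   # ~4.3, approaching N/4 for a bidirectional ring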

Page 41: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bidirectional Rings
+ Reduces latency
+ Improves scalability
- Slightly more complex injection policy (need to select which ring to inject a packet into)

41

Page 42: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Hierarchical Rings

+ More scalable
+ Lower latency
- More complex

42

Page 43: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

More on Hierarchical Rings
Chris Fallin, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, Greg Nazario, Reetuparna Das, and Onur Mutlu, "HiRD: A Low-Complexity, Energy-Efficient Hierarchical Ring Interconnect," SAFARI Technical Report, TR-SAFARI-2012-004, Carnegie Mellon University, December 2012.

Discusses the design and implementation of a mostly-bufferless hierarchical ring

43

Page 44: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Mesh
O(N) cost; average latency: O(sqrt(N))
Easy to lay out on-chip: regular and equal-length links
Path diversity: many ways to get from one node to another

Used in the Tilera 100-core and many on-chip network prototypes

44

Page 45: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Torus
A mesh is not symmetric at its edges: performance is very sensitive to the placement of a task on an edge vs. middle node. A torus avoids this problem.
+ Higher path diversity (and bisection bandwidth) than a mesh
- Higher cost
- Harder to lay out on-chip
- Unequal link lengths

45

Page 46: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Torus, continued
Weave nodes to make inter-node latencies ~constant

46

[Figure: folded torus layout weaving the nodes (each with switch, memory, processor) so link lengths stay roughly equal]

Page 47: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Trees
Planar, hierarchical topology. Latency: O(log N). Good for local traffic.
+ Cheap: O(N) cost
+ Easy to lay out
- Root can become a bottleneck; fat trees avoid this problem (CM-5)

47

[Figure: H-Tree and Fat Tree layouts]

Page 48: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

CM-5 Fat Tree
Fat tree based on 4x2 switches
Randomized routing on the way up
Combining, multicast, and reduction operators supported in hardware
Thinking Machines Corp., “The Connection Machine CM-5 Technical Summary,” Jan. 1992.

48

Page 49: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Hypercube

Latency: O(log N), Radix: O(log N), #links: O(N log N)
+ Low latency
- Hard to lay out in 2D/3D

49

[Figure: hypercubes of dimension 0-D through 4-D; in the 4-D cube, node labels 0000-1111 differ by exactly one bit across each link]
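Since each hypercube link flips exactly one address bit, the hop distance between two nodes is the Hamming distance of their IDs, and a minimal route fixes one differing bit per hop. An illustrative Python sketch (assumed node-ID encoding, not from the slides):

# One minimal route between hypercube nodes: fix differing bits LSB-first.
def hypercube_route(src: int, dst: int):
    node = src
    yield node
    diff, bit = src ^ dst, 0
    while diff:
        if diff & 1:              # this dimension differs: traverse its link
            node ^= 1 << bit
            yield node
        diff >>= 1
        bit += 1

print(list(hypercube_route(0b0000, 0b1011)))    # [0, 1, 3, 11]: 3 hops
print(bin(0b0000 ^ 0b1011).count("1"))          # Hamming distance = 3 <= log2(N)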

Page 50: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Caltech Cosmic Cube
64-node message passing machine
Seitz, “The Cosmic Cube,” CACM 1985.

50

Page 51: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Handling Contention

Two packets trying to use the same link at the same time

What do you do? Buffer one Drop one Misroute one (deflection)

Tradeoffs?

51

Page 52: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bufferless Deflection Routing
Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected.¹
New traffic can be injected whenever there is a free output link.

52

¹ Baran, “On Distributed Communication Networks,” RAND Tech. Report, 1962 / IEEE Trans. Comm., 1964.

Page 53: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bufferless Deflection Routing
Input buffers are eliminated: flits are buffered in pipeline latches and on network links

53

[Figure: router with the input buffers removed; the North/South/East/West/Local input ports feed the deflection routing logic directly, which drives the corresponding output ports]

Page 54: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Routing Algorithm Types

Deterministic: always chooses the same path for a communicating source-destination pair

Oblivious: chooses different paths, without considering network state

Adaptive: can choose different paths, adapting to the state of the network

How to adapt Local/global feedback Minimal or non-minimal paths

54

Page 55: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Deterministic Routing
All packets between the same (source, dest) pair take the same path
Dimension-order routing: e.g., XY routing (used in the Cray T3D and many on-chip networks): first traverse dimension X, then traverse dimension Y
+ Simple
+ Deadlock freedom (no cycles in resource allocation)
- Could lead to high contention
- Does not exploit path diversity

55
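A minimal sketch of XY routing, assuming (x, y) mesh coordinates (illustrative only, not from the slides). Because every packet exhausts its X hops before taking any Y hop, channel dependences can never form a cycle, which is the deadlock-freedom argument above:

# XY dimension-order routing: return the next output port toward dst.
def xy_route(cur: tuple[int, int], dst: tuple[int, int]) -> str:
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # always finish the X dimension first
        return "E" if dx > cx else "W"
    if cy != dy:                       # only then move in the Y dimension
        return "N" if dy > cy else "S"
    return "EJECT"                     # arrived at the destination

STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
pos, hops = (0, 0), []
while (port := xy_route(pos, (2, 1))) != "EJECT":
    hops.append(port)
    pos = (pos[0] + STEP[port][0], pos[1] + STEP[port][1])
print(hops)                            # ['E', 'E', 'N']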

Page 56: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Deadlock
No forward progress. Caused by circular dependencies on resources: each packet waits for a buffer occupied by another packet downstream.

56

Page 57: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Handling Deadlock Avoid cycles in routing

Dimension order routing Cannot build a circular dependency

Restrict the “turns” each packet can take

Avoid deadlock by adding more buffering (escape paths)

Detect and break deadlock Preemption of buffers

57

Page 58: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Turn Model to Avoid Deadlock
Idea: Analyze the directions in which packets can turn in the network; determine the cycles that such turns can form; prohibit just enough turns to break the possible cycles (a sketch of one instance follows below).

Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992.

58
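As one concrete instance, here is a hedged sketch of the west-first turn model from the paper: all westward movement happens first, and turns into the West direction are prohibited afterwards, which removes just enough turns to break every cycle. The port encoding and function name are illustrative assumptions:

# West-first routing on a 2D mesh: the set of ports a packet may take.
def west_first_ports(cur: tuple[int, int], dst: tuple[int, int]) -> set[str]:
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return {"W"}                   # west hops come first, no adaptivity
    ports = set()
    if dx > cx: ports.add("E")         # east/north/south may then be taken
    if dy > cy: ports.add("N")         # in any order (adaptive choice)
    if dy < cy: ports.add("S")
    return ports or {"EJECT"}

print(west_first_ports((3, 3), (1, 4)))   # {'W'}: must go west first
print(west_first_ports((1, 3), (4, 1)))   # {'E', 'S'}: adaptive choice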

Page 59: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Oblivious Routing: Valiant's Algorithm
An example of an oblivious algorithm. Goal: balance network load.
Idea: Randomly choose an intermediate destination, route to it first, then route from there to the destination. Between source-intermediate and intermediate-destination, dimension-order routing can be used (see the sketch below).
+ Randomizes/balances network load
- Non-minimal (packet latency can increase)
Optimizations: Do this only under high load. Restrict the intermediate node to be close (in the same quadrant).

59
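A minimal sketch of Valiant routing on a 2D mesh, using dimension-order (XY) routing for each of the two legs as the slide suggests; the coordinates and helper names are illustrative assumptions, not from the original algorithm description:

import random

def xy_step(pos, dst):
    """One XY-routing hop toward dst (X dimension first, then Y)."""
    (cx, cy), (dx, dy) = pos, dst
    if cx != dx:
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:
        return (cx, cy + (1 if dy > cy else -1))
    return pos

def valiant_path(src, dst, width, height):
    mid = (random.randrange(width), random.randrange(height))  # random intermediate
    path, pos = [src], src
    for target in (mid, dst):          # leg 1: src -> mid, leg 2: mid -> dst
        while pos != target:
            pos = xy_step(pos, target)
            path.append(pos)
    return mid, path

mid, path = valiant_path((0, 0), (3, 3), 8, 8)
print(mid, len(path) - 1)   # hop count >= 6 (the minimal distance), often more

Restricting mid to the quadrant spanned by src and dst, as the slide's optimization suggests, bounds the latency increase while keeping some of the load balancing.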

Page 60: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Adaptive Routing
Minimal adaptive: Router uses network state (e.g., downstream buffer occupancy) to pick which “productive” output port to send a packet to (productive output port: port that gets the packet closer to its destination)
+ Aware of local congestion
- Minimality restricts achievable link utilization (load balance)

Non-minimal (fully) adaptive: “Misroute” packets to non-productive output ports based on network state
+ Can achieve better network utilization and load balance
- Need to guarantee livelock freedom

60

Page 61: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

On-Chip Networks

61

[Figure: mesh of routers (R), each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]

• Connect cores, caches, memory controllers, etc. – buses and crossbars are not scalable
• Packet switched
• 2D mesh: most commonly used topology
• Primarily serve cache misses and memory requests

Page 62: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

On-chip Networks

62

[Figure: canonical virtual-channel router microarchitecture: input ports with buffers (VC 0, VC 1, VC 2 per port, with a VC identifier; From East/West/North/South/PE), a 5x5 crossbar to the output ports (To East/West/North/South/PE), and control logic made of a routing unit (RC), a VC allocator (VA), and a switch allocator (SA); the routers (R) sit in a mesh, each attached to a processing element (PE: cores, L2 banks, memory controllers, etc.)]

Page 63: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

On-Chip vs. Off-Chip Interconnects On-chip advantages

Low latency between cores No pin constraints Rich wiring resources Very high bandwidth Simpler coordination

On-chip constraints/disadvantages 2D substrate limits implementable topologies Energy/power consumption a key concern Complex algorithms undesirable Logic area constrains use of wiring resources

63

Page 64: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

On-Chip vs. Off-Chip Interconnects (II)
Cost: Off-chip: channels, pins, connectors, cables. On-chip: cost is storage and switches (wires are plentiful) → leads to networks with many wide channels and few buffers.
Channel characteristics: On chip, short distance → low latency. On-chip RC lines need repeaters every 1-2 mm; logic can be put in the repeaters.
Workloads: Multi-core cache traffic vs. supercomputer interconnect traffic.

64

Page 65: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Motivation for Efficient Interconnect
In many-core chips, the on-chip interconnect (NoC) consumes significant power:
Intel Terascale: ~28% of chip power
Intel SCC: ~10%
MIT RAW: ~36%

Recent work¹ uses bufferless deflection routing to reduce power and die area

65

[Figure: tile with core, L1, L2 slice, and router]

¹ Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009.

Page 66: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Research in Interconnects

Page 67: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Research Topics in Interconnects
Plenty of topics in interconnection networks. Examples:
Energy/power-efficient and proportional design
Reducing complexity: simplified router and protocol designs
Adaptivity: ability to adapt to different access patterns
QoS and performance isolation: reducing and controlling interference, admission control
Co-design of NoCs with other shared resources: end-to-end performance, QoS, power/energy optimization
Scalable topologies to many cores, heterogeneous systems
Fault tolerance
Request prioritization, priority inversion, coherence, …
New technologies (optical, 3D)

67

Page 68: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

One Example: Packet Scheduling
Which packet to choose for a given output port? The router needs to prioritize between competing flits: Which input port? Which virtual channel? Which application's packet?

Common strategies: round robin across virtual channels; oldest packet first, or an approximation (a sketch follows below); prioritize some virtual channels over others

Better policies in a multi-core environment: use application characteristics; minimize energy

68
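To make the “oldest packet first” strategy concrete, here is a hedged sketch of a per-output-port arbiter; the flit fields and the tie-breaking rule are illustrative assumptions:

from dataclasses import dataclass

@dataclass
class Flit:
    inject_cycle: int      # injection timestamp, used as the age proxy
    input_port: int
    out_port: str          # desired output port, e.g. "E"

def arbitrate(requests: list[Flit]) -> dict[str, Flit]:
    """Grant each requested output port to its oldest competing flit."""
    winners: dict[str, Flit] = {}
    for flit in sorted(requests, key=lambda f: (f.inject_cycle, f.input_port)):
        winners.setdefault(flit.out_port, flit)   # first seen = oldest wins
    return winners

reqs = [Flit(105, 0, "E"), Flit(98, 1, "E"), Flit(120, 2, "N")]
print(arbitrate(reqs))     # port "E" goes to the cycle-98 flit; "N" is uncontended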

Page 69: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Interconnects (Part I)

Prof. Onur Mutlu, Carnegie Mellon University

Page 70: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bufferless Routing

Thomas Moscibroda and Onur Mutlu, "A Case for Bufferless Routing in On-Chip Networks," Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 196-207, Austin, TX, June 2009. Slides (pptx)

70

Page 71: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

On-Chip Networks (NoC)
• Connect cores, caches, memory controllers, etc.
• Examples: Intel 80-core Terascale chip, MIT RAW chip
• Design goals in NoC design: high throughput, low latency; fairness between cores, QoS, …; low complexity, low cost; power, low energy consumption

Energy/Power in On-Chip Networks
• Power is a key constraint in the design of high-performance processors
• NoCs consume a substantial portion of system power: ~30% in Intel 80-core Terascale [IEEE Micro'07], ~40% in MIT RAW chip [ISCA'04]
• NoCs estimated to consume 100s of Watts [Borkar, DAC'07]

Page 72: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Current NoC Approaches
• Existing approaches differ in numerous ways:
• Network topology [Kim et al, ISCA'07; Kim et al, ISCA'08, etc.]
• Flow control [Michelogiannakis et al, HPCA'09; Kumar et al, MICRO'08, etc.]
• Virtual channels [Nicopoulos et al, MICRO'06, etc.]
• QoS & fairness mechanisms [Lee et al, ISCA'08, etc.]
• Routing algorithms [Singh et al, CAL'04]
• Router architecture [Park et al, ISCA'08]
• Broadcast, multicast [Jerger et al, ISCA'08; Rodrigo et al, MICRO'08]

Existing work assumes the existence of buffers in routers!

Page 73: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

A Typical Router
[Figure: input-buffered virtual-channel router: input ports 1..N each with VC buffers VC1..VCv, routing computation, a VC arbiter and a switch arbiter, an N x N crossbar driven by a scheduler to output channels 1..N, and credit flow back to the upstream routers]
Buffers are an integral part of existing NoC routers.

Page 74: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Buffers in NoC Routers
• Buffers are necessary for high network throughput → buffers increase the total available bandwidth in the network
[Plot: average packet latency vs. injection rate; larger buffers push the saturation point to higher injection rates (small, medium, large buffers)]

Page 75: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Buffers in NoC Routers
• Buffers are necessary for high network throughput → buffers increase the total available bandwidth in the network
• Buffers consume significant energy/power: dynamic energy when read/written, static energy even when not occupied
• Buffers add complexity and latency: logic for buffer management, virtual channel allocation, credit-based flow control
• Buffers require significant chip area: e.g., in the TRIPS prototype chip, input buffers occupy 75% of total on-chip network area [Gratz et al, ICCD'06]

Page 76: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Going Bufferless…?
• How much throughput do we lose? How is latency affected?
• Up to what injection rates can we use bufferless routing? Are there realistic scenarios in which the NoC is operated at injection rates below the threshold?
• Can we achieve energy reduction? If so, how much…?
• Can we reduce area, complexity, etc.…?
[Plot: latency vs. injection rate with and without buffers; the bufferless network saturates at a lower injection rate]

Page 77: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS: Bufferless Routing
• Always forward all incoming flits to some output port
• If no productive direction is available, send to another direction → the packet is deflected
→ Hot-potato routing [Baran'64, etc.]
[Figure: buffered routing vs. BLESS; on contention, one flit is Deflected! instead of buffered]

Page 78: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS: Bufferless Routing
[Figure: router pipeline in which the VC arbiter and switch arbiter are replaced by a flit-ranking and port-prioritization stage]
Arbitration policy:
Flit-Ranking: 1. Create a ranking over all incoming flits.
Port-Prioritization: 2. For a given flit in this ranking, find the best free output port.
Apply to each flit in order of ranking.

Page 79: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

FLIT-BLESS: Flit-Level Routing
• Each flit is routed independently
• Oldest-first arbitration (other policies evaluated in the paper)
• Network topology: can be applied to most topologies (Mesh, Torus, Hypercube, Trees, …) as long as 1) #output ports ≥ #input ports at every router, and 2) every router is reachable from every other router
• Flow control & injection policy: completely local; inject whenever an input port is free
• Absence of deadlocks: every flit is always moving
• Absence of livelocks: with oldest-first ranking

Flit-Ranking: 1. Oldest-first ranking
Port-Prioritization: 2. Assign the flit to a productive port, if possible; otherwise, assign it to a non-productive port (a sketch follows below).
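A hedged sketch of this rank-then-assign loop (the data structures are illustrative assumptions; the paper's hardware operates on port vectors, not Python sets):

from dataclasses import dataclass

@dataclass
class Flit:
    age: int                  # smaller = older (earlier injection)
    productive: set[str]      # ports that move the flit closer to its destination

def bless_assign(flits: list[Flit], ports=("N", "E", "S", "W")):
    free, assignment = set(ports), []
    for flit in sorted(flits, key=lambda f: f.age):        # 1. oldest-first ranking
        want = flit.productive & free
        port = min(want) if want else min(free)            # 2. productive if possible,
        free.remove(port)                                  #    otherwise deflected
        assignment.append((flit.age, port, port not in flit.productive))
    return assignment

flits = [Flit(10, {"E"}), Flit(3, {"E", "S"}), Flit(7, {"E"})]
print(bless_assign(flits))
# [(3, 'E', False), (7, 'N', True), (10, 'S', True)]: two flits are deflected

With at least as many output ports as input ports, every flit always receives some port, so no flit ever needs to be buffered.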

Page 80: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS with Buffers
• BLESS without buffers is the extreme end of a continuum
• BLESS can be integrated with buffers: FLIT-BLESS with buffers, WORM-BLESS with buffers
• Whenever a buffer is full, its first flit becomes must-schedule; must-schedule flits must be deflected if necessary

See paper for details…

Page 81: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity: no credit flows, no virtual channels, simplified router design
• No deadlocks, livelocks
• Adaptivity: packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at the receiver
• Header information at each flit
• Oldest-first arbitration complex
• QoS becomes difficult
Impact on energy…?

Page 82: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Reduction of Router Latency
• BLESS gets rid of input buffers and virtual channels
[Figure: pipeline diagrams for a baseline (speculative) router (router latency = 3, can be improved to 2 [Dally, Towles'04]), a standard BLESS router (router latency = 2), and an optimized BLESS router (router latency = 1, used in our evaluations), shown across two routers for head and body flits. BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal, LA LT: Link Traversal of Lookahead]

Page 83: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS: Advantages & Disadvantages
Advantages
• No buffers
• Purely local flow control
• Simplicity: no credit flows, no virtual channels, simplified router design
• No deadlocks, livelocks
• Adaptivity: packets are deflected around congested areas!
• Router latency reduction
• Area savings
Disadvantages
• Increased latency
• Reduced bandwidth
• Increased buffering at the receiver
• Header information at each flit
Impact on energy…? Extensive evaluations in the paper!

Page 84: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation Methodology
• 2D mesh network, router latency is 2 cycles:
o 4x4, 8 cores, 8 L2 cache banks (each node is a core or an L2 bank)
o 4x4, 16 cores, 16 L2 cache banks (each node is a core and an L2 bank)
o 8x8, 16 cores, 64 L2 cache banks (each node is an L2 bank and may be a core)
o 128-bit wide links, 4-flit data packets, 1-flit address packets
o For the baseline configuration: 4 VCs per physical input port, 1 packet deep
• Benchmarks:
o Multiprogrammed SPEC CPU2006 and Windows Desktop applications
o Heterogeneous and homogeneous application mixes
o Synthetic traffic patterns: UR, Transpose, Tornado, Bit Complement
• x86 processor model based on Intel Pentium M:
o 2 GHz processor, 128-entry instruction window
o 64 Kbyte private L1 caches
o 16 Mbyte total shared L2 caches; 16 MSHRs per bank
o DRAM model based on Micron DDR2-800

Page 85: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation Methodology
• Energy model provided by the Orion simulator [MICRO'02]: 70nm technology, 2 GHz routers at 1.0 Vdd
• For BLESS, we model: the additional energy to transmit header information; the additional buffers needed on the receiver side; the additional logic to reorder flits of individual packets at the receiver
• We partition network energy into buffer energy, router energy, and link energy, each having static and dynamic components
• Comparisons against non-adaptive and aggressive adaptive buffered routing algorithms (DO, MIN-AD, ROMM)

Evaluation Methodology

Page 86: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Synthetic Traces
• First, the bad news
• Uniform random injection
• BLESS has significantly lower saturation throughput compared to the buffered baseline.
[Plot: average latency vs. injection rate (flits per cycle per node) for FLIT-2, WORM-2, and FLIT-1 against the best buffered baseline; the BLESS variants saturate at lower injection rates]

Page 87: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Homogeneous Case Study
• milc benchmarks (moderately intensive)
• Perfect caches!
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)
[Charts: weighted speedup and normalized network energy (buffer energy + link energy) for DO, ROMM, WORM-2, WORM-1, MIN-AD, FLIT-2, and FLIT-1 on 4x4 8x milc, 4x4 16x milc, and 8x8 16x milc; Baseline vs. BLESS RL=1]

Page 88: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Homogeneous Case Study (same charts as the previous slide)
• milc benchmarks (moderately intensive)
• Perfect caches!
• Very little performance degradation with BLESS (less than 4% in the dense network)
• With router latency 1, BLESS can even outperform the baseline (by ~10%)
• Significant energy improvements (almost 40%)
Observations:
1) Injection rates are not extremely high on average → self-throttling!
2) For bursts and temporary hotspots, use the network links as buffers!

Page 89: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x → overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
See paper for details…
[Chart: weighted speedup for DO, ROMM, WORM-2, WORM-1, MIN-AD, FLIT-2, and FLIT-1 on 4x4 8x matlab, 4x4 16x matlab, and 8x8 16x matlab]

Page 90: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Further Results
• BLESS increases the buffer requirement at the receiver by at most 2x → overall, energy is still reduced
• Impact of memory latency: with real caches, very little slowdown! (at most 1.5%)
• Heterogeneous application mixes (we evaluate several mixes of intensive and non-intensive applications): little performance degradation; significant energy savings in all cases; no significant increase in unfairness across different applications
• Area savings: ~60% of network area can be saved!
See paper for details…

Page 91: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse Network            Perfect L2              Realistic L2
                          Average    Worst-Case   Average    Worst-Case
∆ Network Energy          -39.4%     -28.1%       -46.4%     -41.0%
∆ System Performance      -0.5%      -3.2%        -0.15%     -0.55%

[Charts: normalized network energy (buffer energy + link energy) and weighted speedup for FLIT, WORM, and BASE, mean and worst case]

Page 92: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation – Aggregate Results
• Aggregate results over all 29 applications

Sparse Network            Perfect L2              Realistic L2
                          Average    Worst-Case   Average    Worst-Case
∆ Network Energy          -39.4%     -28.1%       -46.4%     -41.0%
∆ System Performance      -0.5%      -3.2%        -0.15%     -0.55%

Dense Network             Perfect L2              Realistic L2
                          Average    Worst-Case   Average    Worst-Case
∆ Network Energy          -32.8%     -14.0%       -42.5%     -33.7%
∆ System Performance      -3.6%      -17.1%       -0.7%      -1.5%

Page 93: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

BLESS Conclusions
• For a very wide range of applications and network settings, buffers are not needed in the NoC
• Significant energy savings (32% even in dense networks with perfect caches)
• Area savings of 60%
• Simplified router and network design (flow control, etc.)
• Performance slowdown is minimal (performance can even increase!)
→ A strong case for a rethinking of NoC design!
• Future research: support for quality of service, different traffic classes, energy management, etc.

Page 94: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

CHIPPER: A Low-complexity Bufferless Deflection Router

Chris Fallin, Chris Craik, and Onur Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router," Proceedings of the 17th International Symposium on High-Performance Computer Architecture (HPCA), pages 144-155, San Antonio, TX, February 2011. Slides (pptx)

Page 95: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Motivation
Recent work has proposed bufferless deflection routing (BLESS [Moscibroda, ISCA 2009]):
Energy savings: ~40% in total NoC energy
Area reduction: ~40% in total NoC area
Minimal performance loss: ~4% on average

Unfortunately: unaddressed complexities in the router → long critical path, large reassembly buffers

Goal: obtain these benefits while simplifying the router, in order to make bufferless NoCs practical.

95

Page 96: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Problems that Bufferless Routers Must Solve
1. Must provide livelock freedom: a packet should not be deflected forever
2. Must reassemble packets upon arrival

96

[Figure: a packet is one or multiple flits, numbered 0-3; a flit is the atomic routing unit]

Page 97: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

A Bufferless Router: A High-Level View
[Figure: router with inject and eject paths to the local node, deflection routing logic, a crossbar, and reassembly buffers; livelock freedom is Problem 1, packet reassembly is Problem 2]

97

Page 98: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Complexity in Bufferless Deflection Routers
1. Must provide livelock freedom: flits are sorted by age, then assigned in age order to output ports → 43% longer critical path than a buffered router
2. Must reassemble packets upon arrival: reassembly buffers must be sized for the worst case → 4KB per node (8x8 network, 64-byte cache block)

98

Page 99: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Problem 1: Livelock Freedom
[Figure: the same high-level router (inject, eject, crossbar, reassembly buffers); the deflection routing logic is the part that must guarantee livelock freedom]

99

Page 100: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Livelock Freedom in Previous Work
What stops a flit from deflecting forever? All flits are timestamped; the oldest flits are assigned their desired ports → a total order among flits. But what is the cost of this?

100

[Figure: flit age forms a total order → guaranteed progress! New traffic is lowest priority]

Page 101: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Age-Based Priorities are Expensive: Sorting
The router must sort flits by age: a long-latency sort network; three comparator stages for 4 flits

101

[Figure: sort network ordering four flits]

Page 102: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Age-Based Priorities Are Expensive: Allocation
After sorting, flits are assigned to output ports in priority order. The port assignment of younger flits depends on that of older flits → sequential dependence in the port allocator.

102

[Figure: age-ordered flits 1-4 allocated one after another: GRANT flit 1 East; DEFLECT flit 2 North; GRANT flit 3 South; DEFLECT flit 4 West; the set of remaining ports shrinks {N,S,W} → {S,W} → {W}]

Page 103: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Age-Based Priorities Are Expensive
Overall, deflection routing logic based on Oldest-First has a 43% longer critical path than a buffered router.
Question: is there a cheaper way to route while guaranteeing livelock freedom?

103

[Figure: priority sort stage followed by the port allocator]

Page 104: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Solution: Golden Packet for Livelock Freedom
What is really necessary for livelock freedom?
Key Insight: No total order is needed. It is enough to:
1. Pick one flit to prioritize until arrival
2. Ensure any flit is eventually picked

104

[Figure: instead of a total order by flit age, a partial order with a single prioritized “Golden Flit” still guarantees progress; new traffic is lowest priority]

Page 105: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

What Does Golden Flit Routing Require?
Only need to properly route the Golden Flit
First Insight: no need for a full sort
Second Insight: no need for sequential allocation

105

[Figure: the priority sort and port allocator stages that can be simplified]

Page 106: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Golden Flit Routing With Two Inputs
Let's route the Golden Flit in a two-input router first.
Step 1: pick a “winning” flit: the Golden Flit if present, else a random one
Step 2: steer the winning flit to its desired output and deflect the other flit
→ The Golden Flit always routes toward its destination (a sketch follows below).

106
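A hedged sketch of such a two-input block (the types and the random tie-break representation are illustrative assumptions; the actual router decides with simple combinational logic):

import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Flit:
    golden: bool
    wants_top: bool            # True if the top output is its productive port

def two_input_block(a: Optional[Flit], b: Optional[Flit]):
    """Return (top_out, bottom_out); the losing flit, if any, is deflected."""
    flits = [f for f in (a, b) if f is not None]
    if not flits:
        return None, None
    # Step 1: pick the winner -- the golden flit if present, else random.
    winner = next((f for f in flits if f.golden), random.choice(flits))
    loser = None if len(flits) == 1 else flits[flits.index(winner) ^ 1]
    # Step 2: steer the winner to its desired output; deflect the other.
    return (winner, loser) if winner.wants_top else (loser, winner)

top, bottom = two_input_block(Flit(golden=True, wants_top=False),
                              Flit(golden=False, wants_top=False))
print(top, bottom)   # golden flit takes the bottom port; the other is deflected up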

Page 107: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Golden Flit Routing with Four Inputs

107

[Figure: two stages of two-input blocks permute the N/E/S/W inputs to the N/E/S/W outputs]
Each block makes decisions independently! Deflection is a distributed decision.

Page 108: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Permutation Network Operation

108

[Figure: the permutation network in action: at each two-input block the winning flit (golden first) decides swap or no swap, the losing flit is deflected, and the priority-sort and sequential port-allocator stages are eliminated]

Page 109: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Problem 2: Packet Reassembly

109

[Figure: inject/eject path through the reassembly buffers]

Page 110: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Reassembly Buffers are Large
Worst case: every node sends a packet to one receiver. Why can't we make reassembly buffers smaller?

110

[Figure: N sending nodes (Node 0, Node 1, …, Node N-1), one packet in flight per node, all destined for one receiver → O(N) space!]

Page 111: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Small Reassembly Buffers Cause Deadlock
What happens when the reassembly buffer is too small?

111

[Figure: many senders, one receiver; the reassembly buffer is full, so the receiver cannot eject; the remaining flits must inject for forward progress, but the network is full, so no new traffic can be injected → deadlock]

Page 112: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Reserve Space to Avoid Deadlock?
What if every sender asks permission from the receiver before it sends? → adds additional delay to every request

112

[Figure: sender-receiver handshake against the reassembly buffers: 1. Reserve slot → 2. ACK (slot reserved) → 3. Send packet]

Page 113: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Escaping Deadlock with Retransmissions
The sender is optimistic instead: assume the buffer is free. If not, the receiver drops and NACKs; the sender retransmits.
+ no additional delay in the best case
- retransmit buffering overhead for all packets
- potentially many retransmits

113

[Figure: sender keeps retransmit buffers: 1. Send (2 flits) → 2. Drop, NACK → 3. Other packet completes → 4. Retransmit packet → 5. ACK → 6. Sender frees data]

Page 114: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Solution: Retransmitting Only Once
Key Idea: Retransmit only when space becomes available. The receiver drops a packet if full, and notes which packet it drops. When space frees up, the receiver reserves space so the retransmit is successful, and notifies the sender to retransmit (a protocol sketch follows below).

114

[Figure: retransmit-once handshake between the sender's retransmit buffers and the receiver's reassembly buffers: NACK, slot later Reserved, receiver keeps Pending: Node 0 Req 0]
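A toy sketch of the Retransmit-Once handshake as a receiver-side state machine (the class and message names are illustrative assumptions, not the paper's protocol encoding):

from collections import deque

class RetransmitOnceReceiver:
    def __init__(self, slots: int):
        self.free_slots = slots
        self.pending = deque()          # (sender, packet) pairs dropped once

    def on_packet(self, sender, packet, reserved=False):
        if reserved:
            return "ACK"                # a slot was already reserved for it
        if self.free_slots == 0:
            self.pending.append((sender, packet))   # drop it, but note it
            return "NACK"
        self.free_slots -= 1
        return "ACK"

    def on_slot_freed(self):
        self.free_slots += 1
        if self.pending:                # reserve the slot, request retransmit
            self.free_slots -= 1
            return ("RETRANSMIT",) + self.pending.popleft()
        return None

rx = RetransmitOnceReceiver(slots=1)
print(rx.on_packet("node3", "req0"))                  # ACK
print(rx.on_packet("node5", "req1"))                  # NACK: full, noted
print(rx.on_slot_freed())                             # ('RETRANSMIT', 'node5', 'req1')
print(rx.on_packet("node5", "req1", reserved=True))   # ACK: guaranteed to fit

Because the receiver reserves the freed slot before asking for the retransmit, the second transmission can never be dropped, so each packet is retransmitted at most once.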

Page 115: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Using MSHRs as Reassembly Buffers

115

[Figure: the node's miss buffers (MSHRs) double as the reassembly buffers on the inject/eject path]
✓ Using miss buffers for reassembly makes this a truly bufferless network.

Page 116: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

CHIPPER: Cheap Interconnect Partially-Permuting Router

116

[Figure: baseline bufferless deflection router (inject/eject, deflection routing logic, crossbar, reassembly buffers) annotated with CHIPPER's two fixes: large worst-case reassembly buffers → Retransmit-Once with cache (MSHR) buffers; long critical path (1. sort by age, 2. allocate ports sequentially) → Golden Packet with a permutation network]

Page 117: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

CHIPPER: Cheap Interconnect Partially-Permuting Router

117

[Figure: the resulting CHIPPER datapath: inject/eject through the node's miss buffers (MSHRs)]

Page 118: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

EVALUATION

118

Page 119: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology
Multiprogrammed workloads: CPU2006, server, desktop; 8x8 (64 cores), 39 homogeneous and 10 mixed sets
Multithreaded workloads: SPLASH-2, 16 threads; 4x4 (16 cores), 5 applications
System configuration:
Buffered baseline: 2-cycle router, 4 VCs/channel, 8 flits/VC
Bufferless baseline: 2-cycle latency, FLIT-BLESS
Instruction-trace driven, closed-loop, 128-entry OoO window
64KB L1, perfect L2 (stresses the interconnect), XOR mapping

119

Page 120: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology Hardware modeling

Verilog models for CHIPPER, BLESS, buffered logic Synthesized with commercial 65nm library

ORION for crossbar, buffers and links

Power Static and dynamic power from hardware models Based on event counts in cycle-accurate simulations

120

Page 121: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Results: Performance Degradation

121

[Charts: normalized speedup for multithreaded workloads (luc, cholesky, radix, fft, lun, AVG) and weighted speedup for multiprogrammed workloads (perlbench, tonto, gcc, h264ref, vpr, search.1, MIX.5, MIX.2, MIX.8, MIX.0, MIX.6, GemsFDTD, stream, mcf, AVG over the full set of 49) for Buffered, BLESS, and CHIPPER; callouts: 1.8%, 3.6%, 13.6%, 49.8%]
✓ Minimal loss for low-to-medium-intensity workloads

Page 122: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Results: Power Reduction

122

[Charts: network power (W) for Buffered, BLESS, and CHIPPER on the multiprogrammed workloads (subset of 49 total) and the multithreaded workloads; callouts: 54.9%, 73.4%]
✓ Removing buffers → majority of the power savings
✓ Slight savings from BLESS to CHIPPER

Page 123: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Results: Area and Critical Path Reduction

123

[Charts: normalized router area and normalized critical path for Buffered, BLESS, and CHIPPER; area callouts: -36.2%, -29.1%; critical path callouts: +1.1%, -1.6%]
✓ CHIPPER maintains the area savings of BLESS
✓ The critical path becomes competitive with buffered routers

Page 124: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conclusions
Two key issues in bufferless deflection routing: livelock freedom and packet reassembly
Bufferless deflection routers were high-complexity and impractical:
Oldest-first prioritization → long critical path in the router
No end-to-end flow control for reassembly → prone to deadlock with reasonably-sized reassembly buffers
CHIPPER is a new, practical bufferless deflection router:
Golden packet prioritization → short critical path in the router
Retransmit-once protocol → deadlock-free packet reassembly
Cache miss buffers as reassembly buffers → truly bufferless network
CHIPPER frequency is comparable to buffered routers at much lower area and power cost, and with minimal performance loss

124

Page 125: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun, and Onur Mutlu, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect," Proceedings of the 6th ACM/IEEE International Symposium on Networks on Chip (NOCS), Lyngby, Denmark, May 2012. Slides (pptx) (pdf)

Page 126: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Bufferless Deflection Routing
Key idea: Packets are never buffered in the network. When two packets contend for the same link, one is deflected.
Removing buffers yields significant benefits: reduces power (CHIPPER: reduces NoC power by 55%); reduces die area (CHIPPER: reduces NoC area by 36%)
But, at high network utilization (load), bufferless deflection routing causes unnecessary link & router traversals → reduces network throughput and application performance, increases dynamic power
Goal: Improve the high-load performance of low-cost deflection networks by reducing the deflection rate.

126

Page 127: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk Motivation

Background: Bufferless Deflection Routing

MinBD: Reducing Deflections Addressing Link Contention Addressing the Ejection Bottleneck Improving Deflection Arbitration

Results Conclusions

127

Page 128: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk Motivation

Background: Bufferless Deflection Routing

MinBD: Reducing Deflections Addressing Link Contention Addressing the Ejection Bottleneck Improving Deflection Arbitration

Results Conclusions

128

Page 129: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Issues in Bufferless Deflection Routing
Correctness: deliver all packets without livelock → CHIPPER¹: Golden Packet: globally prioritize one packet until delivered
Correctness: reassemble packets without deadlock → CHIPPER¹: Retransmit-Once
Performance: avoid performance degradation at high load → MinBD

129

¹ Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.

Page 130: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Key Performance Issues
1. Link contention: no buffers to hold traffic → any link contention causes a deflection → use side buffers
2. Ejection bottleneck: only one flit can eject per router per cycle → simultaneous arrival causes deflection → eject up to 2 flits/cycle
3. Deflection arbitration: practical (fast) deflection arbiters deflect unnecessarily → new priority scheme (silver flit)

130

Page 131: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk

 Motivation
 Background: Bufferless Deflection Routing
 MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
 Results
 Conclusions

131


Page 133: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Addressing Link Contention

Problem 1: Any link contention causes a deflection

Buffering a flit can avoid a deflection on contention. But input buffers are expensive:
 All flits are buffered on every hop → high dynamic energy
 Large buffers necessary → high static energy and large area

Key Idea 1: add a small buffer to a bufferless deflection router to buffer only flits that would have been deflected

133

Page 134: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

How to Buffer Deflected Flits

134

[Figure: baseline bufferless router1 (Eject/Inject ports); when two flits contend, the loser is DEFLECTED away from its destination]

1 Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.

Page 135: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

How to Buffer Deflected Flits

135

[Figure: side-buffered router (Eject/Inject); a flit that would have been DEFLECTED is diverted into the side buffer instead]

Step 1. Remove up to one deflected flit per cycle from the outputs.
Step 2. Buffer this flit in a small FIFO “side buffer.”
Step 3. Re-inject this flit into the pipeline when a slot is available.
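A minimal cycle-level sketch of the three steps above, under assumed (illustrative) class and signal names:

```python
from collections import deque

class SideBuffer:
    """Minimal sketch of MinBD's side buffer (names are illustrative).

    Each cycle: pull at most one deflected flit out of the router outputs
    into a small FIFO, and re-inject the oldest buffered flit whenever an
    input slot is free.
    """
    def __init__(self, capacity=4):          # MinBD evaluates a 4-flit buffer
        self.fifo = deque()
        self.capacity = capacity

    def cycle(self, deflected_flits, input_slot_free):
        # Steps 1+2: buffer up to one deflected flit if there is room.
        if deflected_flits and len(self.fifo) < self.capacity:
            self.fifo.append(deflected_flits.pop(0))  # this flit avoids deflection
        # Step 3: re-inject the oldest buffered flit when a slot opens.
        reinjected = self.fifo.popleft() if (self.fifo and input_slot_free) else None
        return deflected_flits, reinjected

buf = SideBuffer()
remaining, back_in = buf.cycle(["flit0", "flit1"], input_slot_free=True)
```

The key point is that only flits that would otherwise be deflected ever enter the buffer, which is why a small 4-flit FIFO suffices.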

Page 136: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Why Could A Side Buffer Work Well?

Buffer some flits and deflect other flits at a per-flit level

Relative to bufferless routers, the deflection rate drops (need not deflect all contending flits):
 a 4-flit buffer reduces the deflection rate by 39%

Relative to buffered routers, the buffer is used more efficiently (need not buffer all flits):
 similar performance with 25% of the buffer space

136

Page 137: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk

 Motivation
 Background: Bufferless Deflection Routing
 MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
 Results
 Conclusions

137

Page 138: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Addressing the Ejection Bottleneck

Problem 2: Flits deflect unnecessarily because only one flit can eject per router per cycle

In 20% of all ejections, ≥ 2 flits could have ejected → all but one flit must deflect and try again; these deflected flits cause additional contention

An ejection width of 2 flits/cycle reduces the deflection rate by 21%

Key Idea 2: Reduce deflections due to a single-flit ejection port by allowing two flits to eject per cycle

138

Page 139: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Addressing the Ejection Bottleneck

139

[Figure: single-width ejection (Eject/Inject); when two flits arrive for the local node, the second is DEFLECTED]

Page 140: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Addressing the Ejection Bottleneck

140

[Figure: dual-width ejection (Eject/Inject); both arriving flits eject]

For fair comparison, the baseline routers are given dual-width ejection for performance (not for power/area)

Page 141: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk

 Motivation
 Background: Bufferless Deflection Routing
 MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
 Results
 Conclusions

141

Page 142: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Improving Deflection Arbitration

Problem 3: Deflections occur unnecessarily because fast arbiters must use simple priority schemes

Age-based priorities (several past works): a full priority order gives fewer deflections, but requires slow arbiters

State-of-the-art deflection arbitration (Golden Packet & a two-stage permutation network):
 Prioritize one packet globally (ensures forward progress)
 Arbitrate other flits randomly (fast critical path)

The random common case leads to uncoordinated arbitration

142

Page 143: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Fast Deflection Routing Implementation

Let’s route in a two-input router first:
 Step 1: pick a “winning” flit (Golden Packet, else random)
 Step 2: steer the winning flit to its desired output and deflect the other flit

The highest-priority flit always routes toward its destination

143
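A sketch of one such two-input steering block, assuming golden status is tested by a caller-supplied predicate (the representation is illustrative):

```python
import random

def steer(flit_a, flit_b, is_golden):
    """Two-input arbiter block of a fast deflection router (sketch).

    Each flit is (flit_id, wants_port) with wants_port in {0, 1}.
    Pick a winner (golden first, else random), send it to the port it
    wants, and deflect the other flit to the remaining port.
    """
    a, b = flit_a, flit_b
    if is_golden(b[0]) and not is_golden(a[0]):
        a, b = b, a                      # golden flit wins
    elif not is_golden(a[0]) and random.random() < 0.5:
        a, b = b, a                      # no golden flit: random winner
    ports = {a[0]: a[1], b[0]: 1 - a[1]}  # winner gets its port; loser gets the other
    return ports

# A 4-input router chains these blocks in a two-stage permutation network,
# so deflection becomes a distributed, local decision.
print(steer(("A", 0), ("B", 0), is_golden=lambda f: f == "B"))
```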

Page 144: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Fast Deflection Routing with Four Inputs

144

Each block makes decisions independently → deflection is a distributed decision

[Figure: inputs N, E, S, W pass through a two-stage network of two-input steering blocks to outputs N, S, E, W]

Page 145: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Unnecessary Deflections in Fast Arbiters

How does lack of coordination cause unnecessary deflections?
 1. No flit is golden (pseudorandom arbitration)
 2. Red flit wins at the first stage
 3. Green flit loses at the first stage (must be deflected now)
 4. Red flit loses at the second stage; Red and Green are both deflected

145

[Figure: all flits have equal priority, so neither reaches its destination: an unnecessary deflection!]

Page 146: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Improving Deflection Arbitration

Key Idea 3: Add a priority level and prioritize one flit, to ensure that at least one flit is not deflected in each cycle

Highest priority: one Golden Packet in the network
 Chosen in a static round-robin schedule
 Ensures correctness

Next-highest priority: one silver flit per router per cycle
 Chosen pseudo-randomly & local to one router
 Enhances performance

146

Page 147: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Adding A Silver Flit

Randomly picking a silver flit ensures that one flit is not deflected:
 1. No flit is golden, but the Red flit is silver
 2. Red flit wins at the first stage (silver)
 3. Green flit is deflected at the first stage
 4. Red flit wins at the second stage (silver); it is not deflected

147

[Figure: the red (silver) flit has higher priority than the other, equal-priority flits, so at least one flit is not deflected]
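The resulting arbitration can be sketched as a two-level comparison key, assuming each flit carries a packet ID and a per-router silver flag (field names are illustrative):

```python
import random

def priority_key(flit, golden_packet_id):
    """Two-level priority used to pick the winner at each arbiter block
    (sketch; flit is a dict with hypothetical fields 'packet_id' and 'silver').

    Golden beats silver, silver beats everyone else, and remaining ties
    are broken pseudo-randomly -- so the comparison stays fast enough for
    a short critical path.
    """
    return (flit["packet_id"] == golden_packet_id,  # correctness: livelock freedom
            flit["silver"],                          # performance: one undeflected flit
            random.random())                         # fast random tie-break

flits = [{"packet_id": 7, "silver": False}, {"packet_id": 3, "silver": True}]
winner = max(flits, key=lambda f: priority_key(f, golden_packet_id=9))
print(winner)  # the silver flit wins since neither is golden
```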

Page 148: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Minimally-Buffered Deflection Router

148

[Figure: MinBD router datapath (Eject/Inject) combining all three mechanisms]

Problem 1: Link Contention → Solution 1: Side Buffer
Problem 2: Ejection Bottleneck → Solution 2: Dual-Width Ejection
Problem 3: Unnecessary Deflections → Solution 3: Two-level priority scheme


Page 150: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline: This Talk

 Motivation
 Background: Bufferless Deflection Routing
 MinBD: Reducing Deflections
  Addressing Link Contention
  Addressing the Ejection Bottleneck
  Improving Deflection Arbitration
 Results
 Conclusions

150

Page 151: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology: Simulated System

Chip Multiprocessor Simulation
 64-core and 16-core models
 Closed-loop core/cache/NoC cycle-level model
 Directory cache coherence protocol (SGI Origin-based)
 64KB L1, perfect L2 (stresses the interconnect), XOR-mapping

Performance metric: Weighted Speedup (similar conclusions from network-level latency)

Workloads: multiprogrammed SPEC CPU2006
 75 randomly-chosen workloads
 Binned into network-load categories by average injection rate

151
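For reference, weighted speedup combines each application's IPC in the shared run with its IPC when running alone; a minimal sketch:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Standard multiprogrammed-performance metric (sketch).

    WS = sum_i IPC_i(shared) / IPC_i(alone): each application's speed
    relative to running alone on the same system, summed over all cores.
    """
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Two applications, each running at 80% of its alone-run IPC -> WS = 1.6
print(weighted_speedup([0.8, 1.6], [1.0, 2.0]))
```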

Page 152: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology: Routers and Network

Input-buffered virtual-channel routers
 8 VCs, 8 flits/VC [Buffered (8,8)]: large buffered router
 4 VCs, 4 flits/VC [Buffered (4,4)]: typical buffered router
 4 VCs, 1 flit/VC [Buffered (4,1)]: smallest deadlock-free router
 All power-of-2 buffer sizes up to (8, 8) for the perf/power sweep

Bufferless deflection router: CHIPPER1

Bufferless-buffered hybrid router: AFC2
 Has input buffers and deflection routing logic
 Performs coarse-grained (multi-cycle) mode switching

Common parameters
 2-cycle router latency, 1-cycle link latency
 2D-mesh topology (16-node: 4x4; 64-node: 8x8)
 Dual ejection assumed for the baseline routers (for perf. only)

152

1 Fallin et al., “CHIPPER: A Low-complexity Bufferless Deflection Router”, HPCA 2011.
2 Jafri et al., “Adaptive Flow Control for Robust Performance and Energy”, MICRO 2010.

Page 153: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology: Power, Die Area, Crit. Path

Hardware modeling
 Verilog models for CHIPPER, MinBD, and buffered control logic, synthesized with a commercial 65nm library
 ORION 2.0 for the datapath: crossbar, muxes, buffers, and links

Power
 Static and dynamic power from the hardware models
 Based on event counts in cycle-accurate simulations
 Broken down into buffer, link, and other

153

Page 154: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Reduced Deflections & Improved Perf.

154

[Figure: weighted speedup (12–15) for Baseline, B (Side Buffer), D (Dual-Eject), S (Silver Flits), B+D, and B+S+D (MinBD); deflection rates: 28%, 17%, 22%, 27%, 11%, 10%]

1. All mechanisms individually reduce deflections

2. The side buffer alone is not sufficient for performance (the ejection bottleneck remains)

3. Overall: 5.8% over baseline and 2.7% over dual-eject, by reducing deflections 64% / 54%

Page 155: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Overall Performance Results

155

[Figure: weighted speedup vs. injection-rate category (0.0–0.15, 0.15–0.30, 0.30–0.40, 0.40–0.50, >0.50, AVG) for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4]

• Improves 2.7% over CHIPPER (8.1% at high load)
• Similar perf. to Buffered (4,1) at 25% of the buffering space
• Within 2.7% of Buffered (4,4) (8.3% at high load)

Page 156: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Overall Power Results

156

[Figure: network power (W) for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, AFC (4,4), and MinBD-4, broken down three ways: dynamic vs. static, buffer vs. non-buffer, and {dynamic, static} × {buffer, link, other}]

• Buffers are a significant fraction of power in the baseline routers
• Buffer power is much smaller in MinBD (4-flit buffer)
• Dynamic power increases with deflection routing
• Dynamic power is lower in MinBD than in CHIPPER

Page 157: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Performance-Power Spectrum

157

[Figure: weighted speedup (13.0–15.0) vs. network power (0.5–3.0 W) for Buf (1,1), Buf (4,1), Buf (4,4), Buf (8,8), AFC, CHIPPER, and MinBD; MinBD sits at the more-perf-per-power end of the spectrum]

• MinBD is the most energy-efficient (perf/watt) of any evaluated network router design

Page 158: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Die Area and Critical Path

158

[Figure: normalized die area and normalized critical path for Buffered (8,8), Buffered (4,4), Buffered (4,1), CHIPPER, and MinBD]

• Area: only a 3% increase over CHIPPER (4-flit buffer); a 36% reduction from Buffered (4,4)
• Critical path: increases by 7% over CHIPPER and 8% over Buffered (4,4)

Page 159: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conclusions

Bufferless deflection routing offers reduced power & area, but a high deflection rate hurts performance at high load

MinBD (Minimally-Buffered Deflection Router) introduces:
 A side buffer to hold only flits that would have been deflected
 Dual-width ejection to address the ejection bottleneck
 Two-level prioritization to avoid unnecessary deflections

MinBD yields reduced power (31%) & reduced area (36%) relative to buffered routers

MinBD yields improved performance (8.1% at high load) relative to bufferless routers → closes half of the perf. gap

MinBD has the best energy efficiency of all evaluated designs, with competitive performance

159

Page 160: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

160

More Readings

Studies of congestion and congestion control in on-chip vs. internet-like networks:

George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects," Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. Slides (pptx)

George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu, "Next Generation On-Chip Networks: What Kind of Congestion Control Do We Need?," Proceedings of the 9th ACM Workshop on Hot Topics in Networks (HOTNETS), Monterey, CA, October 2010. Slides (ppt) (key)

Page 161: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks

Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks," Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, October 2012. Slides (pptx) (pdf)

Page 162: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Executive Summary

• Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing performance
• Observations:
 1) Some applications are more sensitive to network latency than others
 2) Applications must be throttled differently to achieve peak performance
• Key Idea: Heterogeneous Adaptive Throttling (HAT)
 1) Application-aware source throttling
 2) Network-load-aware throttling rate adjustment
• Result: Improves performance and energy efficiency over state-of-the-art source throttling policies

162

Page 163: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results

163

Page 164: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

On-Chip Networks

164

R

PE

R

PE

R

PE

R

PE

R

PE

R

PE

R

PE

R

PE

R

PE

R Router

Processing Element(Cores, L2 Banks, Memory Controllers, etc)PE

• Connect cores, caches, memory controllers, etc

• Packet switched• 2D mesh: Most commonly used topology• Primarily serve cache misses and

memory requests• Router designs

– Buffered: Input buffers to hold contending packets

– Bufferless: Misroute (deflect)contending packets

Page 165: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Network Congestion Reduces Performance

165

[Figure: mesh of routers (R) and processing elements (PE); packets (P) back up at a congested router]

Network congestion: network throughput drops → application performance drops

Limited shared resources (buffers and links)
• Design constraints: power, chip area, and timing

Page 166: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Goal

• Improve performance in a highly congested NoC
• Reducing network load decreases network congestion, hence improves performance
• Approach: source throttling to reduce network load
 – Temporarily delay new traffic injection
• Naïve mechanism: throttle every single node

166

Page 167: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Key Observation #1

167

[Figure: normalized performance of mcf, gromacs, and the overall system under “throttle gromacs” vs. “throttle mcf”; throttling mcf gains 9% system performance, throttling gromacs loses 2%]

Different applications respond differently to changes in network latency
 mcf is network-intensive: throttling mcf reduces congestion
 gromacs is network-non-intensive: it is more sensitive to network latency

Throttling network-intensive applications benefits system performance more

Page 168: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Key Observation #2

168

[Figure: performance (weighted speedup) vs. throttling rate (80–100%) for three workloads; their peaks fall at 90%, 92%, and 94%]

Different workloads achieve peak performance at different throttling rates

Dynamically adjusting the throttling rate yields better performance than a single static rate

Page 169: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results

169

Page 170: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: dynamically adjust the throttling rate to adapt to different workloads

170


Page 172: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Application-Aware Throttling

1. Measure network intensity
 Use L1 MPKI (misses per thousand instructions) to estimate network intensity

2. Classify applications
 Sort applications by L1 MPKI; the lowest-MPKI applications whose MPKI sum stays below NonIntensiveCap are classified network-non-intensive, the rest network-intensive

3. Throttle the network-intensive applications

172

[Figure: applications sorted by increasing L1 MPKI; those with Σ MPKI < NonIntensiveCap are network-non-intensive; the higher-MPKI rest are network-intensive and get throttled]
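A minimal sketch of this classification step, assuming per-application L1 MPKI values have been measured (names and values are illustrative):

```python
def classify_and_throttle(mpki, cap):
    """HAT's application-aware classification (sketch; names illustrative).

    mpki: {app: L1 MPKI}. Walk applications from least to most
    network-intensive, admitting them to the non-intensive class while the
    running MPKI sum stays under the cap; everyone else gets throttled.
    """
    non_intensive, throttled, total = set(), set(), 0.0
    for app in sorted(mpki, key=mpki.get):
        if total + mpki[app] < cap:
            non_intensive.add(app)
            total += mpki[app]
        else:
            throttled.add(app)
    return non_intensive, throttled

# gromacs-like apps stay unthrottled; mcf-like apps are throttled
print(classify_and_throttle({"gromacs": 1.2, "mcf": 40.0, "lbm": 25.0}, cap=5.0))
```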

Page 173: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Heterogeneous Adaptive Throttling (HAT)

1. Application-aware throttling: throttle network-intensive applications that interfere with network-non-intensive applications

2. Network-load-aware throttling rate adjustment: dynamically adjust the throttling rate to adapt to different workloads

173

Page 174: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Dynamic Throttling Rate Adjustment

• For a given network design, peak performance tends to occur at a fixed network load point

• Dynamically adjust throttling rate to achieve that network load point

174

Page 175: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Dynamic Throttling Rate Adjustment

• Goal: maintain the network load at a peak-performance point

1. Measure network load
2. Compare and adjust the throttling rate:
 If network load > peak point: increase the throttling rate
 Else (network load ≤ peak point): decrease the throttling rate

175

Page 176: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Epoch-Based Operation

• Continuous HAT operation is expensive
• Solution: perform HAT at epoch granularity (epochs of 100K cycles)

During an epoch:
 1) Measure the L1 MPKI of each application
 2) Measure the network load

At the beginning of each epoch:
 1) Classify applications
 2) Adjust the throttling rate
 3) Reset the measurements

176
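A sketch of the epoch-boundary rate update; the peak-load target and adjustment step are assumed tuning knobs, not values from the paper:

```python
def hat_epoch_update(rate, network_load, peak_load, step=0.02):
    """Network-load-aware throttling rate adjustment at an epoch boundary
    (sketch; peak_load and step are assumed parameters). The throttling
    rate is the fraction of cycles a throttled application is blocked
    from injecting, so raising it lowers network load."""
    if network_load > peak_load:
        return min(rate + step, 1.0)   # too congested: throttle harder
    return max(rate - step, 0.0)       # at/below the peak point: admit more

rate = 0.90
for load in [0.55, 0.48, 0.45]:        # measured network load per epoch
    rate = hat_epoch_update(rate, load, peak_load=0.50)
print(round(rate, 2))                  # 0.90 -> 0.92 -> 0.90 -> 0.88
```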

Page 177: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results

177

Page 178: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Prior Source Throttling Works

• Source throttling for bufferless NoCs [Nychis+ Hotnets’10, SIGCOMM’12]
 – Application-aware throttling based on starvation rate
 – Does not adaptively adjust the throttling rate
 – “Heterogeneous Throttling”

• Source throttling for off-chip buffered networks [Thottethodi+ HPCA’01]
 – Dynamically triggers throttling based on the fraction of buffer occupancy
 – Not application-aware: fully blocks packet injection at every node
 – “Self-tuned Throttling”

178

Page 179: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline
• Background and Motivation
• Mechanism
• Prior Works
• Results

179

Page 180: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Methodology

• Chip Multiprocessor Simulator
 – 64-node multi-core systems with a 2D-mesh topology
 – Closed-loop core/cache/NoC cycle-level model
 – 64KB L1, perfect L2 (always hits, to stress the NoC)
• Router Designs
 – Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS’92]
 – Bufferless deflection router: BLESS [Moscibroda+ ISCA’09]
• Workloads
 – 60 multi-core workloads: SPEC CPU2006 benchmarks
 – Categorized into Low/Medium/High network-intensity categories
• Metrics: Weighted Speedup (perf.), perf./Watt (energy eff.), and maximum slowdown (fairness)

180

Page 181: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Performance: Bufferless NoC (BLESS)

181

[Figure: weighted speedup across workload categories (HL, HML, HM, H, amean) for BLESS, Hetero., and HAT; HAT improves the average by 7.4%]

HAT provides a better performance improvement than past work

The highest improvement is on heterogeneous workload mixes: L and M applications are more sensitive to network latency

Page 182: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Performance: Buffered NoC

182

[Figure: weighted speedup across workload categories (HL, HML, HM, H, amean) for Buffered, Self-Tuned, Hetero., and HAT; HAT improves the average by 3.5%]

Congestion is much lower in a buffered NoC, but HAT still provides a performance benefit

Application Fairness

183

[Figure: normalized maximum slowdown (amean) for BLESS vs. Hetero. vs. HAT, and for Buffered vs. Self-Tuned vs. Hetero. vs. HAT; HAT reduces maximum slowdown by 15% and 5%, respectively]

HAT provides better fairness than prior works

Page 184: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Network Energy Efficiency

184

[Figure: normalized perf. per Watt for BLESS and Buffered, baseline vs. HAT; HAT improves energy efficiency by 8.5% and 5%, respectively]

HAT increases energy efficiency by reducing congestion

Page 185: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Other Results in Paper

• Performance on CHIPPER

• Performance on multithreaded workloads

• Parameter sensitivity sweep of HAT

185

Page 186: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conclusion

• Problem: Packets contend in on-chip networks (NoCs), causing congestion, thus reducing performance
• Observations:
 1) Some applications are more sensitive to network latency than others
 2) Applications must be throttled differently to achieve peak performance
• Key Idea: Heterogeneous Adaptive Throttling (HAT)
 1) Application-aware source throttling
 2) Network-load-aware throttling rate adjustment
• Result: Improves performance and energy efficiency over state-of-the-art source throttling policies

186

Page 187: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Application-Aware Packet Scheduling

Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das,"Application-Aware Prioritization Mechanisms for On-Chip Networks"

Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 280-291, New York, NY, December 2009. Slides (pptx)

Page 188: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

188

[Figure: applications App1 … App N+1 run on processors (P) and an accelerator, connected by the on-chip network to L2 cache banks and a memory controller]

The on-chip network is a critical resource shared by multiple applications

Page 189: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

189

[Figure: mesh of routers (R) and processing elements (PE: cores, L2 banks, memory controllers, etc.); detail of one router: input ports with buffers (From East/West/North/South/PE), virtual channels VC 0–2 per port, routing unit (RC), VC allocator (VA), switch allocator (SA), and a 5x5 crossbar to the output ports (To East/West/North/South/PE)]

Page 190: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

190

[Figure: zoom on the router’s input side: VC 0–2 buffers per input port (From East/West/North/South/PE), routing unit (RC), VC allocator (VA), and switch allocator (SA)]

Page 191: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

191

[Figure: conceptual view: the same router input structure redrawn with packets from App1–App8 waiting in the input VCs]

Page 192: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

192

[Figure: conceptual view with a scheduler choosing among packets from App1–App8 waiting in the input VCs]

Which packet to choose?

Page 193: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

The Problem: Packet Scheduling

Existing scheduling policies: Round Robin, Age

Problem 1: Local to a router
 Leads to contradictory decision making between routers: packets from one application may be prioritized at one router, only to be delayed at the next

Problem 2: Application-oblivious
 Treats all applications’ packets equally, but applications are heterogeneous

Solution: application-aware global scheduling policies

193

Page 194: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Motivation: Stall-Time Criticality

Applications are not homogeneous: they have different criticality with respect to the network
 Some applications are network-latency-sensitive
 Some applications are network-latency-tolerant

An application’s Stall-Time Criticality (STC) can be measured by its average network stall time per packet (i.e., NST/packet)
 Network Stall Time (NST) is the number of cycles the processor stalls waiting for network transactions to complete

194

Page 195: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Motivation: Stall-Time Criticality

Why do applications have different network stall-time criticality (STC)?

Memory-Level Parallelism (MLP): lower MLP leads to higher criticality

Shortest-Job-First Principle (SJF): lower network load leads to higher criticality

195

Page 196: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Principle 1: MLP

Observation 1: Packet latency != network stall time

196

[Figure: an application with high MLP overlaps the latencies of multiple outstanding packets with compute; the stall contributed by the red (last) packet is 0]

Page 197: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Principle 1: MLP

Observation 1: Packet latency != network stall time

Observation 2: A low-MLP application’s packets have higher criticality than a high-MLP application’s

197

[Figure: with high MLP, packet latencies overlap and little stall remains; with low MLP, each packet’s latency turns into stall time]

Page 198: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Principle 2: Shortest-Job-First

198

[Figure: a light application and a heavy application, shown running alone, under baseline (RR) scheduling, and under SJF scheduling; under RR the light application sees a 4X network slowdown and the heavy one 1.2X, while under SJF the light application’s slowdown falls to 1.3X and the heavy one’s rises only to 1.6X]

Overall system throughput (weighted speedup) increases by 34%

Page 199: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Solution: Application-Aware Policies

Idea: identify critical applications (i.e., network-sensitive applications) and prioritize their packets in each router

Key components of the scheduling policy:
 Application Ranking
 Packet Batching

Proposes a low-hardware-complexity solution

199

Page 200: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Component 1: Ranking

Ranking distinguishes applications based on Stall-Time Criticality (STC)
 Periodically rank applications based on STC

Explored many heuristics for estimating STC
 A heuristic based on outermost-private-cache Misses Per Instruction (L1-MPI) is the most effective
 Low L1-MPI => high STC => higher rank

Why Misses Per Instruction (L1-MPI)?
 Easy to compute (low complexity)
 Stable metric (unaffected by interference in the network)

200

Page 201: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Component 1: How to Rank?

Execution time is divided into fixed “ranking intervals”
 A ranking interval is 350,000 cycles

At the end of an interval, each core calculates its L1-MPI and sends it to the Central Decision Logic (CDL)
 The CDL is located in the central node of the mesh

The CDL forms a rank order and sends each core its rank
 Two control packets per core every ranking interval
 The ranking order is a “partial order”

Rank formation is not on the critical path
 The ranking interval is significantly longer than the rank computation time
 Cores use older rank values until the new ranking is available

201

Page 202: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Component 2: Batching

Problem: Starvation
 Prioritizing a higher-ranked application can starve a lower-ranked application

Solution: Packet Batching
 Network packets are grouped into finite-sized batches
 Packets of older batches are prioritized over younger batches

Time-Based Batching
 New batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles

202

Page 203: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Putting It All Together: STC Scheduling Policy

Before injection into the network, each packet is tagged with
 Batch ID (3 bits)
 Rank ID (3 bits)

Three-tier priority structure at the routers:
 Oldest batch first (prevents starvation)
 Highest rank first (maximizes performance)
 Local round-robin (final tie-breaker)

Simple hardware support: priority arbiters

Globally coordinated scheduling: the ranking order and batching order are the same across all routers

203
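The three-tier comparison can be sketched as an arbiter sort key, assuming each packet carries its batch ID, rank, and a local round-robin position (field names are illustrative):

```python
def stc_priority_key(packet):
    """Three-tier STC priority for a router arbiter (sketch; field names
    are illustrative, not the paper's hardware encoding).

    Sort ascending: lower batch ID (older batch) wins first, then lower
    rank value (higher-ranked application), then local round-robin order.
    """
    return (packet["batch_id"], packet["rank"], packet["rr_order"])

contenders = [
    {"id": "A", "batch_id": 2, "rank": 0, "rr_order": 1},
    {"id": "B", "batch_id": 1, "rank": 5, "rr_order": 0},  # older batch
    {"id": "C", "batch_id": 1, "rank": 2, "rr_order": 2},  # older batch, higher rank
]
winner = min(contenders, key=stc_priority_key)
print(winner["id"])  # "C": oldest batch first, then highest rank
```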

Page 204: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Scheduling Example

204

[Figure: packets from several applications arrive at a router scheduler over injection cycles 1–8 and are grouped into Batch 0, Batch 1, and Batch 2]

Page 205: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Scheduling Example

205

[Figure: the example scheduled with Round Robin; resulting per-application stall cycles: 8, 6, 11 (avg 8.3)]

Page 206: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Scheduling Example

206

[Figure: the example scheduled with Age (oldest-first); stall cycles: 4, 6, 11 (avg 7.0), vs. 8, 6, 11 (avg 8.3) for Round Robin]

Page 207: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Scheduling Example

207

[Figure: the example scheduled with the STC rank order]

Stall cycles (per application, and average):
 RR:  8, 6, 11 → avg 8.3
 Age: 4, 6, 11 → avg 7.0
 STC: 1, 3, 11 → avg 5.0

Page 208: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC Evaluation Methodology

64-core system
 x86 processor model based on Intel Pentium M
 2 GHz processor, 128-entry instruction window
 32KB private L1 and 1MB per-core shared L2 caches, 32 miss buffers
 4GB DRAM, 320-cycle access latency, 4 on-chip DRAM controllers

Detailed Network-on-Chip model
 2-stage routers (with speculation and look-ahead routing)
 Wormhole switching (8-flit data packets)
 Virtual-channel flow control (6 VCs, 5-flit buffer depth)
 8x8 mesh (128-bit bi-directional channels)

Benchmarks
 Multiprogrammed scientific, server, and desktop workloads (35 applications)
 96 workload combinations

208

Page 209: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Comparison to Previous Policies

Round Robin & Age (Oldest-First)
 Local and application-oblivious
 Age is biased towards heavy applications: heavy applications flood the network, so an older packet is more likely to be from a heavy application

Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]
 Provides bandwidth fairness at the expense of system performance
 Penalizes heavy and bursty applications
  Each application gets an equal, fixed quota of flits (credits) in each batch
  A heavy application quickly runs out of credits after injecting into all active batches & stalls until the oldest batch completes and frees up fresh credits
  Underutilization of network resources

209

Page 210: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

STC System Performance and Fairness

9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads)

210

[Figure: normalized system speedup and network unfairness for LocalRR, LocalAge, GSF, and STC]

Page 211: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Enforcing Operating System Priorities

Existing policies cannot enforce operating-system (OS) assigned priorities in a network-on-chip

The proposed framework can enforce OS-assigned priorities
 Weight of applications => ranking of applications
 Configurable batching interval based on application weight

211

[Figure: network slowdowns of xalan-1 through xalan-8 under LocalRR, LocalAge, GSF, and STC, and of a weighted mix (xalan-weight-1, leslie-weight-2, lbm-weight-2, tpcw-weight-8); weighted speedups: 0.49 / 0.49 / 0.46 / 0.52 and 0.46 / 0.44 / 0.27 / 0.43]

Page 212: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

© Onur Mutlu, 2009, 2010

Application-Aware Packet Scheduling: Summary

Packet scheduling policies critically impact the performance and fairness of NoCs
 Existing packet scheduling policies are local and application-oblivious

STC is a new, global, application-aware approach to packet scheduling in NoCs
 Ranking: differentiates applications based on their criticality
 Batching: avoids starvation due to rank-based prioritization

The proposed framework
 provides higher system performance and fairness than existing policies
 can enforce OS-assigned priorities in the network-on-chip

212

Page 213: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Slack-Driven Packet Scheduling

Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das,"Aergia: Exploiting Packet Latency Slack in On-Chip Networks"

Proceedings of the 37th International Symposium on Computer Architecture (ISCA), pages 106-116, Saint-Malo, France, June 2010. Slides (pptx)

Page 214: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Packet Scheduling in NoC

Existing scheduling policies
 Round robin
 Age

Problem
 Treat all packets equally
 Application-oblivious

Packets have different criticality
 A packet is critical if its latency affects the application’s performance
 Criticality differs due to memory-level parallelism (MLP)

All packets are not the same…!!!

Page 215: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

MLP Principle

[Figure: three packet latencies overlap within one compute/stall window; the stall contributed by the last packet is 0]

Packet latency != network stall time

Different packets have different criticality due to MLP: Criticality(first packet) > Criticality(second) > Criticality(third)

Page 216: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline

 Introduction
  Packet Scheduling
  Memory Level Parallelism
 Aergia
  Concept of Slack
  Estimating Slack
 Evaluation
 Conclusion

Page 217: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

What is Aergia?

Aergia is the spirit of laziness in Greek mythology

Some packets can afford to slack!

Page 218: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline

 Introduction
  Packet Scheduling
  Memory Level Parallelism
 Aergia
  Concept of Slack
  Estimating Slack
 Evaluation
 Conclusion

Page 219: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Slack of Packets

What is the slack of a packet?
 The slack of a packet is the number of cycles it can be delayed in a router without (significantly) reducing the application’s performance
 Local network slack

Source of slack: Memory-Level Parallelism (MLP)
 The latency of an application’s packet is hidden from the application due to overlap with the latency of pending cache miss requests

Prioritize packets with lower slack

Page 220: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Concept of Slack

[Figure: in the instruction window, two load misses inject packets into the NoC; the second packet returns earlier than necessary because its latency is hidden behind the first packet’s longer latency]

Slack(P) = Latency(predecessor) – Latency(P) = 26 – 6 = 20 hops

Packet P can be delayed for the available slack cycles without reducing performance!

Page 221: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Prioritizing using Slack

Core A

Core B

Packet

Latency

Slack

13 hops

0 hops

3 hops

10 hops

10 hops

0 hops

4 hops

6 hops

Causes

CausesLoad Miss

Load Miss

Prioritize

Load Miss

Load Miss Causes

Causes

Interference at 3 hops

Slack( ) > Slack ( )

Page 222: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Slack in Applications

[Figure: cumulative distribution of slack (in cycles) over all packets for Gems: 50% of packets have 350+ slack cycles (non-critical), while 10% of packets have <50 slack cycles (critical)]

Page 223: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Slack in Applications

[Figure: the same distribution with art added: 68% of art’s packets have zero slack cycles]

Page 224: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Diversity in Slack

[Figure: slack distributions for 16 applications: Gems, omnet, tpcw, mcf, bzip2, sjbb, sap, sphinx, deal, barnes, astar, calculix, art, libquantum, sjeng, h264ref]

Page 225: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Diversity in Slack

[Figure: the same slack distributions]

Slack varies between packets of different applications

Slack varies between packets of a single application

Page 226: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline

 Introduction
  Packet Scheduling
  Memory Level Parallelism
 Aergia
  Concept of Slack
  Estimating Slack
 Evaluation
 Conclusion

Page 227: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Estimating Slack Priority

Slack(P) = Max(latencies of P’s predecessors) – Latency(P)

Predecessors(P) are the packets of outstanding cache miss requests when P is issued

Packet latencies are not known at issue time, so latency must be predicted. Predicting the latency of any packet Q:
 Higher latency if Q corresponds to an L2 miss
 Higher latency if Q has to travel farther (a larger number of hops)
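A sketch of the slack definition itself, before the hardware approximations introduced next:

```python
def estimate_slack(pred_latencies, my_latency):
    """Aergia's slack definition (sketch): a packet's slack is how much
    longer its longest outstanding predecessor will take than itself.
    """
    if not pred_latencies:          # no outstanding misses: nothing hides P
        return 0
    return max(0, max(pred_latencies) - my_latency)

# The slide's example: predecessors take up to 26 hops, P takes 6
print(estimate_slack([26, 10], my_latency=6))  # 20 hops of slack
```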

Page 228: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Estimating Slack Priority

Slack of P = maximum predecessor latency – latency of P, approximated with three header fields:

 Slack(P) = PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)

 PredL2: set if any predecessor packet is servicing an L2 miss
 MyL2: set if P is NOT servicing an L2 miss
 HopEstimate: Max(# of hops of predecessors) – hops of P

Page 229: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Estimating Slack Priority

How to predict an L2 hit or miss at the core?
 Global-branch-predictor-based L2 miss predictor: uses a pattern history table and 2-bit saturating counters
 Threshold-based L2 miss predictor: if the number of L2 misses among the last “M” misses >= threshold “T”, predict that the next load is an L2 miss

Number of miss predecessors? Taken from the list of outstanding L2 misses

Hops estimate? Hops => ∆X + ∆Y distance; use the predecessor list to calculate the slack hop estimate
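A sketch of the threshold-based predictor, with assumed (illustrative) values for M and T:

```python
from collections import deque

class ThresholdL2MissPredictor:
    """Threshold-based L2 miss predictor from the slide (sketch; the
    concrete M and T values here are made up, not from the paper).

    Track the outcomes of the last M cache misses; if at least T of them
    were L2 misses, predict that the next load will also miss in the L2.
    """
    def __init__(self, m=8, t=4):
        self.window = deque(maxlen=m)  # True = L2 miss, False = L2 hit
        self.t = t

    def record(self, was_l2_miss):
        self.window.append(was_l2_miss)

    def predict_miss(self):
        return sum(self.window) >= self.t

pred = ThresholdL2MissPredictor()
for outcome in [True, True, False, True, True]:
    pred.record(outcome)
print(pred.predict_miss())  # True: 4 L2 misses in the window meet the threshold
```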

Page 230: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Starvation Avoidance

Problem: Starvation
 Prioritizing packets can starve lower-priority packets

Solution: Time-Based Packet Batching
 New batches are formed every T cycles
 Packets of older batches are prioritized over younger batches

Page 231: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Putting it all together

Tag the header of the packet with priority bits before injection:

 Priority(P) = Batch (3 bits) | PredL2 (2 bits) | MyL2 (1 bit) | HopEstimate (2 bits)

Priority(P)?
 P’s batch (highest priority)
 P’s slack
 Local round-robin (final tie breaker)
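A sketch of packing these fields into one comparable tag; the field widths follow the slide, while the packing order within the tag is an assumption:

```python
def aergia_tag(batch, pred_l2, my_l2, hop_estimate):
    """Pack Aergia's priority fields into an 8-bit header tag (sketch;
    field widths follow the slide, the packing order is assumed).

    Comparing tags as integers then gives: older batch first, then lower
    estimated slack (PredL2 | MyL2 | HopEstimate).
    """
    assert batch < 8 and pred_l2 < 4 and my_l2 < 2 and hop_estimate < 4
    return (batch << 5) | (pred_l2 << 3) | (my_l2 << 2) | hop_estimate

# Lower tag value = higher priority: same batch, less slack wins
p = aergia_tag(batch=1, pred_l2=0, my_l2=0, hop_estimate=1)  # low slack
q = aergia_tag(batch=1, pred_l2=2, my_l2=1, hop_estimate=3)  # high slack
print(p < q)  # True: p should be scheduled first
```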

Page 232: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Outline

 Introduction
  Packet Scheduling
  Memory Level Parallelism
 Aergia
  Concept of Slack
  Estimating Slack
 Evaluation
 Conclusion

Page 233: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation Methodology

64-core system
 x86 processor model based on Intel Pentium M
 2 GHz processor, 128-entry instruction window
 32KB private L1 and 1MB per-core shared L2 caches, 32 miss buffers
 4GB DRAM, 320-cycle access latency, 4 on-chip DRAM controllers

Detailed Network-on-Chip model
 2-stage routers (with speculation and look-ahead routing)
 Wormhole switching (8-flit data packets)
 Virtual-channel flow control (6 VCs, 5-flit buffer depth)
 8x8 mesh (128-bit bi-directional channels)

Benchmarks
 Multiprogrammed scientific, server, and desktop workloads (35 applications)
 96 workload combinations

Page 234: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Qualitative Comparison

Round Robin & Age
 Local and application-oblivious
 Age is biased towards heavy applications

Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]
 Provides bandwidth fairness at the expense of system performance
 Penalizes heavy and bursty applications

Application-Aware Prioritization Policies (SJF) [Das et al., MICRO 2009]
 Shortest-Job-First principle
 Packet scheduling policies that prioritize network-sensitive applications, which inject lower load

Page 235: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

System Performance

SJF provides an 8.9% improvement in weighted speedup
Aergia improves system throughput by 10.3%
Aergia+SJF improves system throughput by 16.1%

[Figure: normalized system speedup for Age, RR, GSF, SJF, Aergia, and SJF+Aergia]

Page 236: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Network Unfairness

SJF does not imbalance network fairness
Aergia improves network unfairness by 1.5X
SJF+Aergia improves network unfairness by 1.3X

[Figure: network unfairness for Age, RR, GSF, SJF, Aergia, and SJF+Aergia]

Page 237: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conclusions & Future Directions

Packets have different criticality, yet existing packet scheduling policies treat all packets equally

We propose a new approach to packet scheduling in NoCs
 We define slack as a key measure that characterizes the relative importance of a packet
 We propose Aergia, a novel architecture to accelerate low-slack, critical packets

Results
 Improves system performance: 16.1%
 Improves network fairness: 30.8%

Page 238: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Express-Cube Topologies

Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Express Cube Topologies for On-Chip Interconnects," Proceedings of the 15th International Symposium on High-Performance Computer Architecture (HPCA), pages 163-174, Raleigh, NC, February 2009. Slides (ppt)

Page 239: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

2-D Mesh

[Figure: 2-D mesh topology]

239

Page 240: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

2-D Mesh

Pros
 Low design & layout complexity
 Simple, fast routers

Cons
 Large diameter
 Energy & latency impact

240

Page 241: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Concentration (Balfour & Dally, ICS ‘06)

Pros
 Multiple terminals attached to a router node
 Fast nearest-neighbor communication via the crossbar
 Hop count reduction proportional to the concentration degree

Cons
 Benefits limited by crossbar complexity

241

Page 242: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Concentration

Side-effects
 Fewer channels
 Greater channel width

242

Page 243: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Replication (CMesh-X2)

Benefits
 Restores bisection channel count
 Restores channel width
 Reduced crossbar complexity

243

Page 244: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Flattened Butterfly (Kim et al., Micro ‘07)

Objectives: improve connectivity; exploit the wire budget

244

Page 245: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Flattened Butterfly (Kim et al., Micro ‘07)

[Figure (slides 245–248): the flattened butterfly built up step by step; each router connects to every router in its row and column]

Page 249: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Flattened Butterfly (Kim et al., Micro ‘07)

Pros
 Excellent connectivity
 Low diameter: 2 hops

Cons
 High channel count: k²/2 per row/column
 Low channel utilization
 Increased control (arbitration) complexity

249

Page 250: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multidrop Express Channels (MECS)

Objectives: connectivity; a more scalable channel count; better channel utilization

250

Page 251: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multidrop Express Channels (MECS)

[Figure (slides 251–255): MECS built up step by step; each router drives one multidrop express channel per direction, from which all routers in that row/column can receive]

Page 256: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Multidrop Express Channels (MECS)

Pros
 One-to-many topology
 Low diameter: 2 hops
 k channels per row/column

Cons
 Asymmetric
 Increased control (arbitration) complexity

256

Page 257: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Partitioning: a GEC Example

257

[Figure: partitioning examples: MECS, MECS-X2, Flattened Butterfly, and Partitioned MECS]

Page 258: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Analytical Comparison

258

                  CMesh        FBfly        MECS
Network Size      64    256    64    256    64    256
Radix (conctr’d)  4     8      4     8      4     8
Diameter          6     14     2     2      2     2
Channel count     2     2      8     32     4     8
Channel width     576   1152   144   72     288   288
Router inputs     4     4      6     14     6     14
Router outputs    4     4      6     14     4     4
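The diameter row can be checked directly: a concentrated mesh with a k x k router grid has diameter 2(k-1) hops, while FBfly and MECS reach any router in at most 2 hops (one row hop, one column hop). A small sketch, assuming a concentration degree of 4:

```python
import math

def cmesh_diameter(nodes, concentration=4):
    """Diameter of a concentrated 2-D mesh: routers form a k x k grid with
    k = sqrt(nodes / concentration); the worst-case route crosses (k - 1)
    hops in each dimension."""
    k = int(math.sqrt(nodes // concentration))
    return 2 * (k - 1)

for n in (64, 256):
    # Matches the table: 6 and 14 for CMesh, while FBfly/MECS stay at 2 hops
    print(n, cmesh_diameter(n), 2)
```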

Page 259: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Experimental Methodology

Topologies:         Mesh, CMesh, CMesh-X2, FBFly, MECS, MECS-X2
Network sizes:      64 & 256 terminals
Routing:            DOR, adaptive
Messages:           64 & 576 bits
Synthetic traffic:  uniform random, bit complement, transpose, self-similar
PARSEC benchmarks:  Blackscholes, Bodytrack, Canneal, Ferret, Fluidanimate, Freqmine, Vip, x264
Full-system config: M5 simulator, Alpha ISA, 64 OOO cores
Energy evaluation:  Orion + CACTI 6

259

Page 260: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

64 nodes: Uniform Random

260

[Figure: latency (cycles) vs. injection rate (%) for mesh, cmesh, cmesh-x2, fbfly, mecs, and mecs-x2]

Page 261: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

256 nodes: Uniform Random

261

[Figure: latency (cycles) vs. injection rate (%) for mesh, cmesh-x2, fbfly, mecs, and mecs-x2]

Page 262: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Energy (100K packets, Uniform Random)

262

[Figure: average packet energy (nJ), split into link energy and router energy, for 64-node and 256-node networks]

Page 263: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

64 Nodes: PARSEC

263

[Figure: total network energy (J; router vs. link energy) and average packet latency (cycles) for Blackscholes, Canneal, Vip, and x264]

Page 264: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Summary

MECS
 A new one-to-many topology
 Good fit for planar substrates
 Excellent connectivity
 Effective wire utilization

Generalized Express Cubes
 Framework & taxonomy for NOC topologies
 Extension of the k-ary n-cube model
 Useful for understanding and exploring on-chip interconnect options
 Future: expand & formalize

264

Page 265: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Kilo-NoC: Topology-Aware QoS

Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," Proceedings of the 38th International Symposium on Computer Architecture (ISCA), San Jose, CA, June 2011. Slides (pptx)

Page 266: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Motivation

Extreme-scale chip-level integration
 Cores
 Cache banks
 Accelerators
 I/O logic
 Network-on-chip (NOC)

10-100 cores today; 1000+ assets in the near future

266

Page 267: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Kilo-NOC requirements

High efficiency
 Area
 Energy

Good performance

Strong service guarantees (QoS)

267

Page 268: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Topology-Aware QoS

Problem: QoS support in each router is expensive (in terms of buffering, arbitration, and bookkeeping)
 E.g., Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009.

Goal: Provide QoS guarantees at low area and power cost

Idea:
 Isolate shared resources in a region of the network and support QoS within that area
 Design the topology so that applications can access the region without interference

268

Page 269: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Baseline QOS-enabled CMP

Multiple VMs sharing a die

269

[Figure: die shared by VM #1, VM #2, and VM #3; every router is QOS-enabled (Q); shared resources (e.g., memory controllers) and VM-private resources (cores, caches) are spread across the chip]

Page 270: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conventional NOC QOS

Contention scenarios:
 Shared resources → memory access
 Intra-VM traffic → shared cache access
 Inter-VM traffic → VM page sharing

270

[Figure: the same QOS-enabled die with VM #1–#3]

Page 271: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Conventional NOC QOS

271

[Figure: the same QOS-enabled die and contention scenarios]

Contention scenarios:
 Shared resources → memory access
 Intra-VM traffic → shared cache access
 Inter-VM traffic → VM page sharing

Network-wide guarantees without network-wide QOS support?

Page 272: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Kilo-NOC QOS

Insight: leverage rich network connectivity
 Naturally reduce interference among flows
 Limit the extent of hardware QOS support

Requires a low-diameter topology
 This work: Multidrop Express Channels (MECS) [Grot et al., HPCA 2009]

272

Page 273: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Topology-Aware QOS

Dedicated, QOS-enabled regions
 Rest of die: QOS-free

Richly-connected topology
 Traffic isolation

Special routing rules
 Manage interference

273

[Figure: die with small QOS-enabled regions (Q) serving VM #1–#3; the rest of the die is QOS-free]


Page 277: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Kilo-NOC view

Topology-aware QOS support
 Limits QOS complexity to a fraction of the die

Optimized flow control
 Reduces buffer requirements in the QOS-free regions

277

[Figure: die with QOS-enabled regions (Q) and a QOS-free remainder, shared by VM #1–#3]

Page 278: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Evaluation Methodology

Parameter    Value
Technology   15 nm
Vdd          0.7 V
System       1024 tiles: 256 concentrated nodes (64 shared resources)

Networks:
 MECS+PVC     VC flow control, QOS support (PVC) at each node
 MECS+TAQ     VC flow control, QOS support only in shared regions
 MECS+TAQ+EB  EB flow control outside of SRs, separate request and reply networks
 K-MECS       Proposed organization: TAQ + hybrid flow control

278

Page 279: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Area comparison

279

Page 280: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Energy comparison

280

Page 281: Computer Architecture: Interconnects (Part I) Prof. Onur Mutlu Carnegie Mellon University

Summary

Kilo-NOC: a heterogeneous NOC architecture for kilo-node substrates

Topology-aware QOS
 Limits QOS support to a fraction of the die
 Leverages low-diameter topologies

Improves NOC area- and energy-efficiency
Provides strong guarantees

281