



Energy-Efficient Super Computing
Stanford University

Curt Harting, Vishal Parikh, Milad Mohammadi, Tarun Pondicherry, Prof. William J. Dally

Poster panels: Motivation · Overview · Application Energy Breakdown · Cache Hierarchy · Memory & Communication · Programming System

The goal of the Efficient Supercomputing Project is to significantly reduce the amount of energy consumed executing scientific code while providing programmers an API that allows for productive algorithm implementation. We do this by exposing locality to the programmer, minimizing unnecessary network traffic, and reducing cache contention and metadata overhead.

Goals
- Design a high-performance, efficient architecture that provides parallelism with minimal overhead across 100s-1000s of cores
- Enable faster, more efficient code through software configuration of cache hierarchies and active messages
- Provide a programming system that allows developers to productively implement algorithms that make optimal use of the hardware

Power = E_CoreOp * (Operations/sec) + E_Comm * d * (Bits/sec) + E_Cache * (Accesses/sec)

where d is the distance the bits travel on chip.
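To make the relation concrete, the sketch below evaluates the power model for a single core. Every numeric value (per-event energies and activity rates) is an invented placeholder for illustration, not a measurement from the project.

```cpp
#include <cstdio>

int main() {
    // Hypothetical per-event energies (placeholder values only).
    const double E_CoreOp = 20e-12;  // joules per core operation
    const double E_Comm   = 1e-12;   // joules per bit per unit distance
    const double E_Cache  = 50e-12;  // joules per cache access

    // Hypothetical activity rates for one core.
    const double ops_per_sec      = 2e9;   // Operations/sec
    const double bits_per_sec     = 16e9;  // Bits/sec moved on the network
    const double d                = 4.0;   // average distance (hops) per bit
    const double accesses_per_sec = 1e9;   // cache Accesses/sec

    // Power = E_CoreOp*(Ops/sec) + E_Comm*d*(Bits/sec) + E_Cache*(Accesses/sec)
    const double p_core  = E_CoreOp * ops_per_sec;
    const double p_comm  = E_Comm * d * bits_per_sec;
    const double p_cache = E_Cache * accesses_per_sec;

    std::printf("core %.3f W + comm %.3f W + cache %.3f W = %.3f W\n",
                p_core, p_comm, p_cache, p_core + p_comm + p_cache);
    return 0;
}
```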

TCO of a Data Center [1]

Data Center Capital 8%, Server Capital 28%, Data Center OpEx 8%, Server OpEx 1%, Power Provisioning 16%, Power Overhead 28%, Server Power 11%; power requirements account for 55% of the total.

1. Barroso, L. A., and Hölzle, U. The Datacenter as a Computer. Morgan & Claypool, 2009.

- Consumer demand for computational capabilities is increasing, while power envelopes are stationary or decreasing.
- Scaling of device dimensions and supply voltage used to reduce energy per operation enough; that is no longer the case.
- Architectural innovation is therefore critical to making computers more energy efficient and allowing performance to continue to grow.

Energy breakdown: Core 31%, Caches 10%, DRAM 14%, Network 45%.

Exposed Data Locality

- Software configuration of cache hierarchies improves performance
- Allow the user to configure cache domains
- Provide APIs for pinning data to local storage
- Convert portions of SRAM to non-coherent, locally addressed scratchpad memory (a hypothetical API sketch follows below)
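A minimal sketch of what such a locality interface could look like, assuming hypothetical calls (kapok::pin, kapok::scratchpad_alloc) with stub bodies; this is not the project's actual API.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical interface for exposing data locality to the programmer.
// The names below are illustrative only, not the project's real API; the
// stub bodies simply emulate the calls on an ordinary machine.
namespace kapok {

// Pin a buffer into the cache domain owned by `core` so that loads and
// stores issued by that core stay local. (Stub: no-op here.)
inline void pin(const void* /*data*/, std::size_t /*bytes*/, int /*core*/) {}

// Carve a region of local SRAM out of the coherent hierarchy and use it as a
// non-coherent, locally addressed scratchpad. (Stub: plain heap allocation.)
inline void* scratchpad_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void  scratchpad_free(void* p)            { std::free(p); }

}  // namespace kapok

int main() {
    std::vector<double> v(1 << 20, 1.0);
    // Keep the input pinned near core 0 and accumulate into scratchpad storage.
    kapok::pin(v.data(), v.size() * sizeof(double), /*core=*/0);
    auto* acc = static_cast<double*>(kapok::scratchpad_alloc(sizeof(double)));
    *acc = 0.0;
    for (double x : v) *acc += x;
    std::printf("sum = %.1f\n", *acc);
    kapok::scratchpad_free(acc);
    return 0;
}
```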

The Kapok project is focused on reducing the amount of energy consumed in the data supply on chip. Optimizing the coherent cache hierarchy is an important means by which we can do this. Novel structures and programming interfaces must be developed to improve coherency scalability.

[Diagram: a core with its L1, L[2-N] tag slices, L2/L3/L4 cache levels, and a scratchpad.]

Configurability

- Allow programmers to control data location
- Less energy is spent locating and moving the data on loads and stores
- Managed either via hardware or software

Problem -> Proposed Solution/Improvement
- High directory associativity -> Hash-based directories to improve average access latency and energy (see the sketch below)
- Application diversity -> A hardware API that allows for cache configurability
- Long-distance miss traversals -> A hierarchy of directories designed to keep miss distance and energy at a minimum
- Cache miss penalties -> An API that lets the programmer specify block data transfers, pin data, and apply other optimizations
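As a rough illustration of the hash-based directory idea from the table, the sketch below hashes a block address to a directory slice and set; the sizes and the hash function are assumptions for the example, not the project's design.

```cpp
#include <cstdint>
#include <cstdio>

// Toy model of a hash-based directory: a block address is hashed to pick a
// directory slice and a set within it, instead of searching a highly
// associative structure. All sizes here are illustrative assumptions.
constexpr int kSlices       = 64;    // one directory slice per tile
constexpr int kSetsPerSlice = 1024;
constexpr int kBlockBits    = 6;     // 64-byte cache blocks

struct DirLocation {
    int slice;  // which tile's directory slice holds the entry
    int set;    // which set inside that slice
};

// A simple multiplicative hash spreads blocks across slices and sets, so the
// average lookup touches one small set rather than a wide associative search.
inline DirLocation locate(std::uint64_t paddr) {
    std::uint64_t block = paddr >> kBlockBits;
    std::uint64_t h = block * 0x9E3779B97F4A7C15ull;
    DirLocation loc;
    loc.slice = static_cast<int>(h % kSlices);
    loc.set   = static_cast<int>((h / kSlices) % kSetsPerSlice);
    return loc;
}

int main() {
    for (std::uint64_t a = 0; a < 4; ++a) {
        DirLocation loc = locate(a * 0x1000);  // a few page-aligned addresses
        std::printf("addr 0x%llx -> slice %d, set %d\n",
                    static_cast<unsigned long long>(a * 0x1000), loc.slice, loc.set);
    }
    return 0;
}
```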

Scalability Problems

Energy savings are due to the reduced travel distance of cache traffic. Other applications do not benefit from the hierarchy, due to poor reuse or not fitting in the cache.

Remote Messaging & Communication

Since communication and memory energy does not scale with computational energy, data movement will become a larger problem as devices scale. Active messaging, block transfers, and fast barriers are examples of efficient communication mechanisms provided by Kapok.

Active Messages

- The key to reducing the amount of energy consumed in cache coherence protocols is simple: do not miss
- Access highly contended variables and locks at their home node via active messages instead of invalidating loads and stores (see the sketch below)
- A configurable cache hierarchy allows programmers to take advantage of different forms of sharing
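The sketch below emulates, on a single thread, the contrast these bullets describe: shipping an increment to a counter's home node rather than migrating the cache line to each writer. The runtime shown (AmRuntime, send, drain) is a hypothetical stand-in for illustration, not Kapok's implementation.

```cpp
#include <cstdio>
#include <functional>
#include <queue>

// Toy emulation of the "do not miss" idea: instead of each thread loading and
// storing a contended counter (invalidating it back and forth), threads send a
// small operation to the counter's home node, which applies it locally.
struct HomeNode {
    long counter = 0;  // the contended variable lives only at its home node
};

using ActiveMessage = std::function<void(HomeNode&)>;

struct AmRuntime {
    HomeNode home;
    std::queue<ActiveMessage> inbox;  // messages waiting at the home node

    // A remote thread ships work to the data instead of pulling the data to itself.
    void send(ActiveMessage am) { inbox.push(std::move(am)); }

    // The home node executes queued messages in order, with no invalidations.
    void drain() {
        while (!inbox.empty()) {
            inbox.front()(home);
            inbox.pop();
        }
    }
};

int main() {
    AmRuntime rt;
    for (int t = 0; t < 8; ++t)                       // eight "remote threads"
        rt.send([](HomeNode& h) { h.counter += 1; }); // AM: increment at the home node
    rt.drain();
    std::printf("counter = %ld\n", rt.home.counter);  // prints 8
    return 0;
}
```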

The programming system will simplify interacting with the underlying memory system without compromising configurability. Programmers should only need to focus on expressing high level intent. Syntax analysis and profiling will partially automate selecting the communication mechanism. Annotations in code will signal programmer intent.

[Diagram: Threads A, B, and C each write to memory; a contended memory address uses remote writes, while data with good locality uses the cache.]

Profiling

- Profiling information is used to suggest communication mechanisms (a sketch of such a heuristic follows below)
- Design compilers to automatically select communication mechanisms for programs
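A minimal sketch of the kind of heuristic such profiling could feed: count distinct writers per cache block and suggest remote writes for heavily shared blocks. The threshold and data structures are illustrative assumptions, not the project's actual compiler pass.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

// Toy profile-driven heuristic: blocks written by many threads are suggested
// for remote (active-message) writes; blocks with a single writer stay in the
// coherent cache.
struct WriteProfile {
    std::unordered_map<std::uint64_t, std::unordered_set<int>> writers;  // block -> thread ids

    void record_write(std::uint64_t addr, int thread) {
        writers[addr >> 6].insert(thread);  // 64-byte blocks
    }

    const char* suggest(std::uint64_t addr) const {
        auto it = writers.find(addr >> 6);
        std::size_t n = (it == writers.end()) ? 0 : it->second.size();
        return (n >= 3) ? "remote writes (contended)" : "cache (good locality)";
    }
};

int main() {
    WriteProfile prof;
    // Threads A, B, and C all write block 0x1000; only thread A writes 0x2000.
    prof.record_write(0x1000, 0);
    prof.record_write(0x1000, 1);
    prof.record_write(0x1000, 2);
    prof.record_write(0x2000, 0);
    std::printf("0x1000 -> %s\n", prof.suggest(0x1000));
    std::printf("0x2000 -> %s\n", prof.suggest(0x2000));
    return 0;
}
```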

Splash 2 Radix Sort Energy Consumption

Operation            Energy (AU)
64b integer add      1
64b flop             50
Read 8 kB cache      30
Route 64b on chip    160
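To make the table concrete, the short calculation below assumes an illustrative kernel that performs one flop, one read from an 8 kB cache, and one cross-chip 64-bit transfer per iteration; this access pattern is an assumption for the example, not a measured workload.

```cpp
#include <cstdio>

int main() {
    // Relative energies from the table above (arbitrary units).
    const double e_flop      = 50.0;
    const double e_cache_8kb = 30.0;
    const double e_route_64b = 160.0;

    // Assumed kernel: per iteration, one flop, one cache read, and one 64-bit
    // operand routed across the chip.
    const double compute  = e_flop;
    const double movement = e_cache_8kb + e_route_64b;
    const double total    = compute + movement;

    // Even when every operation is a flop, data supply dominates the energy.
    std::printf("compute: %.0f AU (%.0f%%), data supply: %.0f AU (%.0f%%)\n",
                compute, 100.0 * compute / total,
                movement, 100.0 * movement / total);
    return 0;
}
```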

[Charts: speedup of active messages over the baseline (y-axis 0-4) for BFS, Hash Table, Kmeans, and Radix Sort; energy normalized to the baseline (BL) versus active messages (AM) for the same benchmarks (y-axis 0-1.4), broken down into Core, L1 Cache, L2/L3 Cache, DRAM, Network (Memory), and Network (AM).]

- Data movement is 45% of the energy in a many-core radix sort, 37% in FFT, and 88% in a hash table
- Cache-coherent shared memory obfuscates this energy
- Improve energy efficiency and performance in many-core processors
- Our research targets all of the types of energy consumption shown

[Timing diagrams: under cache coherence, Thread 1 loads the lock, receives the lock and data, executes, and unlocks (sending invalidates), while Thread 2's lock load fails and it must wait before re-loading the lock and data; with active messages, threads T0 and T1 each assemble and send a message (AM1, AM2) to the home node and wait for a reply while the home node executes AM1 and AM2 and returns AM1_Reply and AM2_Reply.]

Cache Hierarchy Energy Breakdown

A hierarchical directory can reduce the total energy required.