Transcript

Curt Harting, Vishal Parikh, Milad Mohammadi, Tarun Pondicherry, Prof. William J. Dally


Energy-Efficient Super Computing (Stanford University)

Poster sections: Motivation · Overview · Cache Hierarchy · Memory & Communication · Application Energy Breakdown · Programming System

Overview

The goal of the Efficient Supercomputing Project is to significantly reduce the amount of energy consumed executing scientific code while providing programmers an API that allows for productive algorithm implementation. We do this by exposing locality to the programmer, minimizing unnecessary network traffic, and reducing cache contention and metadata overhead.

Goals

- Design a high-performance, efficient architecture that provides parallelism with minimal overhead across 100s-1000s of cores
- Enable faster, more efficient code through software configuration of cache hierarchies and active messages
- Provide a programming system that allows developers to productively implement algorithms that optimally use the hardware

Power = (Accesses/sec) × E_Cache + (Bits/sec) × E_Comm + (Operations/sec) × E_CoreOp

Motivation

TCO of a Data Center¹

55% due to power requirements

1. Barroso, Hölzle. The Datacenter as a Computer. Morgan & Claypool, 2009.

[Pie chart, TCO breakdown: Data Center Capital 8%, Server Capital 28%, Data Center Op-Ex 8%, Server Op-Ex 1%, Power Provisioning 16%, Power Overhead 28%, Server Power 11%]

- Consumer demand for computational capabilities is increasing, while power envelopes are stationary or decreasing.
- Scaling device dimensions and supply voltage used to reduce energy per operation enough; that is no longer the case.
- Architectural innovation becomes critical to making computers more energy efficient and allowing performance to continue to grow.

[Pie chart, application energy breakdown: Core 31%, Caches 10%, DRAM 14%, Network 45%]

Cache Hierarchy

Exposed Data Locality

- Software configuration of cache hierarchies improves performance.
- Allow the user to configure cache domains.
- Provide APIs for pinning data to local storage.
- Convert portions of SRAM to non-coherent, locally addressed scratchpad memory.

The Kapok project is focused on reducing the amount of energy consumed in the data supply on chip. Optimizing the coherent cache hierarchy is an important means by which we can do this. Novel structures and programming interfaces must be developed to improve coherency scalability.

[Diagram: per-core hierarchy of Core, L1, L2, L3, L4 with L[2-N] tag slices and a scratchpad]

Configurability

- Allow programmers to control data location.
- Spend less energy locating and moving data on loads and stores.
- Manage placement either via hardware or software.
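The configurability ideas above can be sketched as a software interface. This is a minimal Python model; the names (`pin`, `unpin`, `scratchpad_alloc`) are hypothetical illustrations, not Kapok's actual API, and the capacity accounting is deliberately simplistic.

```python
class ConfigurableCache:
    """Toy model of a software-configurable local cache/SRAM budget."""

    def __init__(self, sram_bytes):
        self.sram_bytes = sram_bytes
        self.pinned = {}           # address -> bytes pinned in local storage
        self.scratchpad_bytes = 0  # SRAM converted to non-coherent scratchpad

    def _used(self):
        return sum(self.pinned.values()) + self.scratchpad_bytes

    def pin(self, addr, size):
        """Pin a data range into local storage so accesses stay local."""
        if self._used() + size > self.sram_bytes:
            raise MemoryError("not enough local SRAM")
        self.pinned[addr] = size

    def unpin(self, addr):
        self.pinned.pop(addr, None)

    def scratchpad_alloc(self, size):
        """Convert part of the SRAM into locally addressed scratchpad;
        returns a (hypothetical) local base offset."""
        if self._used() + size > self.sram_bytes:
            raise MemoryError("not enough local SRAM")
        base = self.scratchpad_bytes
        self.scratchpad_bytes += size
        return base
```

A program with a hot working set would pin it and carve out scratchpad for purely local temporaries, so neither generates coherence traffic.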

Scalability Problems

Problem | Proposed Solution/Improvement
High directory associativity | Hash-based directories to improve average access latency and energy
Application diversity | A hardware API that allows for cache configurability
Long-distance miss traversals | A hierarchy of directories designed to keep miss distance and energy at a minimum
Cache miss penalties | An API that lets the programmer specify block data transfers, data pinning, and other optimizations
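The hash-based directory row above can be modeled in a few lines. This is a toy Python sketch (class and field names are illustrative, not the hardware design): line addresses hash into buckets that track sharer sets, avoiding one wide, highly associative lookup structure.

```python
class HashedDirectory:
    """Toy model of a hash-based coherence directory.

    Instead of a single highly associative structure, each cache-line
    address hashes to a bucket; the bucket maps line -> set of sharers.
    """

    def __init__(self, num_buckets=256):
        self.buckets = [dict() for _ in range(num_buckets)]

    def _bucket(self, line_addr):
        # Simple illustrative hash: fold high bits into low bits.
        return (line_addr ^ (line_addr >> 7)) % len(self.buckets)

    def add_sharer(self, line_addr, core_id):
        bucket = self.buckets[self._bucket(line_addr)]
        bucket.setdefault(line_addr, set()).add(core_id)

    def sharers(self, line_addr):
        return self.buckets[self._bucket(line_addr)].get(line_addr, set())

    def invalidate(self, line_addr):
        """On a write, return the sharers to invalidate and clear the entry."""
        bucket = self.buckets[self._bucket(line_addr)]
        return bucket.pop(line_addr, set())
```

Because a lookup touches only one small bucket, the average access (and hence its energy) is bounded by bucket occupancy rather than total directory size.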

Energy savings come from the reduced travel distance of cache traffic. Other applications do not benefit from the hierarchy, due to poor reuse or not fitting in the cache.

Remote Messaging & Communication

Because communication and memory energy is not scaling down as fast as computational energy, data movement will become a larger problem as devices scale. Active messaging, block transfers, and fast barriers are examples of efficient communication mechanisms provided by Kapok.

Active Messages

- The key to reducing the energy consumed by cache coherency protocols is simple: do not miss.
- Access highly contended variables and locks at their home node via active messages instead of invalidating loads and stores.
- A configurable cache hierarchy allows programmers to take advantage of different forms of sharing.
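The active-message idea above can be sketched as a toy simulation, assuming a shared counter homed on one node: rather than each thread pulling the cache line back and forth with invalidating loads and stores, threads ship a small function to the home node, which executes it in place and replies. The class and message names are invented for illustration.

```python
from queue import Queue

class HomeNode:
    """Toy home node: executes active messages against its local data."""

    def __init__(self):
        self.data = {"lock_count": 0}
        self.inbox = Queue()

    def send(self, fn, reply_to):
        # An "active message": code shipped to the data, not data to the code.
        self.inbox.put((fn, reply_to))

    def run(self):
        while not self.inbox.empty():
            fn, reply_to = self.inbox.get()
            # Execute at the home node; append the result as the reply.
            reply_to.append(fn(self.data))

def increment_lock(d):
    """The shipped operation: update the contended variable in place."""
    d["lock_count"] += 1
    return d["lock_count"]

home = HomeNode()
replies = []
# Two "threads" update the contended counter remotely, with no
# invalidations or line migration between their caches.
home.send(increment_lock, replies)
home.send(increment_lock, replies)
home.run()
```

The contended line never leaves its home node, so the coherence traffic of the load/store version (invalidates, refetches) disappears; only the small request and reply messages traverse the network.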

Programming System

The programming system will simplify interacting with the underlying memory system without compromising configurability. Programmers should only need to focus on expressing high-level intent. Syntax analysis and profiling will partially automate selecting the communication mechanism, and annotations in code will signal programmer intent.

[Diagram: Threads A, B, and C writing; a contended memory address uses remote writes, while data with good locality uses the cache]

Profiling

- Use profiling information to suggest communication mechanisms.
- Design compilers to automatically select communication mechanisms for programs.
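The profiling idea above can be sketched as a simple heuristic: count how many distinct threads write each address, flag heavily shared addresses for remote writes, and let data written by one thread keep using the cache. The function name, trace format, and threshold are illustrative assumptions, not the project's actual compiler pass.

```python
from collections import defaultdict

def suggest_mechanisms(write_trace, contention_threshold=2):
    """Map each address to a suggested communication mechanism.

    write_trace: iterable of (thread_id, address) write records.
    Addresses written by >= contention_threshold distinct threads are
    treated as contended, so remote writes at the home node are
    suggested; the rest have good locality and keep using the cache.
    """
    writers = defaultdict(set)
    for thread_id, addr in write_trace:
        writers[addr].add(thread_id)
    return {
        addr: ("remote_write" if len(threads) >= contention_threshold
               else "cache")
        for addr, threads in writers.items()
    }

# Mirrors the diagram: threads A, B, C all write 0x10; only A writes 0x20.
trace = [("A", 0x10), ("B", 0x10), ("C", 0x10), ("A", 0x20)]
suggestions = suggest_mechanisms(trace)
```

A compiler could emit these suggestions as the code annotations mentioned above, or apply them automatically when confidence is high.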

Splash 2 Radix Sort Energy Consumption

Operation | Energy (AU)
64b integer add | 1
64b flop | 50
Read 8 kB cache | 30
Route 64b on chip | 160
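Using the energy table above (in arbitrary units), one can estimate where an application's energy goes from its operation counts. A minimal sketch; the operation mix below is invented for illustration, not measured data.

```python
# Energy per operation in arbitrary units, taken from the table above.
ENERGY_AU = {
    "int_add_64b": 1,
    "flop_64b": 50,
    "read_8kb_cache": 30,
    "route_64b_on_chip": 160,
}

def energy_breakdown(op_counts):
    """Return {op: (energy, fraction_of_total)} for an operation mix."""
    totals = {op: count * ENERGY_AU[op] for op, count in op_counts.items()}
    grand_total = sum(totals.values())
    return {op: (e, e / grand_total) for op, e in totals.items()}

# Hypothetical mix: even with 10x fewer routes than adds, on-chip
# communication dominates the energy budget.
mix = {
    "int_add_64b": 1000,
    "flop_64b": 100,
    "read_8kb_cache": 200,
    "route_64b_on_chip": 100,
}
breakdown = energy_breakdown(mix)
```

In this made-up mix, routing accounts for 16,000 of 28,000 AU (about 57%), illustrating why the project targets data movement rather than arithmetic.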

[Chart: speedup (0-4x) of active messages for each benchmark: BFS, Hash Table, Kmeans, Radix Sort]

[Chart: energy normalized to baseline (BL) vs. active messages (AM) for each benchmark (Splash: BFS, Hash Table, Kmeans, Radix Sort), broken down into Core, L1 Cache, L2/L3 Cache, DRAM, Network (Memory), and Network (AM)]

Application Energy Breakdown

- Data movement is 45% of the energy in a many-core radix sort, 37% in FFT, and 88% in a hash table.
- Cache-coherent shared memory obfuscates this energy.
- Improve energy efficiency and performance in many-core processors.
- Our research targets all types of energy consumption shown.

[Timeline diagram, locks via coherent loads/stores vs. active messages: Thread 1 receives the lock and loads the data while Thread 2's lock load fails and it waits through invalidates and an unlock before receiving the lock and data. With active messages, threads T0 and T1 assemble and send AM1 and AM2 to the home node, which executes each in turn and returns AM1_Reply and AM2_Reply.]

Cache Hierarchy Energy Breakdown

A hierarchical directory can reduce the total energy required.
