



Energy-Efficient Super Computing
Stanford University

Curt Harting, Vishal Parikh, Milad Mohammadi, Tarun Pondicherry, Prof. William J. Dally

Poster panels: Motivation · Overview · Application Energy Breakdown · Cache Hierarchy · Memory & Communication · Programming System

The goal of the Efficient Supercomputing Project is to significantly reduce the amount of energy consumed executing scientific code while providing programmers an API that allows for productive algorithm implementation. We do this by exposing locality to the programmer, minimizing unnecessary network traffic, and reducing cache contention and metadata overhead.

Goals
- Design a high-performance, efficient architecture that provides parallelism with minimal overhead across 100s-1000s of cores
- Enable faster, more efficient code through software configuration of cache hierarchies and active messages
- Provide a programming system that allows developers to productively implement algorithms that make optimal use of the hardware

Power = E_CoreOp * (Operations/sec) + E_Comm * d * (Bits/sec) + E_Cache * (Accesses/sec)

where d is the distance the bits travel on chip.
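To make the relation concrete, the sketch below evaluates the power model for a single core. Every numeric value (per-event energies and activity rates) is an invented placeholder for illustration, not a measurement from the project.

```cpp
#include <cstdio>

int main() {
    // Hypothetical per-event energies (placeholder values only).
    const double E_CoreOp = 20e-12;  // joules per core operation
    const double E_Comm   = 1e-12;   // joules per bit per unit distance
    const double E_Cache  = 50e-12;  // joules per cache access

    // Hypothetical activity rates for one core.
    const double ops_per_sec      = 2e9;   // Operations/sec
    const double bits_per_sec     = 16e9;  // Bits/sec moved on the network
    const double d                = 4.0;   // average distance (hops) per bit
    const double accesses_per_sec = 1e9;   // cache Accesses/sec

    // Power = E_CoreOp*(Ops/sec) + E_Comm*d*(Bits/sec) + E_Cache*(Accesses/sec)
    const double p_core  = E_CoreOp * ops_per_sec;
    const double p_comm  = E_Comm * d * bits_per_sec;
    const double p_cache = E_Cache * accesses_per_sec;

    std::printf("core %.3f W + comm %.3f W + cache %.3f W = %.3f W\n",
                p_core, p_comm, p_cache, p_core + p_comm + p_cache);
    return 0;
}
```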

TCO of a Data Center [1]

Data Center Capital 8%, Server Capital 28%, Data Center OpEx 8%, Server OpEx 1%, Power Provisioning 16%, Power Overhead 28%, Server Power 11%; power requirements account for 55% of the total.

1. Barroso, L. A., and Hölzle, U. The Datacenter as a Computer. Morgan & Claypool, 2009.

- Consumer demand for computational capabilities is increasing, while power envelopes are stationary or decreasing.
- Scaling of device dimensions and supply voltage used to reduce energy per operation enough; that is no longer the case.
- Architectural innovation is therefore critical to making computers more energy efficient and allowing performance to continue to grow.

Energy breakdown: Core 31%, Caches 10%, DRAM 14%, Network 45%.

Exposed Data Locality

- Software configuration of cache hierarchies improves performance
- Allow the user to configure cache domains
- Provide APIs for pinning data to local storage
- Convert portions of SRAM to non-coherent, locally addressed scratchpad memory (a hypothetical API sketch follows below)
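A minimal sketch of what such a locality interface could look like, assuming hypothetical calls (kapok::pin, kapok::scratchpad_alloc) with stub bodies; this is not the project's actual API.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical interface for exposing data locality to the programmer.
// The names below are illustrative only, not the project's real API; the
// stub bodies simply emulate the calls on an ordinary machine.
namespace kapok {

// Pin a buffer into the cache domain owned by `core` so that loads and
// stores issued by that core stay local. (Stub: no-op here.)
inline void pin(const void* /*data*/, std::size_t /*bytes*/, int /*core*/) {}

// Carve a region of local SRAM out of the coherent hierarchy and use it as a
// non-coherent, locally addressed scratchpad. (Stub: plain heap allocation.)
inline void* scratchpad_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void  scratchpad_free(void* p)            { std::free(p); }

}  // namespace kapok

int main() {
    std::vector<double> v(1 << 20, 1.0);
    // Keep the input pinned near core 0 and accumulate into scratchpad storage.
    kapok::pin(v.data(), v.size() * sizeof(double), /*core=*/0);
    auto* acc = static_cast<double*>(kapok::scratchpad_alloc(sizeof(double)));
    *acc = 0.0;
    for (double x : v) *acc += x;
    std::printf("sum = %.1f\n", *acc);
    kapok::scratchpad_free(acc);
    return 0;
}
```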

The Kapok project is focused on reducing the amount of energy consumed in the data supply on chip. Optimizing the coherent cache hierarchy is an important means by which we can do this. Novel structures and programming interfaces must be developed to improve coherency scalability.

[Diagram: a core with its L1, L[2-N] tag slices, L2/L3/L4 cache levels, and a scratchpad.]

Configurability

- Allow programmers to control data location
- Less energy is spent locating and moving the data on loads and stores
- Managed either via hardware or software

Problem -> Proposed Solution/Improvement
- High directory associativity -> Hash-based directories to improve average access latency and energy (see the sketch below)
- Application diversity -> A hardware API that allows for cache configurability
- Long-distance miss traversals -> A hierarchy of directories designed to keep miss distance and energy at a minimum
- Cache miss penalties -> An API that lets the programmer specify block data transfers, pin data, and apply other optimizations
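As a rough illustration of the hash-based directory idea from the table, the sketch below hashes a block address to a directory slice and set; the sizes and the hash function are assumptions for the example, not the project's design.

```cpp
#include <cstdint>
#include <cstdio>

// Toy model of a hash-based directory: a block address is hashed to pick a
// directory slice and a set within it, instead of searching a highly
// associative structure. All sizes here are illustrative assumptions.
constexpr int kSlices       = 64;    // one directory slice per tile
constexpr int kSetsPerSlice = 1024;
constexpr int kBlockBits    = 6;     // 64-byte cache blocks

struct DirLocation {
    int slice;  // which tile's directory slice holds the entry
    int set;    // which set inside that slice
};

// A simple multiplicative hash spreads blocks across slices and sets, so the
// average lookup touches one small set rather than a wide associative search.
inline DirLocation locate(std::uint64_t paddr) {
    std::uint64_t block = paddr >> kBlockBits;
    std::uint64_t h = block * 0x9E3779B97F4A7C15ull;
    DirLocation loc;
    loc.slice = static_cast<int>(h % kSlices);
    loc.set   = static_cast<int>((h / kSlices) % kSetsPerSlice);
    return loc;
}

int main() {
    for (std::uint64_t a = 0; a < 4; ++a) {
        DirLocation loc = locate(a * 0x1000);  // a few page-aligned addresses
        std::printf("addr 0x%llx -> slice %d, set %d\n",
                    static_cast<unsigned long long>(a * 0x1000), loc.slice, loc.set);
    }
    return 0;
}
```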

Scalability Problems

Energy savings are due to the reduced travel distance of cache traffic. Other applications do not benefit from the hierarchy, due to poor reuse or not fitting in the cache.

Remote Messaging & Communication

Since communication and memory energy does not scale with computational energy, data movement will become a larger problem as devices scale. Active messaging, block transfers, and fast barriers are examples of efficient communication mechanisms provided by Kapok.

Active Messages

- The key to reducing the amount of energy consumed in cache coherence protocols is simple: do not miss
- Access highly contended variables and locks at their home node via active messages instead of invalidating loads and stores (see the sketch below)
- A configurable cache hierarchy allows programmers to take advantage of different forms of sharing
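The sketch below emulates, on a single thread, the contrast these bullets describe: shipping an increment to a counter's home node rather than migrating the cache line to each writer. The runtime shown (AmRuntime, send, drain) is a hypothetical stand-in for illustration, not Kapok's implementation.

```cpp
#include <cstdio>
#include <functional>
#include <queue>

// Toy emulation of the "do not miss" idea: instead of each thread loading and
// storing a contended counter (invalidating it back and forth), threads send a
// small operation to the counter's home node, which applies it locally.
struct HomeNode {
    long counter = 0;  // the contended variable lives only at its home node
};

using ActiveMessage = std::function<void(HomeNode&)>;

struct AmRuntime {
    HomeNode home;
    std::queue<ActiveMessage> inbox;  // messages waiting at the home node

    // A remote thread ships work to the data instead of pulling the data to itself.
    void send(ActiveMessage am) { inbox.push(std::move(am)); }

    // The home node executes queued messages in order, with no invalidations.
    void drain() {
        while (!inbox.empty()) {
            inbox.front()(home);
            inbox.pop();
        }
    }
};

int main() {
    AmRuntime rt;
    for (int t = 0; t < 8; ++t)                       // eight "remote threads"
        rt.send([](HomeNode& h) { h.counter += 1; }); // AM: increment at the home node
    rt.drain();
    std::printf("counter = %ld\n", rt.home.counter);  // prints 8
    return 0;
}
```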

The programming system will simplify interacting with the underlying memory system without compromising configurability. Programmers should only need to focus on expressing high level intent. Syntax analysis and profiling will partially automate selecting the communication mechanism. Annotations in code will signal programmer intent.

[Diagram: Threads A, B, and C each write to memory; a contended memory address uses remote writes, while data with good locality uses the cache.]

Profiling

- Profiling information is used to suggest communication mechanisms (a sketch of such a heuristic follows below)
- Design compilers to automatically select communication mechanisms for programs
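A minimal sketch of the kind of heuristic such profiling could feed: count distinct writers per cache block and suggest remote writes for heavily shared blocks. The threshold and data structures are illustrative assumptions, not the project's actual compiler pass.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <unordered_set>

// Toy profile-driven heuristic: blocks written by many threads are suggested
// for remote (active-message) writes; blocks with a single writer stay in the
// coherent cache.
struct WriteProfile {
    std::unordered_map<std::uint64_t, std::unordered_set<int>> writers;  // block -> thread ids

    void record_write(std::uint64_t addr, int thread) {
        writers[addr >> 6].insert(thread);  // 64-byte blocks
    }

    const char* suggest(std::uint64_t addr) const {
        auto it = writers.find(addr >> 6);
        std::size_t n = (it == writers.end()) ? 0 : it->second.size();
        return (n >= 3) ? "remote writes (contended)" : "cache (good locality)";
    }
};

int main() {
    WriteProfile prof;
    // Threads A, B, and C all write block 0x1000; only thread A writes 0x2000.
    prof.record_write(0x1000, 0);
    prof.record_write(0x1000, 1);
    prof.record_write(0x1000, 2);
    prof.record_write(0x2000, 0);
    std::printf("0x1000 -> %s\n", prof.suggest(0x1000));
    std::printf("0x2000 -> %s\n", prof.suggest(0x2000));
    return 0;
}
```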

Splash 2 Radix Sort Energy Consumption

Operation            Energy (AU)
64b integer add      1
64b flop             50
Read 8 kB cache      30
Route 64b on chip    160
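To make the table concrete, the short calculation below assumes an illustrative kernel that performs one flop, one read from an 8 kB cache, and one cross-chip 64-bit transfer per iteration; this access pattern is an assumption for the example, not a measured workload.

```cpp
#include <cstdio>

int main() {
    // Relative energies from the table above (arbitrary units).
    const double e_flop      = 50.0;
    const double e_cache_8kb = 30.0;
    const double e_route_64b = 160.0;

    // Assumed kernel: per iteration, one flop, one cache read, and one 64-bit
    // operand routed across the chip.
    const double compute  = e_flop;
    const double movement = e_cache_8kb + e_route_64b;
    const double total    = compute + movement;

    // Even when every operation is a flop, data supply dominates the energy.
    std::printf("compute: %.0f AU (%.0f%%), data supply: %.0f AU (%.0f%%)\n",
                compute, 100.0 * compute / total,
                movement, 100.0 * movement / total);
    return 0;
}
```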

[Charts: speedup of active messages over the baseline (y-axis 0-4) for BFS, Hash Table, Kmeans, and Radix Sort; energy normalized to the baseline (BL) versus active messages (AM) for the same benchmarks (y-axis 0-1.4), broken down into Core, L1 Cache, L2/L3 Cache, DRAM, Network (Memory), and Network (AM).]

- Data movement is 45% of the energy in a many-core radix sort, 37% in FFT, and 88% in a hash table
- Cache-coherent shared memory obfuscates this energy
- Improve energy efficiency and performance in many-core processors
- Our research targets all of the types of energy consumption shown

[Timing diagrams: under cache coherence, Thread 1 loads the lock, receives the lock and data, executes, and unlocks (sending invalidates), while Thread 2's lock load fails and it must wait before re-loading the lock and data; with active messages, threads T0 and T1 each assemble and send a message (AM1, AM2) to the home node and wait for a reply while the home node executes AM1 and AM2 and returns AM1_Reply and AM2_Reply.]

Cache Hierarchy Energy Breakdown

A hierarchical directory can reduce the total energy required.