Energy-Efficient Super Computing: Curt Harting, Vishal Parikh ... Application Energy...


Post on 21-Jul-2020


  • Curt Harting, Vishal Parikh, Milad Mohammadi, Tarun Pondicherry, Prof. William J. Dally


    Energy-Efficient Super Computing
    Stanford University

    Poster sections: Motivation, Overview, Cache Hierarchy, Memory & Communication, Application Energy Breakdown, Programming System

    The goal of the Efficient Supercomputing Project is to significantly reduce the amount of energy consumed executing scientific code while providing programmers an API that allows for productive algorithm implementation. We do this by exposing locality to the programmer, minimizing unnecessary network traffic, and reducing cache contention and meta-data overhead.

    Goals

    - Design a high-performance, efficient architecture that provides parallelism with minimal overhead over 100s-1000s of cores
    - Enable faster, more efficient code through software configuration of cache hierarchies and active messages
    - Provide a programming system that allows developers to productively implement algorithms that optimally use hardware

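    $$\text{Power} = E_{\text{CoreOp}} \cdot \frac{\text{Operations}}{\text{sec}} + E_{\text{Cache}} \cdot \frac{\text{Accesses}}{\text{sec}} + E_{\text{Comm}} \cdot \frac{\text{Bits} \cdot d}{\text{sec}}$$

    TCO of a Data Center [1]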

    55% of TCO is due to power requirements (Power Provisioning 16% + Power Overhead 28% + Server Power 11%)

    1. Barroso, Hölzle. The Data Center as a Computer. Morgan and Claypool. 2009.

    - Data Center Capital: 8%
    - Server Capital: 28%
    - Data Center Op-Ex: 8%
    - Server Op-Ex: 1%
    - Power Provisioning: 16%
    - Power Overhead: 28%
    - Server Power: 11%

    - Consumer demand for computational capabilities is increasing, while power envelopes are stationary or decreasing.
    - Scaling device dimensions and supply voltage used to scale energy per operation enough; however, that is no longer the case.
    - Architectural innovation becomes critical in making computers more energy efficient and allowing performance to continue to grow.

    [Figure: energy breakdown. Core: 31%, Caches: 10%, DRAM: 14%, Network: 45%]

    Exposed Data Locality

    - Software configuration of cache hierarchies improves performance
    - Allow the user to configure cache domains
    - Provide APIs to allow for pinning data to local storage
    - Convert portions of SRAM to non-coherent, locally addressed scratchpad memory
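To illustrate why pinning data to local storage can help, here is a minimal sketch of an LRU cache model in which pinned lines are never evicted. The class `PinnableCache` and its methods are hypothetical and written only for this illustration; they are not the project's API, and the model ignores energy, latency, and coherence.

```python
from collections import OrderedDict

class PinnableCache:
    """Toy LRU cache model; pinned addresses are never chosen as victims."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> pinned flag, least recent first
        self.hits = 0
        self.misses = 0

    def pin(self, address):
        # Bring the line in (if needed) and protect it from eviction.
        self._insert(address, pinned=True)

    def access(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
        else:
            self.misses += 1
            self._insert(address, pinned=False)

    def _insert(self, address, pinned):
        if address in self.lines:
            self.lines[address] = pinned or self.lines[address]
            self.lines.move_to_end(address)
            return
        if len(self.lines) >= self.capacity:
            # Evict the least recently used line that is not pinned.
            victim = next((a for a, p in self.lines.items() if not p), None)
            if victim is None:
                raise RuntimeError("all lines are pinned")
            del self.lines[victim]
        self.lines[address] = pinned

# A streaming access pattern evicts "hot" unless it is pinned.
pinned = PinnableCache(capacity=2)
pinned.pin("hot")
for address in ["a", "b", "c", "hot"]:
    pinned.access(address)

plain = PinnableCache(capacity=2)
for address in ["hot", "a", "b", "c", "hot"]:
    plain.access(address)
```

In this toy run the pinned cache hits on the final access to `"hot"`, while the unpinned cache has already evicted it and misses.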

    The Kapok project is focused on reducing the amount of energy consumed in the data supply on chip. Optimizing the coherent cache hierarchy is an important means by which we can do this. Novel structures and programming interfaces must be developed to improve coherency scalability.

    [Figure: cache hierarchy diagram. Labels: Core, L1, L[2-N] Tag Slices, L2, L3, L4, Scratchpad]

    Configurability

    - Allow programmers to control data location
    - Less energy spent locating and moving the data on loads and stores
    - Managed either via hardware or software

    Problem: Proposed Solution/Improvement

    - High directory associativity: Hash-based directories to improve average access latency and energy
    - Application diversity: A hardware API that allows for cache configurability
    - Long-distance miss traversals: A hierarchy of directories designed to keep miss distance and energy at a minimum
    - Cache miss penalties: An API designed for the programmer to be able to specify block data transfers, pinning data, and other optimizations

    Scalability Problems

    Energy savings are due to reduced travel distance of cache traffic.

    Other applications do not benefit from the hierarchy, due to poor reuse or not fitting in the cache.

    Remote Messaging & Communication

    Since communication and memory energy do not scale with computational energy, data movement will become a larger problem as devices scale. Active messaging, block transfers, and fast barriers are examples of efficient communication mechanisms provided by Kapok.

    Active Messages Energy Speedup

    - The key to reducing the amount of energy consumed in cache coherency protocols is simple: do not miss
    - Access highly contended variables/locks at their home node via active messages instead of invalidating loads and stores
    - Configurable cache hierarchy allows programmers to take advantage of different forms of sharing
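The idea of updating a contended variable at its home node can be sketched as follows. This is a toy model under stated assumptions: `HomeNode`, `send`, and `drain` are hypothetical names, and the message count assumes one request and one reply per active message. It is not a description of Kapok's hardware or protocol.

```python
from collections import deque

class HomeNode:
    """Owns a contended variable and applies handlers that are sent to it."""

    def __init__(self, value=0):
        self.value = value
        self.inbox = deque()
        self.messages = 0  # network messages counted by this toy model

    def send(self, sender, handler, *args):
        # One request message carries the handler and its arguments.
        self.messages += 1
        self.inbox.append((sender, handler, args))

    def drain(self):
        # Handlers run one at a time at the home node, so the variable
        # never migrates between caches and no lock is taken.
        replies = []
        while self.inbox:
            sender, handler, args = self.inbox.popleft()
            self.value = handler(self.value, *args)
            self.messages += 1  # one reply message back to the sender
            replies.append((sender, self.value))
        return replies

def add(current, amount):
    """Handler executed at the home node."""
    return current + amount

home = HomeNode()
for thread_id in range(4):
    home.send(thread_id, add, 10)
replies = home.drain()
```

Each update costs two messages in this model regardless of how many threads contend, whereas a lock-based update would also move the lock and the data between caches.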

    The programming system will simplify interacting with the underlying memory system without compromising configurability. Programmers should only need to focus on expressing high-level intent. Syntax analysis and profiling will partially automate selecting the communication mechanism. Annotations in code will signal programmer intent.

    [Figure: Threads A, B, and C each issue a write. Annotations: "Memory address contended, use remote writes" and "Good locality, use cache"]

    Profiling

    - Profiling information used to suggest communication mechanisms
    - Design compilers to automatically select communication mechanisms for programs
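A minimal sketch of how profiling data might suggest a communication mechanism: count the distinct threads writing each address and recommend remote writes for contended addresses and the cache otherwise. The function name, the trace format, and the threshold of two writers are assumptions made for illustration; the poster does not specify the actual heuristic.

```python
from collections import defaultdict

def suggest_mechanisms(write_trace, contention_threshold=2):
    """Map each address to a suggested mechanism based on its writers.

    write_trace is an iterable of (thread, address) pairs (hypothetical format).
    """
    writers = defaultdict(set)
    for thread, address in write_trace:
        writers[address].add(thread)
    return {
        address: "remote write" if len(threads) >= contention_threshold else "cache"
        for address, threads in writers.items()
    }

# Address 0x40 is written by three threads; 0x80 by only one.
trace = [("A", 0x40), ("B", 0x40), ("C", 0x40), ("A", 0x80), ("A", 0x80)]
suggestions = suggest_mechanisms(trace)
```

Here `0x40` is flagged for remote writes because several threads contend for it, while `0x80` stays in the cache because a single thread reuses it.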

    Splash 2 Radix Sort Energy Consumption

    Operation: Energy (AU)

    - 64b Integer Add: 1
    - 64b Flop: 50
    - Read 8kB Cache: 30
    - Route 64b on chip: 160
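As a worked example of how the relative operation energies listed above combine, the sketch below multiplies each energy by an event count and reports each share of the total. The energies come from the poster; the event counts and the names `ENERGY_AU` and `energy_breakdown` are made up for illustration and do not represent any measured workload.

```python
# Relative energies in arbitrary units (AU), as listed on the poster.
ENERGY_AU = {
    "int_add_64b": 1,
    "flop_64b": 50,
    "cache_read_8kB": 30,
    "route_64b": 160,
}

def energy_breakdown(event_counts):
    """Return per-event energy, total energy, and each event's share."""
    energy = {name: ENERGY_AU[name] * count for name, count in event_counts.items()}
    total = sum(energy.values())
    shares = {name: value / total for name, value in energy.items()}
    return energy, total, shares

# Hypothetical event counts, chosen only to show the arithmetic.
counts = {
    "int_add_64b": 1000,
    "flop_64b": 100,
    "cache_read_8kB": 200,
    "route_64b": 100,
}
energy, total, shares = energy_breakdown(counts)
```

With these made-up counts, routing 100 words costs 16000 AU out of a total of 28000 AU, more than the 1000 integer adds and 100 flops combined, which is the kind of imbalance the poster attributes to data movement.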

    [Figure: Speedup by benchmark (BFS, Hash Table, Kmeans, Radix Sort). Y-axis: Speedup, 0 to 4. X-axis: Benchmark]

    [Figure: Energy by benchmark (BFS, Hash Table, Kmeans, Radix Sort), with BL and AM bars for each benchmark and a Splash bar. Y-axis: Energy, Normalized to BL, 0 to 1.4. X-axis: Benchmark. Legend: Core, L1 Cache, L2/L3 Cache, DRAM, Network (Memory), Network (AM)]

    - Data movement is 45% of the energy in a many-core radix sort, 37% in FFT, 88% in a hash table
    - Cache coherent shared memory obfuscates this energy
    - Improve energy-efficiency and performance in many-core processors
    - Our research targets all types of energy consumption shown

    [Figure: lock-based timeline. Axis: Time. Labels: Thread 1, Thread 2, Load Lock, Rcv Lock, Load Data, Rcv Data, Execute, Fail Lock, Wait, Inv, Unlock (Invalidates)]

    [Figure: active-message timeline. Axis: Time. Labels: T0, T1, Home Node, Assemble & Send, Wait for Reply, Execute AM1, Execute AM2, AM1_Reply, AM2_Reply]

    Cache Hierarchy Energy Breakdown

    A hierarchical directory can reduce the total energy required.
