dally sc10

Upload: alex-vlx

Post on 09-Apr-2018


TRANSCRIPT

  • 8/8/2019 Dally SC10

    1/55

  • 8/8/2019 Dally SC10

    2/55

    GPU Computing

    1. The GPU Advantage
    2. To ExaScale and Beyond
    3. The GPU is the Computer

  • 8/8/2019 Dally SC10

    3/55

    The GPU Advantage

  • 8/8/2019 Dally SC10

    4/55

    A Tale of Two Machines

    The GPU Advantage

  • 8/8/2019 Dally SC10

    5/55

    Tianhe-1A at NSC Tianjin

  • 8/8/2019 Dally SC10

    6/55

    Tianhe-1A at NSC Tianjin

    The World's Fastest Supercomputer

    2.507 Petaflops

    7168 Tesla M2050 GPUs

  • 8/8/2019 Dally SC10

    7/55

    Tesla M2050 GPUs

  • 8/8/2019 Dally SC10

    8/55

    3 of Top 5 Supercomputers

    [Bar chart: Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper; y-axis: Gigaflops, 0 to 2500]

  • 8/8/2019 Dally SC10

    9/55

    Top 5 Performance and Power

    [Bar chart: Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper; y-axis: Gigaflops, 0 to 2500]

  • 8/8/2019 Dally SC10

    10/55

    NVIDIA/NCSA Green 500 Entry

  • 8/8/2019 Dally SC10

    11/55

    NVIDIA/NCSA Green 500 Entry

  • 8/8/2019 Dally SC10

    12/55

    NVIDIA/NCSA Green 500 Entry

    128 nodes, each with:
      1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOP peak)
      1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOP peak)
      4x QDR Infiniband
      4 GB DRAM

    Theoretical Peak Perf: 68.95 TF

    Footprint: ~20 ft^2 => 3.45 TF/ft^2

    Cost: $500K (street price) => 137.9 MF/$

    Linpack: 33.62 TF, 36.0 kW => 934 MF/W
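The derived figures on this slide follow from the per-node numbers by simple arithmetic. The C sketch below recomputes them as a sanity check; the macro and function names are illustrative and are not taken from the talk, and the result is rounded slightly differently from the slide (68.94 TF versus the quoted 68.95 TF).

```c
/* Figures copied from the slide; all names here are illustrative. */
#define NODES           128
#define CPU_GFLOPS      23.4      /* Core i3 530 peak per node   */
#define GPU_GFLOPS      515.2     /* Tesla C2050 peak per node   */
#define LINPACK_TFLOPS  33.62
#define POWER_KW        36.0
#define COST_USD        500000.0
#define FOOTPRINT_FT2   20.0

/* Theoretical peak of the whole cluster, in TFLOPS. */
double peak_tflops(void)
{
    return NODES * (CPU_GFLOPS + GPU_GFLOPS) / 1000.0;
}

/* Measured Linpack efficiency, in MFLOPS per watt. */
double linpack_mflops_per_watt(void)
{
    return (LINPACK_TFLOPS * 1e6) / (POWER_KW * 1e3);
}

/* Peak performance per dollar, in MFLOPS per dollar. */
double peak_mflops_per_dollar(void)
{
    return peak_tflops() * 1e6 / COST_USD;
}

/* Peak performance density, in TFLOPS per square foot. */
double peak_tflops_per_ft2(void)
{
    return peak_tflops() / FOOTPRINT_FT2;
}

/* Fraction of theoretical peak achieved on Linpack. */
double linpack_fraction_of_peak(void)
{
    return LINPACK_TFLOPS / peak_tflops();
}
```

Under these inputs the Linpack run reaches a little under half of theoretical peak, a ratio the slide does not state explicitly.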

  • 8/8/2019 Dally SC10

    13/55

    Efficiency and Programmability

    The GPU Advantage

  • 8/8/2019 Dally SC10

    14/55

    GPU: 200 pJ/Instruction

    CPU: 2 nJ/Instruction
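To make the two energy figures concrete, the following C sketch converts them into an energy ratio and into instruction throughput at a fixed power budget. It assumes only the two per-instruction numbers quoted on the slide; the function names are hypothetical.

```c
/* Energy per instruction from the slide, in picojoules. */
#define GPU_PJ_PER_INSTR 200.0
#define CPU_PJ_PER_INSTR 2000.0   /* 2 nJ */

/* How many times more energy a CPU instruction costs than a GPU one. */
double cpu_to_gpu_energy_ratio(void)
{
    return CPU_PJ_PER_INSTR / GPU_PJ_PER_INSTR;
}

/* Instructions per second sustainable within a power budget (watts),
   given the energy per instruction in picojoules. */
double instructions_per_second(double watts, double pj_per_instr)
{
    return watts / (pj_per_instr * 1e-12);
}
```

At 1 W, for example, this gives roughly 5 billion instructions per second at 200 pJ and roughly 500 million at 2 nJ.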

  • 8/8/2019 Dally SC10

    15/55

  • 8/8/2019 Dally SC10

    16/55

    CUDA GPU Roadmap

    [Chart: DP GFLOPS per Watt (y-axis, 2 to 16) by year (2007, 2009, 2011); points labeled Tesla, Fermi, Kepler]

  • 8/8/2019 Dally SC10

    17/55

    Efficiency and Programmability

    The GPU Advantage

  • 8/8/2019 Dally SC10

    18/55

    CUDA Enables Programmability

    The GPU Advantage

  • 8/8/2019 Dally SC10

    19/55

    CUDA C: C with a Few Keywords

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        // Each thread handles one element; surplus threads in the last block do nothing
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
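The block count `(n + 255) / 256` and the index expression `blockIdx.x*blockDim.x + threadIdx.x` can be checked on a CPU without a GPU. The plain C sketch below emulates the launch with two nested loops standing in for blocks and threads; it is an illustration of the index mapping only, not how CUDA executes the kernel, and `blocks_for` and `saxpy_emulated` are hypothetical names.

```c
#define THREADS_PER_BLOCK 256

/* Ceiling division: smallest block count with nblocks * 256 >= n. */
int blocks_for(int n)
{
    return (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

/* CPU emulation of the parallel launch. Every (block, thread) pair
   computes the same global index the kernel would, and the bounds
   check discards the surplus threads of the last block. */
void saxpy_emulated(int n, float a, const float *x, float *y)
{
    int nblocks = blocks_for(n);
    for (int block = 0; block < nblocks; ++block) {
        for (int thread = 0; thread < THREADS_PER_BLOCK; ++thread) {
            int i = block * THREADS_PER_BLOCK + thread;
            if (i < n)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

For n = 1000 this yields 4 blocks, so 1024 threads are launched and the `i < n` check idles the last 24.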

  • 8/8/2019 Dally SC10

    20/55

    GPU Computing Ecosystem

    Languages & APIs
    DirectX
    Fortran
    Libraries
    Mathematical Packages
    Integrated Development Environment: Parallel Nsight for MS Visual Studio
    Research & Education
    All Major Platforms

  • 8/8/2019 Dally SC10

    21/55

    GPU Computing Today, By the Numbers:

    CUDA Capable GPUs: 200 Million
    CUDA Toolkit Downloads: 600,000
    Active GPU Computing Developers: 100,000
    Members in Parallel Nsight Developer: 8,000
    Universities Teaching CUDA Worldwide: 362
    CUDA Centers of Excellence Worldwide: 11

  • 8/8/2019 Dally SC10

    22/55

    To ExaScale and Beyond

  • 8/8/2019 Dally SC10

    23/55

    Science Needs 1000x More Computing

    [Chart: Gigaflops (log scale, 1 to 1,000,000,000) by year (1982, 1997, 2003, 2006, 2010), with 1 Petaflop and 1 Exaflop marked]

    BPTI: 3K atoms
    Estrogen Receptor: 36K atoms
    F1-ATPase: 327K atoms
    Ribosome: 2.7M atoms
    Chromatophore: 50M atoms

    Ran for 8 months to simulate 2 nanoseconds

  • 8/8/2019 Dally SC10

    24/55

    DARPA Study Identifies Four Challenges for ExaScale Computing

    Report published September

    Four Major Challenges
      Energy and Power challenge
      Memory and Storage challenge
      Concurrency and Locality challenge
      Resiliency challenge

    Number one issue is power
      Extrapolations of current architecture and technology indicate over 100MW
      Power also constrains what we can

    Available at www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf

  • 8/8/2019 Dally SC10

    25/55

    Power is THE Problem

  • 8/8/2019 Dally SC10

    26/55

    A GPU is the Solution

    Power is THE Problem

  • 8/8/2019 Dally SC10

    27/55

    ExaFLOPS at 20MW = 50GFLOPS/W

    [Chart: GFLOPS/W (Core), log scale from 0.1 to 100, by year (2010, 2013); GPU at 5GFLOPS/W, target at 50GFLOPS/W]
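The 50 GFLOPS/W target is just the exaflop rate divided by the 20 MW power budget. A short C sketch of that arithmetic, together with the ratio against the 5 GFLOPS/W GPU figure shown on the chart, follows; the function names are illustrative.

```c
/* Required efficiency in GFLOPS per watt for a machine delivering
   `flops` floating-point operations per second within `watts`. */
double required_gflops_per_watt(double flops, double watts)
{
    return flops / watts / 1e9;
}

/* How far a current efficiency falls short of a required one. */
double efficiency_gap(double required, double current)
{
    return required / current;
}
```

With 1e18 FLOPS and 20e6 W the first function returns 50, and comparing that with 5 GFLOPS/W gives a factor of 10.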

  • 8/8/2019 Dally SC10

    28/55

    10x Energy Gap for Today's GPU

    [Chart: GFLOPS/W (Core), log scale from 0.1 to 100, by year (2010, 2013); GPU at 5GFLOPS/W, ExaFLOPS at 50GFLOPS/W, 10x gap between them]

  • 8/8/2019 Dally SC10

    29/55

    GPUs Close the Gap with Process and Architecture

    [Chart: GFLOPS/W (Core), log scale from 0.1 to 100, by year (2010, 2013)]

  • 8/8/2019 Dally SC10

    30/55

    GPUs Close the Gap with Process and Architecture

    [Chart: GFLOPS/W, log scale from 0.1 to 100, by year (2010, 2013)]

  • 8/8/2019 Dally SC10

    31/55

    GPUs Close the Gap
    With CPUs, a Gap Remains

    [Chart: GFLOPS/W, log scale from 0.1 to 100, by year (2010, 2013)]

  • 8/8/2019 Dally SC10

    32/55

    GPUs Close the Gap
    With CPUs, a Gap Remains

    Heterogeneous Computing is Required to get to ExaScale

    [Chart: GFLOPS/W, log scale from 0.1 to 100, by year (2010, 2013)]

  • 8/8/2019 Dally SC10

    33/55

    NVIDIA's Extreme-Scale Computing Project

    Echelon

  • 8/8/2019 Dally SC10

    34/55

    Echelon Team


  • 8/8/2019 Dally SC10

    35/55

    System Sketch

    Echelon System
      Cabinet 0 (C0): 2.6PF, 205TB/s, 32TB (cabinets C0 to CN)
        Module 0 (M0): 160TF, 12.8TB/s, 2TB (modules M0 to M15)
          Node 0 (N0): 20TF, 1.6TB/s, 256GB (nodes N0 to N7)
            Processor Chip (PC): SM0 to SM127, each with lanes C0 to C7 and L0 caches; NoC; L2 banks L2_0 to L2_1023; MC; NIC
            DRAM Cubes
            NVRAM
        High-Radix Router Module (RM): LC0 to LC7
      Dragonfly Interconnect (optical fiber)

  • 8/8/2019 Dally SC10

    36/55

    Execution Model

    [Diagram labels: Global Address Space; Abstract Memory Hierarchy; Thread; Object; Active Message; Load/Store; objects A and B]

  • 8/8/2019 Dally SC10

    37/55

    The High Cost of Data Movement
    Fetching operands costs more than computing on them

    [Diagram: 20mm chip in 28nm with 256-bit buses]

    64-bit DP: 20 pJ
    256-bit access, 8 kB SRAM: 50 pJ
    Movement over 256-bit buses: 26 pJ, 256 pJ, 1 nJ
    Efficient off-chip: 500 pJ
    DRAM Rd/Wr: 16 nJ
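The slide's point is easiest to see as ratios against the 20 pJ cost of one 64-bit DP operation. The C sketch below computes those ratios and a deliberately crude energy total for a kernel. It assumes only the labeled figures above; the two-term energy model and all names are illustrative simplifications rather than anything presented in the talk.

```c
/* Energies from the slide, in picojoules. */
#define DP_OP_PJ        20.0      /* one 64-bit DP operation          */
#define SRAM_ACCESS_PJ  50.0      /* 256-bit access to an 8 kB SRAM   */
#define OFF_CHIP_PJ     500.0     /* efficient off-chip transfer      */
#define DRAM_ACCESS_PJ  16000.0   /* DRAM read/write (16 nJ)          */

/* Cost of a data access expressed in DP operations. */
double cost_in_dp_ops(double access_pj)
{
    return access_pj / DP_OP_PJ;
}

/* Crude two-term model (an assumption, not from the talk): total
   energy in nanojoules for a kernel that performs `dp_ops` arithmetic
   operations and `dram_accesses` DRAM reads or writes. */
double kernel_energy_nj(long dp_ops, long dram_accesses)
{
    return (dp_ops * DP_OP_PJ + dram_accesses * DRAM_ACCESS_PJ) / 1000.0;
}
```

Under this model a single DRAM access costs as much as 800 DP operations, so a kernel with 1000 operations and one DRAM access already spends close to half its energy on that access.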

  • 8/8/2019 Dally SC10

    38/55

    An NVIDIA ExaScale Machine

  • 8/8/2019 Dally SC10

    39/55

    Lane 4 DFMAs, 20GFLOPS

    [Diagram labels: four DFMA units, LSI, Main Registers, Operand Registers, L0 I$, L0 D$]

  • 8/8/2019 Dally SC10

    40/55

    SM 8 lanes 160GFLOPS

    [Diagram: processor lanes (P) connected through a Switch to the L1$]
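The lane, SM, and chip figures compose multiplicatively. The C sketch below rolls them up and also derives the clock rate the lane figure implies. It assumes each DFMA counts as two floating-point operations per cycle, which the slides do not state, and takes the SM count from the SM0 to SM127 labels in the system sketch; the names are illustrative.

```c
#define DFMAS_PER_LANE  4
#define FLOPS_PER_DFMA  2      /* assumption: one multiply plus one add */
#define LANE_GFLOPS     20.0   /* from the lane slide                   */
#define LANES_PER_SM    8      /* from the SM slide                     */
#define SMS_PER_CHIP    128    /* SM0..SM127 in the system sketch       */

/* Clock rate implied by the lane figure, in GHz. */
double implied_clock_ghz(void)
{
    return LANE_GFLOPS / (DFMAS_PER_LANE * FLOPS_PER_DFMA);
}

/* Peak of one SM, in GFLOPS. */
double sm_gflops(void)
{
    return LANE_GFLOPS * LANES_PER_SM;
}

/* Peak of one processor chip, in TFLOPS. */
double chip_tflops(void)
{
    return sm_gflops() * SMS_PER_CHIP / 1000.0;
}
```

This gives 160 GFLOPS per SM and about 20.5 TFLOPS per chip, consistent with the 20TF node figure, at an implied clock of 2.5 GHz.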

  • 8/8/2019 Dally SC10

    41/55


  • 8/8/2019 Dally SC10

    42/55

    Node MCM 20TF + 256GB

    GPU Chip: 20TF DP, 256MB
    1.4TB/s DRAM BW
    150 Netw
    DRAM Stack (x3)
    NV Memory

  • 8/8/2019 Dally SC10

    43/55

    Cabinet 128 Nodes 2.56PF 3

    32 Modules, 4 Nodes/Module, Central Router Module(s), Dragonfly Interconnect

    [Diagram: MODULEs of four NODEs each, arranged around central ROUTERs]

  • 8/8/2019 Dally SC10

    44/55

    System to ExaScale and Beyond

    Dragonfly Interconnect

    400 Cabinets is ~1EF and ~15MW
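The system-level claim can be cross-checked against the node and cabinet figures from the preceding slides. The C sketch below rolls 20TF nodes up to 400 cabinets and derives the efficiency that ~15MW would imply. The inputs are the rounded slide values and the names are illustrative, so the outputs are estimates rather than design figures.

```c
#define NODE_TFLOPS        20.0
#define NODES_PER_CABINET  128
#define CABINETS           400
#define SYSTEM_MW          15.0

/* Peak of one cabinet, in PFLOPS. */
double cabinet_pflops(void)
{
    return NODE_TFLOPS * NODES_PER_CABINET / 1000.0;
}

/* Peak of the full system, in EFLOPS. */
double system_eflops(void)
{
    return cabinet_pflops() * CABINETS / 1000.0;
}

/* Implied system efficiency, in GFLOPS per watt. */
double system_gflops_per_watt(void)
{
    return (system_eflops() * 1e9) / (SYSTEM_MW * 1e6);
}

/* Implied power per cabinet, in kW. */
double cabinet_kw(void)
{
    return SYSTEM_MW * 1000.0 / CABINETS;
}
```

The roll-up gives 2.56 PF per cabinet and about 1.02 EF for the system, at roughly 68 GFLOPS/W and 37.5 kW per cabinet, which clears the 50 GFLOPS/W threshold derived earlier in the talk.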

  • 8/8/2019 Dally SC10

    45/55

  • 8/8/2019 Dally SC10

    46/55

    GPU Computing is the Future

    1. GPU Computing is #1 Today: On Top 500 AND Dominant on Green 500
    2. GPU Computing Enables ExaScale: At Reasonable Power
    3. The GPU is the Computer: A general purpose computing engine, not just an accelerator
    4. The Real Challenge is Software

  • 8/8/2019 Dally SC10

    47/55

  • 8/8/2019 Dally SC10

    48/55

  • 8/8/2019 Dally SC10

    49/55

    Power is THE Problem

    1. Data Movement Dominates Power
    2. Optimize the Storage Hierarchy
    3. Tailor Memory to the Application

  • 8/8/2019 Dally SC10

    50/55

    Some Applications Have Hierarchical Re-Use

    [Chart: %Miss (y-axis, 0 to 120) versus Size (x-axis, 1.0E+0 to 1.0E+9) for DGEMM]

  • 8/8/2019 Dally SC10

    51/55

    Applications with Hierarchical Reuse Want a Deep Storage Hierarchy

    [Diagram: processors (P), each with an L1, grouped under shared L2s, which share an L3]

  • 8/8/2019 Dally SC10

    52/55

    Some Applications Have Plateaus in Their Working Sets

    [Chart: %Miss (y-axis, 0 to 120) versus Size (x-axis, 1.0E+0 to 1.0E+9) for Table]
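A plateau of this kind can be reproduced with a toy model: a loop that repeatedly sweeps a table through a fully associative LRU cache misses on every access until the cache holds the whole table, then drops to cold misses only. The C sketch below is such a model under those stated assumptions. It is not the workload or the measurement behind the chart, sizes are in entries rather than bytes, and `table_sweep_miss_percent` is a hypothetical helper.

```c
/* Percentage of accesses that miss in a fully associative LRU cache of
   `capacity` entries while a loop sweeps `passes` times over a table of
   `table_size` entries. Returns -1.0 for unsupported arguments. */
double table_sweep_miss_percent(int capacity, int table_size, int passes)
{
    enum { MAX_ENTRIES = 4096 };
    int  tags[MAX_ENTRIES];
    long last_use[MAX_ENTRIES];
    int  used = 0;
    long now = 0, misses = 0;

    if (capacity < 1 || capacity > MAX_ENTRIES || table_size < 1 || passes < 1)
        return -1.0;

    for (int p = 0; p < passes; ++p) {
        for (int item = 0; item < table_size; ++item) {
            int slot = -1;
            ++now;
            for (int s = 0; s < used; ++s)          /* look up the item */
                if (tags[s] == item) { slot = s; break; }
            if (slot < 0) {                         /* miss */
                ++misses;
                if (used < capacity) {
                    slot = used++;                  /* fill an empty entry */
                } else {                            /* evict least recently used */
                    slot = 0;
                    for (int s = 1; s < used; ++s)
                        if (last_use[s] < last_use[slot]) slot = s;
                }
                tags[slot] = item;
            }
            last_use[slot] = now;
        }
    }
    return 100.0 * (double)misses / ((double)passes * table_size);
}
```

In this model adding capacity below the table size buys nothing, which is the argument for skipping intermediate cache levels for such workloads.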

  • 8/8/2019 Dally SC10

    53/55

    Applications with Plateaus Want a Shallow Storage Hierarchy

    [Diagram: processors (P), each with an L1, connected by a NoC to L2s]

  • 8/8/2019 Dally SC10

    54/55

    Configurable Memory Can Do Both At the Same Time

    Flat hierarchy for large working sets

    Deep hierarchy for reuse

    Shared memory for explicit management

    Cache memory for unpredictable sharing

    [Diagram: processors (P), each with an L1, connected by a NoC to banks of SRAM]

  • 8/8/2019 Dally SC10

    55/55

    Configurable Memory Reduces Distance and Energy

    [Diagram: grid of processors (P) with L1s and nearby SRAM banks, connected through ROUTERs]