Asymmetric Multiprocessing

  • Asymmetric Multiprocessing for High Performance Computing

    Stephen Jenks
    Electrical Engineering and Computer Science
    UC Irvine
    sjenks@uci.edu

    Source: NVIDIA

  • Overview and Motivation

    Conventional Parallel Computing Was:
    For scientists with large problem sets
    Complex and difficult
    Very expensive computers with limited access

    Parallel Computing is Becoming:
    Ubiquitous
    Cheap
    Essential
    Different
    Still complex and difficult

    Where Is Parallel Computing Going?

    And What are the Research Challenges?

  • Computing is Changing

    Parallel programming is essential

    Clock speed not increasing much

    Performance gains require parallelism

    Parallelism is changing

    Special purpose parallel engines

    CPU and parallel engine work together

    Different code on CPU & parallel engine: Asymmetric Computing

  • Conventional Processor Architecture Hasn't Changed Much For 40 Years

    Pipelining and Superscalar Since the 1960s

    But has become integrated

    Microprocessors:

    High Clock Speed

    Great performance

    High Power

    Cooling Issues

    Various Solutions

    From Hennessy & Patterson, 2nd Ed.

  • Parallel Computing Problem

    Overview

    Image Relaxation (Blur)

    newimage[i][j] = (image[i][j] + image[i][j-1] + image[i][j+1] +
                      image[i+1][j] + image[i-1][j]) / 5

    Stencil
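
    A minimal C sketch of one relaxation pass over the interior of the image, following the stencil above. The buffer names, WIDTH, and HEIGHT are assumed here for illustration and are not from the slides:

    #define WIDTH  1024   /* assumed image size, for illustration only */
    #define HEIGHT 1024

    /* one blur pass: each interior point becomes the average of itself
       and its four neighbors, written to a separate output buffer */
    void relax(float image[HEIGHT][WIDTH], float newimage[HEIGHT][WIDTH])
    {
        /* skip the outer boundary so every neighbor access stays in bounds */
        for (int i = 1; i < HEIGHT - 1; i++)
            for (int j = 1; j < WIDTH - 1; j++)
                newimage[i][j] = (image[i][j] +
                                  image[i][j-1] + image[i][j+1] +
                                  image[i+1][j] + image[i-1][j]) / 5.0f;
    }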

  • Shared Memory Multiprocessors

    [Diagram: four CPUs, each with its own cache, connected to a shared memory]

    Each CPU computes results for its partition
    Memory is shared, so dependences are satisfied

    CPUs see common address space

  • Shared Memory Programming

    Threads

    POSIX Threads

    Windows Threads

    OpenMP

    User-inserted directives to compiler

    Loop parallelism

    Parallel regions

    Visual Studio

    GCC 4.2

    Not Integrated with Compiler or Language

    No idea if code is in thread or not
    Poor optimizations

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        ...
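
    For comparison with the OpenMP directive above, here is a minimal POSIX Threads sketch that partitions the iterations of a loop across worker threads. The thread count, row count, and worker body are assumed for illustration only:

    #include <pthread.h>

    #define NUM_THREADS 4          /* assumed thread count, for illustration */
    #define NROWS       1024       /* assumed problem size */

    /* each thread processes a contiguous partition of the rows */
    typedef struct { int start, end; } range_t;

    static void *worker(void *arg)
    {
        range_t *r = (range_t *)arg;
        for (int i = r->start; i < r->end; i++) {
            /* ... compute results for row i; shared arrays are visible to all threads ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        range_t   part[NUM_THREADS];
        int chunk = NROWS / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            part[t].start = t * chunk;
            part[t].end   = (t == NUM_THREADS - 1) ? NROWS : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, worker, &part[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }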

  • Multicore Processors

    Several CPU Cores Per Chip

    Shared Memory
    Shared Caches (sometimes)
    Lower Clock Speed

    Lower Power & Heat, But Good Performance

    Program with Threads

    Single-Threaded Code Not Faster
    (The Majority of Code Today)

    Intel Core Duo

    AMD Athlon 64 X2

  • Conventional Processors are Dinosaurs

    So much circuitry dedicated to keeping ALUs fed:

    Cache

    Out-of-order execution/reorder buffer

    Branch prediction

    Large Register Sets

    Simultaneous Multithreading

    ALU tiny by comparison

    Huge power for little performance gain

    With thanks to Stanford's Pat Hanrahan for the analogy

  • AMD Phenom X4 Floorplan

    Source: AMD.com

  • Single Nehalem (Intel Core i7) Core

    Nehalem - Everything You Need to Know about Intel's New Architecture

    http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382

  • NVIDIA GPU Floorplan

    Source: Dr. Sumit Gupta - NVIDIA

  • Asymmetric Parallel Accelerators

    Current Cores are Powerful and General
    Some Applications Only Need Certain Operations
    Perhaps a Simpler Processor Could Be Faster

    Pair General CPUs with Specialized Accelerators:
    Graphics Processing Unit (GPU)
    Field Programmable Gate Array (FPGA)
    Single Instruction, Multiple Data (SIMD) Processor

    [Diagram: possible hybrid AMD multi-core design pairing an Athlon 64 CPU with an ATI GPU over a crossbar (XBAR), sharing the HyperTransport link and memory controller]

  • Graphics Processing Unit (GPU)

    GPUs Do Pixel Pushing and Matrix Math

    From the NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, 11/29/2007

  • CUDA Programming Model

    Data Parallel, But Not Loop Parallel

    Very Lightweight Threads

    Write Code from Thread's Point of View

    No Shared Memory
    Host Copies Data To and From Device

    Different Memories

    Hundreds of Parallel Threads (Sort-of SIMD)

    [Diagram: a block of threads]
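
    A minimal CUDA sketch of this model: the host allocates separate device memory, copies the input over, launches many lightweight threads that each handle one element (the code is written from the thread's point of view), and copies the result back. The kernel name, array size, and scale factor are assumed for illustration:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* each thread scales one element; no loop, one iteration per thread */
    __global__ void scale(float *data, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;                     /* assumed problem size */
        size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;                             /* device memory is separate from host memory */
        cudaMalloc(&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        /* hundreds of threads per block, enough blocks to cover the array */
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        free(h_data);
        return 0;
    }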

  • Finite Difference, Time Domain (FDTD) Electromagnetic Simulation

    Fields stored in 6 3D Arrays:

    EX, EY, EZ, HX, HY, HZ

    2 other coefficient arrays

    O(N³) algorithm

    /* update magnetic field */
    for (i = 1; i < nx - 1; i++)
      for (j = 1; j < ny - 1; j++) {   /* combining Y loop */
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
          ELEM_TYPE tmpx = rx*invmu;
          ELEM_TYPE tmpy = ry*invmu;
          ELEM_TYPE tmpz = rz*invmu;
          AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
          AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
          AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
        /* now update the electric field */
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invep = 1.0f/AR_INDEX(ep,i,j,k);
          ELEM_TYPE tmpx = rx*invep;
          ELEM_TYPE tmpy = ry*invep;
          ELEM_TYPE tmpz = rz*invep;
          AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i,j-1,k)) -
                                tmpz * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i,j,k-1));
          AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j,k-1)) -
                                tmpx * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i-1,j,k));
          AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i-1,j,k)) -
                                tmpy * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j-1,k));
        }
      }

  • OpenMP FDTD

    /* update magnetic field */
    #pragma omp parallel for private(i,j,k) \
                             shared(hx,hy,hz,ex,ey,ez)
    for (i = 1; i < nx - 1; i++)
      for (j = 1; j < ny - 1; j++) {
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
          ELEM_TYPE tmpx = rx*invmu;
          ELEM_TYPE tmpy = ry*invmu;
          ELEM_TYPE tmpz = rz*invmu;
          AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
          AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
          AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
      }

    A second, identical directive precedes the separate electric-field loop:

    #pragma omp parallel for private(i,j,k) \
                             shared(hx,hy,hz,ex,ey,ez)

    Machine      Sequential (s)   OpenMP (s)   Speedup
    Core2Quad    26.7             8.8          3.0
    PS3          28.6             26.5         1.1

    Because of Dependences, Need to Split E & M Loops

  • CUDA FDTD

    // Block index
    int bx = blockIdx.x;

    // because CUDA does not support 3D grids, we'll use the y block value to
    // fake it. Unfortunately, this needs mod and divide (or a lookup table).
    int by, bz;
    by = blockIdx.y / gridZ;
    bz = blockIdx.y % gridZ;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;

    // this thread's indices in the global matrix (skip outer boundaries)
    int i = bx * THREAD_COUNT_X + tx + 1;
    int j = by * THREAD_COUNT_Y + ty + 1;
    int k = bz * THREAD_COUNT_Z + tz + 1;

    At this point, the thread knows where it maps in problem space.

    Predefined variables provide the thread's position in the block and the block's position in the grid.
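
    A sketch of the host-side launch configuration that the indexing above assumes: because the grid is only two-dimensional, the z block count (gridZ) is folded into the grid's y dimension, and the kernel recovers by and bz with the divide and mod shown above. The kernel name fdtd_kernel, the block shape values, and the launcher function are assumed for illustration:

    /* assumed thread-block shape, matching the THREAD_COUNT_* names above */
    #define THREAD_COUNT_X 16
    #define THREAD_COUNT_Y 4
    #define THREAD_COUNT_Z 4

    /* hypothetical kernel, declared here only to show the launch */
    __global__ void fdtd_kernel(float *ex, float *ey, float *ez,
                                float *hx, float *hy, float *hz, int gridZ);

    void launch_fdtd(int nx, int ny, int nz,
                     float *d_ex, float *d_ey, float *d_ez,
                     float *d_hx, float *d_hy, float *d_hz)
    {
        /* interior points only (boundaries skipped); assumed to divide evenly */
        int gridX = (nx - 2) / THREAD_COUNT_X;
        int gridY = (ny - 2) / THREAD_COUNT_Y;
        int gridZ = (nz - 2) / THREAD_COUNT_Z;

        dim3 block(THREAD_COUNT_X, THREAD_COUNT_Y, THREAD_COUNT_Z);
        /* fold the z block count into grid.y; the kernel undoes this with
           by = blockIdx.y / gridZ and bz = blockIdx.y % gridZ */
        dim3 grid(gridX, gridY * gridZ);

        fdtd_kernel<<<grid, block>>>(d_ex, d_ey, d_ez, d_hx, d_hy, d_hz, gridZ);
    }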

  • CUDA FDTD (cont.)

    float invep = 1.0f/AR_INDEX(d_ep,i,j,k);
    float tmpx = d_rx*invep;
    float tmpy = d_ry*invep;
    float tmpz = d_rz*invep;
    float* ex = d_ex;  float* ey = d_ey;  float* ez = d_ez;
    AR_INDEX(ex,i,j,k) += tmpy*(AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i,j-1,k)) -
                          tmpz*(AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i,j,k-1));
    AR_INDEX(ey,i,j,k) += tmpz*(AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j,k-1)) -
                          tmpx*(AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i-1,j,k));
    AR_INDEX(ez,i,j,k) += tmpx*(AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i-1,j,k)) -
                          tmpy*(AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j-1,k));

    Code looks normal, but:
    Only one iteration per thread
    Arrays reside in device global memory

    Machine            Code Style   Time (s)   Speedup
    Core2Quad          Sequential   26.5       1
    Core2Quad          OpenMP       8.8        3.0
    GeForce 9800GX2    CUDA         7.4        3.6

  • Better CUDA FDTD

    Problem: CUDA Memory Accesses Slow

    Problem: Not Much Work Done Per Thread

    Idea 1: Copy Data to Fast Shared Memory
    Allows Reuse By Neighbor Threads
    Takes Advantage of Coalesced Memory Accesses
    (see the sketch after the results table below)

    Idea 2: Compute Multiple Elements Per Thread
    But Each Block Only Gets 16 KB Shared Space
    Barely Enough for 2 Points at 1x16x16 Threads

    Machine            Code Style       Time (s)   Speedup
    Core2Quad          Sequential       26.5       1
    Core2Quad          OpenMP           8.8        3.0
    GeForce 9800GX2    CUDA             7.4        3.6
    GeForce 9800GX2    CUDA (faster)    6.5        4.1
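
    A minimal sketch of Idea 1 on a simple one-dimensional stencil (not the FDTD kernel itself): each block stages a tile of global memory, plus one halo point on each side, into fast __shared__ memory with coalesced loads, so neighboring threads reuse each other's data. TILE, the kernel name, and the assumption that the interior size divides evenly by the block size are all illustrative:

    #define TILE 256   /* assumed threads per block; (n - 2) assumed to be a multiple of TILE */

    __global__ void stencil_shared(const float *in, float *out)
    {
        __shared__ float tile[TILE + 2];                      /* one halo element per side */

        int gi = blockIdx.x * blockDim.x + threadIdx.x + 1;   /* skip the boundary point */
        int li = threadIdx.x + 1;                             /* index inside the tile */

        tile[li] = in[gi];                                    /* coalesced load of the tile */
        if (threadIdx.x == 0)              tile[0]        = in[gi - 1];   /* left halo  */
        if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = in[gi + 1];   /* right halo */
        __syncthreads();                                      /* tile is ready for reuse */

        /* each output uses values already staged by neighboring threads */
        out[gi] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;
    }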

  • Cell Broadband Engine

    PowerPC Processing Element with Simultaneous Multithreading at 3.2 GHz
