Asymmetric Multiprocessing

  • Asymmetric Multiprocessing for High Performance Computing

    Stephen Jenks
    Electrical Engineering and Computer Science
    UC Irvine
    sjenks@uci.edu

    Source: NVIDIA

  • Overview and Motivation

    Conventional Parallel Computing Was:
    For scientists with large problem sets
    Complex and difficult
    Very expensive computers with limited access

    Parallel Computing is Becoming:
    Ubiquitous
    Cheap
    Essential
    Different
    Still complex and difficult

    Where Is Parallel Computing Going?

    And What are the Research Challenges?

  • Computing is Changing

    Parallel programming is essential

    Clock speed not increasing much

    Performance gains require parallelism

    Parallelism is changing

    Special purpose parallel engines

    CPU and parallel engine work together

    Different code on CPU & parallel engine: Asymmetric Computing

  • Conventional Processor Architecture Hasn't Changed Much For 40 Years

    Pipelining and Superscalar Since the 1960s

    But has become integrated

    Microprocessors:

    High Clock Speed

    Great performance

    High Power

    Cooling Issues

    Various Solutions

    From Hennessy & Patterson, 2nd Ed.

  • Parallel Computing Problem

    Overview

    Image Relaxation (Blur)

    newimage[i][j] = (image[i][j] + image[i][j-1] + image[i][j+1] +
                      image[i+1][j] + image[i-1][j]) / 5

    Stencil
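
    A minimal C sketch of one relaxation pass over the interior of the image, following the stencil above. The buffer names, WIDTH, and HEIGHT are assumed here for illustration and are not from the slides:

    #define WIDTH  1024   /* assumed image size, for illustration only */
    #define HEIGHT 1024

    /* one blur pass: each interior point becomes the average of itself
       and its four neighbors, written to a separate output buffer */
    void relax(float image[HEIGHT][WIDTH], float newimage[HEIGHT][WIDTH])
    {
        /* skip the outer boundary so every neighbor access stays in bounds */
        for (int i = 1; i < HEIGHT - 1; i++)
            for (int j = 1; j < WIDTH - 1; j++)
                newimage[i][j] = (image[i][j] +
                                  image[i][j-1] + image[i][j+1] +
                                  image[i+1][j] + image[i-1][j]) / 5.0f;
    }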

  • Shared Memory Multiprocessors

    [Diagram: four CPUs, each with its own cache, connected to a shared memory]

    Each CPU computes results for its partition
    Memory is shared, so dependences are satisfied

    CPUs see common address space

  • Shared Memory Programming

    Threads

    POSIX Threads

    Windows Threads

    OpenMP

    User-inserted directives to compiler

    Loop parallelism

    Parallel regions

    Visual Studio

    GCC 4.2

    Not Integrated with Compiler or Language

    No idea if code is in thread or not
    Poor optimizations

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        ...
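
    For comparison with the OpenMP directive above, here is a minimal POSIX Threads sketch that partitions the iterations of a loop across worker threads. The thread count, row count, and worker body are assumed for illustration only:

    #include <pthread.h>

    #define NUM_THREADS 4          /* assumed thread count, for illustration */
    #define NROWS       1024       /* assumed problem size */

    /* each thread processes a contiguous partition of the rows */
    typedef struct { int start, end; } range_t;

    static void *worker(void *arg)
    {
        range_t *r = (range_t *)arg;
        for (int i = r->start; i < r->end; i++) {
            /* ... compute results for row i; shared arrays are visible to all threads ... */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NUM_THREADS];
        range_t   part[NUM_THREADS];
        int chunk = NROWS / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            part[t].start = t * chunk;
            part[t].end   = (t == NUM_THREADS - 1) ? NROWS : (t + 1) * chunk;
            pthread_create(&tid[t], NULL, worker, &part[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }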

  • Multicore Processors

    Several CPU Cores Per Chip

    Shared Memory
    Shared Caches (sometimes)
    Lower Clock Speed

    Lower Power & Heat, But Good Performance

    Program with Threads

    Single-Threaded Code Not Faster
    (The Majority of Code Today)

    Intel Core Duo

    AMD Athlon 64 X2

  • Conventional Processors are Dinosaurs

    So much circuitry dedicated to keeping ALUs fed:

    Cache

    Out-of-order execution/reorder buffer

    Branch prediction

    Large Register Sets

    Simultaneous Multithreading

    ALU tiny by comparison

    Huge power for little performance gain

    With thanks to Stanford's Pat Hanrahan for the analogy

  • AMD Phenom X4 Floorplan

    Source: AMD.com

  • Single Nehalem (Intel Core i7) Core

    Nehalem - Everything You Need to Know about Intel's New Architecture

    http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382

  • NVIDIA GPU Floorplan

    Source: Dr. Sumit Gupta - NVIDIA

  • Asymmetric Parallel Accelerators

    Current Cores are Powerful and General
    Some Applications Only Need Certain Operations
    Perhaps a Simpler Processor Could Be Faster

    Pair General CPUs with Specialized Accelerators:
    Graphics Processing Unit (GPU)
    Field Programmable Gate Array (FPGA)
    Single Instruction, Multiple Data (SIMD) Processor

    [Diagram: possible hybrid AMD multi-core design pairing an Athlon 64 CPU with an ATI GPU over a crossbar (XBAR), sharing the HyperTransport link and memory controller]

  • Graphics Processing Unit (GPU)

    GPUs Do Pixel Pushing and Matrix Math

    From the NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide, 11/29/2007

  • CUDA Programming Model

    Data Parallel, But Not Loop Parallel

    Very Lightweight Threads

    Write Code from Thread's Point of View

    No Shared Memory
    Host Copies Data To and From Device

    Different Memories

    Hundreds of Parallel Threads (Sort-of SIMD)

    [Diagram: a block of threads]
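
    A minimal CUDA sketch of this model: the host allocates separate device memory, copies the input over, launches many lightweight threads that each handle one element (the code is written from the thread's point of view), and copies the result back. The kernel name, array size, and scale factor are assumed for illustration:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* each thread scales one element; no loop, one iteration per thread */
    __global__ void scale(float *data, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            data[idx] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;                     /* assumed problem size */
        size_t bytes = n * sizeof(float);
        float *h_data = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) h_data[i] = (float)i;

        float *d_data;                             /* device memory is separate from host memory */
        cudaMalloc(&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        /* hundreds of threads per block, enough blocks to cover the array */
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_data);
        free(h_data);
        return 0;
    }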

  • Finite Difference, Time Domain (FDTD) Electromagnetic Simulation

    Fields stored in 6 3D Arrays:

    EX, EY, EZ, HX, HY, HZ

    2 other coefficient arrays

    O(N³) algorithm

    /* update magnetic field */
    for (i = 1; i < nx - 1; i++)
      for (j = 1; j < ny - 1; j++) {   /* combining Y loop */
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
          ELEM_TYPE tmpx = rx*invmu;
          ELEM_TYPE tmpy = ry*invmu;
          ELEM_TYPE tmpz = rz*invmu;
          AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
          AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
          AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
        /* now update the electric field */
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invep = 1.0f/AR_INDEX(ep,i,j,k);
          ELEM_TYPE tmpx = rx*invep;
          ELEM_TYPE tmpy = ry*invep;
          ELEM_TYPE tmpz = rz*invep;
          AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i,j-1,k)) -
                                tmpz * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i,j,k-1));
          AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j,k-1)) -
                                tmpx * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i-1,j,k));
          AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i-1,j,k)) -
                                tmpy * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j-1,k));
        }
      }

  • OpenMP FDTD

    /* update magnetic field */
    #pragma omp parallel for private(i,j,k) \
                             shared(hx,hy,hz,ex,ey,ez)
    for (i = 1; i < nx - 1; i++)
      for (j = 1; j < ny - 1; j++) {
        for (k = 1; k < nz - 1; k++) {
          ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
          ELEM_TYPE tmpx = rx*invmu;
          ELEM_TYPE tmpy = ry*invmu;
          ELEM_TYPE tmpz = rz*invmu;
          AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
                                tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
          AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
                                tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
          AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
                                tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
        }
      }

    A second, identical directive precedes the separate electric-field loop:

    #pragma omp parallel for private(i,j,k) \
                             shared(hx,hy,hz,ex,ey,ez)

    Machine      Sequential (s)   OpenMP (s)   Speedup
    Core2Quad    26.7             8.8          3.0
    PS3          28.6             26.5         1.1

    Because of Dependences, Need to Split E & M Loops

  • CUDA FDTD

    // Block index
    int bx = blockIdx.x;

    // because CUDA does not support 3D grids, we'll use the y block value to
    // fake it. Unfortunately, this needs mod and divide (or a lookup table).
    int by, bz;
    by = blockIdx.y / gridZ;
    bz = blockIdx.y % gridZ;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int tz = threadIdx.z;

    // this thread's indices in the global matrix (skip outer boundaries)
    int i = bx * THREAD_COUNT_X + tx + 1;
    int j = by * THREAD_COUNT_Y + ty + 1;
    int k = bz * THREAD_COUNT_Z + tz + 1;

    At this point, the thread knows where it maps in problem space.

    Predefined variables provide the thread's position in the block and the block's position in the grid.
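
    A sketch of the host-side launch configuration that the indexing above assumes: because the grid is only two-dimensional, the z block count (gridZ) is folded into the grid's y dimension, and the kernel recovers by and bz with the divide and mod shown above. The kernel name fdtd_kernel, the block shape values, and the launcher function are assumed for illustration:

    /* assumed thread-block shape, matching the THREAD_COUNT_* names above */
    #define THREAD_COUNT_X 16
    #define THREAD_COUNT_Y 4
    #define THREAD_COUNT_Z 4

    /* hypothetical kernel, declared here only to show the launch */
    __global__ void fdtd_kernel(float *ex, float *ey, float *ez,
                                float *hx, float *hy, float *hz, int gridZ);

    void launch_fdtd(int nx, int ny, int nz,
                     float *d_ex, float *d_ey, float *d_ez,
                     float *d_hx, float *d_hy, float *d_hz)
    {
        /* interior points only (boundaries skipped); assumed to divide evenly */
        int gridX = (nx - 2) / THREAD_COUNT_X;
        int gridY = (ny - 2) / THREAD_COUNT_Y;
        int gridZ = (nz - 2) / THREAD_COUNT_Z;

        dim3 block(THREAD_COUNT_X, THREAD_COUNT_Y, THREAD_COUNT_Z);
        /* fold the z block count into grid.y; the kernel undoes this with
           by = blockIdx.y / gridZ and bz = blockIdx.y % gridZ */
        dim3 grid(gridX, gridY * gridZ);

        fdtd_kernel<<<grid, block>>>(d_ex, d_ey, d_ez, d_hx, d_hy, d_hz, gridZ);
    }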

  • CUDA FDTD (cont.)

    float invep = 1.0f/AR_INDEX(d_ep,i,j,k);
    float tmpx = d_rx*invep;
    float tmpy = d_ry*invep;
    float tmpz = d_rz*invep;
    float* ex = d_ex;  float* ey = d_ey;  float* ez = d_ez;
    AR_INDEX(ex,i,j,k) += tmpy*(AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i,j-1,k)) -
                          tmpz*(AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i,j,k-1));
    AR_INDEX(ey,i,j,k) += tmpz*(AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j,k-1)) -
                          tmpx*(AR_INDEX(d_hz,i,j,k) - AR_INDEX(d_hz,i-1,j,k));
    AR_INDEX(ez,i,j,k) += tmpx*(AR_INDEX(d_hy,i,j,k) - AR_INDEX(d_hy,i-1,j,k)) -
                          tmpy*(AR_INDEX(d_hx,i,j,k) - AR_INDEX(d_hx,i,j-1,k));

    Code looks normal, but:
    Only one iteration per thread
    Arrays reside in device global memory

    Machine            Code Style   Time (s)   Speedup
    Core2Quad          Sequential   26.5       1
    Core2Quad          OpenMP       8.8        3.0
    GeForce 9800GX2    CUDA         7.4        3.6

  • Better CUDA FDTD

    Problem: CUDA Memory Accesses Slow

    Problem: Not Much Work Done Per Thread

    Idea 1: Copy Data to Fast Shared Memory
    Allows Reuse By Neighbor Threads
    Takes Advantage of Coalesced Memory Accesses
    (see the sketch after the results table below)

    Idea 2: Compute Multiple Elements Per Thread
    But Each Block Only Gets 16 KB Shared Space
    Barely Enough for 2 Points at 1x16x16 Threads

    Machine            Code Style       Time (s)   Speedup
    Core2Quad          Sequential       26.5       1
    Core2Quad          OpenMP           8.8        3.0
    GeForce 9800GX2    CUDA             7.4        3.6
    GeForce 9800GX2    CUDA (faster)    6.5        4.1
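
    A minimal sketch of Idea 1 on a simple one-dimensional stencil (not the FDTD kernel itself): each block stages a tile of global memory, plus one halo point on each side, into fast __shared__ memory with coalesced loads, so neighboring threads reuse each other's data. TILE, the kernel name, and the assumption that the interior size divides evenly by the block size are all illustrative:

    #define TILE 256   /* assumed threads per block; (n - 2) assumed to be a multiple of TILE */

    __global__ void stencil_shared(const float *in, float *out)
    {
        __shared__ float tile[TILE + 2];                      /* one halo element per side */

        int gi = blockIdx.x * blockDim.x + threadIdx.x + 1;   /* skip the boundary point */
        int li = threadIdx.x + 1;                             /* index inside the tile */

        tile[li] = in[gi];                                    /* coalesced load of the tile */
        if (threadIdx.x == 0)              tile[0]        = in[gi - 1];   /* left halo  */
        if (threadIdx.x == blockDim.x - 1) tile[TILE + 1] = in[gi + 1];   /* right halo */
        __syncthreads();                                      /* tile is ready for reuse */

        /* each output uses values already staged by neighboring threads */
        out[gi] = (tile[li - 1] + tile[li] + tile[li + 1]) / 3.0f;
    }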

  • Cell Broadband Engine

    PowerPC Processing Element with Simultaneous Multithreading at 3.2 GHz
