Asymmetric Multiprocessing for High Performance Computing
Stephen Jenks
Electrical Engineering and Computer Science
UC Irvine
[email protected]
Source: NVIDIA
Overview and Motivation
Conventional Parallel Computing Was:
◦ For scientists with large problem sets
◦ Complex and difficult
◦ Very expensive computers with limited access
Parallel Computing is Becoming:
◦ Ubiquitous
◦ Cheap
◦ Essential
◦ Different
◦ Still complex and difficult
Where Is Parallel Computing Going?
And What are the Research Challenges?
2/25/2009 Stephen Jenks 2
Computing is Changing
Parallel programming is essential
◦ Clock speed not increasing much
◦ Performance gains require parallelism
Parallelism is changing
◦ Special purpose parallel engines
◦ CPU and parallel engine work together
◦ Different code on CPU & parallel engine
Asymmetric Computing
2/25/2009 Stephen Jenks 3
Conventional Processor
Architecture Hasn’t Changed Much For 40 Years
◦ Pipelining and Superscalar Since the 1960s
◦ But has become integrated into microprocessors
High Clock Speed
◦ Great performance
◦ High Power
◦ Cooling Issues
◦ Various Solutions
2/25/2009 Stephen Jenks 4
From Hennessy & Patterson, 2nd Ed.
Parallel Computing Problem Overview
2/25/2009 Stephen Jenks 5
Image Relaxation (Blur)
newimage[i][j] = (image[i][j] +
image[i][j-1] + image[i][j+1] +
image[i+1][j] + image[i-1][j]) / 5
Stencil
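A minimal C sketch (not from the slides) of the full relaxation pass this update implies; the function name, bounds handling, and out-of-place update are illustrative:

/* Illustrative relaxation pass: average each interior pixel with its four neighbors. */
void relax(int n, float image[n][n], float newimage[n][n])
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            newimage[i][j] = (image[i][j] +
                              image[i][j-1] + image[i][j+1] +
                              image[i+1][j] + image[i-1][j]) / 5.0f;
}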
Shared Memory Multiprocessors
2/25/2009 Stephen Jenks 6
[Diagram: four CPUs, each with its own cache, connected to a shared memory]
• Each CPU computes results for its partition
• Memory is shared so dependences satisfied
CPUs see common address space
Shared Memory Programming
Threads
◦ POSIX Threads
◦ Windows Threads
OpenMP
◦ User-inserted directives to compiler
◦ Loop parallelism
◦ Parallel regions
◦ Visual Studio
◦ GCC 4.2
2/25/2009 Stephen Jenks 7
Not Integrated with Compiler or Language
No idea if code is in thread or not
Poor optimizations
#pragma omp parallel for
for (i=0; i<n; i++)
a[i] = b[i] * c[i];
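For comparison with the OpenMP fragment above, here is a hedged POSIX Threads sketch (not from the slides) of the same loop split into per-thread partitions; the struct, helper names, and thread count are illustrative.

#include <stddef.h>
#include <pthread.h>

typedef struct { int lo, hi; const float *b, *c; float *a; } part_t;  /* one thread's slice */

static void *worker(void *arg)
{
    part_t *p = (part_t *)arg;
    for (int i = p->lo; i < p->hi; i++)     /* each thread touches only its own partition */
        p->a[i] = p->b[i] * p->c[i];
    return NULL;
}

static void run(int n, int nthreads, float *a, const float *b, const float *c)
{
    pthread_t tid[nthreads];
    part_t part[nthreads];
    for (int t = 0; t < nthreads; t++) {
        part[t] = (part_t){ t * n / nthreads, (t + 1) * n / nthreads, b, c, a };
        pthread_create(&tid[t], NULL, worker, &part[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);         /* shared memory: results visible after join */
}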
Multicore Processors
Several CPU Cores Per Chip
Shared Memory
Shared Caches (sometimes)
Lower Clock Speed
◦ Lower Power & Heat
◦ But Good Performance
Program with Threads
Single Threaded Code
◦ Not Faster
◦ Majority of Code Today
2/25/2009 Stephen Jenks 8
Intel Core Duo
AMD Athlon 64 X2
Conventional Processors are Dinosaurs
So much circuitry dedicated to keeping ALUs fed:
◦ Cache
◦ Out-of-order execution/reorder buffer
◦ Branch prediction
◦ Large Register Sets
◦ Simultaneous Multithreading
ALU tiny by comparison
Huge power for little performance gain
2/25/2009 Stephen Jenks 9
With thanks to Stanford’s Pat Hanrahan for the analogy
AMD Phenom X4 Floorplan
2/25/2009 Stephen Jenks 10
Source: AMD.com
Single Nehalem (Intel i7) Core
2/25/2009 Stephen Jenks 11
Nehalem - Everything You Need to Know about Intel's New Architecture
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3382
NVIDIA GPU Floorplan
2/25/2009 Stephen Jenks 12
Source: Dr. Sumit Gupta - NVIDIA
Asymmetric Parallel Accelerators
Current Cores are Powerful and General
◦ Some Applications Only Need Certain Operations
◦ Perhaps a Simpler Processor Could Be Faster
Pair General CPUs with Specialized Accelerators
◦ Graphics Processing Unit
◦ Field Programmable Gate Array (FPGA)
◦ Single Instruction, Multiple Data (SIMD) Processor
2/25/2009 Stephen Jenks 13
[Diagram: possible hybrid AMD multi-core design with an Athlon 64 CPU and an ATI GPU connected through a crossbar (XBAR) to a HyperTransport link and memory controller]
Graphics Processing Unit (GPU)
GPUs Do Pixel Pushing and Matrix Math
2/25/2009 Stephen Jenks 14
From NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 11/29/2007
CUDA Programming Model
Data Parallel
◦ But not Loop Parallel
◦ Very Lightweight Threads
◦ Write Code from Thread’s Point of View
No Shared Memory
◦ Host Copies Data To and From Device
◦ Different Memories
Hundreds of Parallel Threads (Sort-of SIMD)
2/25/2009 Stephen Jenks 15
[Diagram: a block of threads within the grid]
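A hedged host-side sketch of the copy-in / launch / copy-out pattern this model requires; the kernel, variable names, and block size are illustrative, but the CUDA runtime calls are the standard ones.

#include <cuda_runtime.h>

/* Illustrative kernel: each lightweight thread handles one element. */
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void runOnDevice(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void **)&d_data, bytes);                        /* device global memory */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  /* host copies data to device */
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);           /* grid of blocks of threads */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  /* and copies results back */
    cudaFree(d_data);
}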
Finite Difference, Time Domain (FDTD) Electromagnetic Simulation
2/25/2009 Stephen Jenks 16
Fields stored in 6 3D Arrays:
EX, EY, EZ, HX, HY, HZ
2 other coefficient arrays
O(N³) algorithm
/* update magnetic field */
for (i = 1; i < nx - 1; i++)
for (j = 1; j < ny - 1; j++) { /* combining Y loop */
for (k = 1; k < nz - 1; k++) {
ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
ELEM_TYPE tmpx = rx*invmu;
ELEM_TYPE tmpy = ry*invmu;
ELEM_TYPE tmpz = rz*invmu;
AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
}
/* now update the electric field */
for (k = 1; k < nz - 1; k++) {
ELEM_TYPE invep = 1.0f/AR_INDEX(ep,i,j,k);
ELEM_TYPE tmpx = rx*invep;
ELEM_TYPE tmpy = ry*invep;
ELEM_TYPE tmpz = rz*invep;
AR_INDEX(ex,i,j,k) += tmpy * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i,j-1,k)) -
tmpz * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i,j,k-1));
AR_INDEX(ey,i,j,k) += tmpz * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j,k-1)) -
tmpx * (AR_INDEX(hz,i,j,k) - AR_INDEX(hz,i-1,j,k));
AR_INDEX(ez,i,j,k) += tmpx * (AR_INDEX(hy,i,j,k) - AR_INDEX(hy,i-1,j,k)) -
tmpy * (AR_INDEX(hx,i,j,k) - AR_INDEX(hx,i,j-1,k));
}
}
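AR_INDEX is not defined on the slides; presumably it linearizes (i,j,k) into the flat field arrays. A plausible definition, assuming nx, ny, nz are in scope, would be:

/* Assumed 3-D indexing macro over a flat array of nx*ny*nz elements (not from the slides). */
#define AR_INDEX(arr, i, j, k) ((arr)[((i) * ny + (j)) * nz + (k)])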
OpenMP FDTD
2/25/2009 Stephen Jenks 17
/* update magnetic field */
#pragma omp parallel for private(i,j,k) \
shared(hx,hy,hz,ex,ey,ez)
for (i = 1; i < nx - 1; i++)
for (j = 1; j < ny - 1; j++) {
for (k = 1; k < nz - 1; k++) {
ELEM_TYPE invmu = 1.0f/AR_INDEX(mu,i,j,k);
ELEM_TYPE tmpx = rx*invmu;
ELEM_TYPE tmpy = ry*invmu;
ELEM_TYPE tmpz = rz*invmu;
AR_INDEX(hx,i,j,k) += tmpz * (AR_INDEX(ey,i,j,k+1) - AR_INDEX(ey,i,j,k)) -
tmpy * (AR_INDEX(ez,i,j+1,k) - AR_INDEX(ez,i,j,k));
AR_INDEX(hy,i,j,k) += tmpx * (AR_INDEX(ez,i+1,j,k) - AR_INDEX(ez,i,j,k)) -
tmpz * (AR_INDEX(ex,i,j,k+1) - AR_INDEX(ex,i,j,k));
AR_INDEX(hz,i,j,k) += tmpy * (AR_INDEX(ex,i,j+1,k) - AR_INDEX(ex,i,j,k)) -
tmpx * (AR_INDEX(ey,i+1,j,k) - AR_INDEX(ey,i,j,k));
}
}
#pragma omp parallel for private(i,j,k) \
shared(hx,hy,hz,ex,ey,ez)
Run Times (s)     Sequential   OpenMP   Speedup
Core2Quad               26.7      8.8       3.0
PS3                     28.6     26.5       1.1
Because of Dependences, Need to Split E & M Loops
CUDA FDTD
2/25/2009 Stephen Jenks 18
// Block index
int bx = blockIdx.x;
// because CUDA does not support 3D grids, we'll use the y block value to
// fake it. Unfortunately, this needs mod and divide (or a lookup table).
int by, bz;
by = blockIdx.y / gridZ;
bz = blockIdx.y % gridZ;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;
int tz = threadIdx.z;
// this thread's indices in the global matrix (skip outer boundaries)
int i = bx * THREAD_COUNT_X + tx + 1;
int j = by * THREAD_COUNT_Y + ty + 1;
int k = bz * THREAD_COUNT_Z + tz + 1;
At this point, the thread knows where it maps in problem space.
Predefined variables provide the thread's position in its block and the block's position in the grid.
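A hedged sketch of the matching host-side launch configuration (the kernel name and arguments are illustrative): the z block count is folded into grid.y because CUDA grids of this era are two-dimensional, which is why the kernel needs the divide and mod above.

dim3 threads(THREAD_COUNT_X, THREAD_COUNT_Y, THREAD_COUNT_Z);          /* 3-D thread block is allowed */
dim3 grid(gridX, gridY * gridZ);                                       /* fake the 3rd grid dimension */
fdtdKernel<<<grid, threads>>>(d_ex, d_ey, d_ez, d_hx, d_hy, d_hz);     /* kernel & args illustrative */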
CUDA FDTD (cont.)
2/25/2009 Stephen Jenks 19
float invep = 1.0f/AR_INDEX(d_ep,i,j,k);
float tmpx = d_rx*invep;
float tmpy = d_ry*invep;
float tmpz = d_rz*invep;
float* ex = d_ex; float* ey = d_ey; float* ez = d_ez;
AR_INDEX(ex,i,j,k) += tmpy*(AR_INDEX(d_hz,i,j,k)-AR_INDEX(d_hz,i,j-1,k))-
tmpz*(AR_INDEX(d_hy,i,j,k)-AR_INDEX(d_hy,i,j,k-1));
AR_INDEX(ey,i,j,k) += tmpz*(AR_INDEX(d_hx,i,j,k)-AR_INDEX(d_hx,i,j,k-1))-
tmpx*(AR_INDEX(d_hz,i,j,k)-AR_INDEX(d_hz,i-1,j,k));
AR_INDEX(ez,i,j,k) += tmpx*(AR_INDEX(d_hy,i,j,k)-AR_INDEX(d_hy,i-1,j,k))-
tmpy*(AR_INDEX(d_hx,i,j,k)-AR_INDEX(d_hx,i,j-1,k));
Code looks normal, but:
Only one “iteration”
Arrays reside in device global memory
Machine           Code Style   Time (s)   Speedup
Core2Quad         Sequential       26.5       1
Core2Quad         OpenMP            8.8       3.0
GeForce 9800GX2   CUDA              7.4       3.6
Better CUDA FDTD
Problem: CUDA Memory Accesses Slow
Problem: Not Much Work Done Per Thread
Idea 1: Copy Data to Fast Shared Memory (see the sketch after the results table below)
◦ Allows Reuse By Neighbor Threads
◦ Takes Advantage of Coalesced Memory Accesses
Idea 2: Compute Multiple Elements Per Thread
◦ But Each Block Only Gets 16KB Shared Space
◦ Barely Enough for 2 Points at 1x16x16 Threads
2/25/2009 Stephen Jenks 20
Machine           Code Style      Time (s)   Speedup
Core2Quad         Sequential          26.5       1
Core2Quad         OpenMP               8.8       3.0
GeForce 9800GX2   CUDA                 7.4       3.6
GeForce 9800GX2   CUDA (faster)        6.5       4.1
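A rough sketch of Idea 1 (an illustration, not the tuned kernel behind the 6.5 s result): each 16x16 block stages a tile of one field in on-chip shared memory so neighboring threads reuse loaded values instead of re-reading device global memory.

__global__ void stageTile(const float *d_field, float *d_out, int nx, int ny)
{
    __shared__ float tile[16][17];            /* 16x16 tile; extra column avoids bank conflicts */

    int i = blockIdx.y * 16 + threadIdx.y;    /* assumes a 16x16 thread block */
    int j = blockIdx.x * 16 + threadIdx.x;

    if (i < ny && j < nx)
        tile[threadIdx.y][threadIdx.x] = d_field[i * nx + j];   /* coalesced global load */
    __syncthreads();                                            /* tile now visible to the block */

    /* Neighbor reads now come from shared memory instead of global memory. */
    if (threadIdx.x > 0 && i < ny && j < nx)
        d_out[i * nx + j] = tile[threadIdx.y][threadIdx.x] - tile[threadIdx.y][threadIdx.x - 1];
}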
Cell Broadband Engine
PowerPC Processing Element with Simultaneous Multithreading at 3.2 GHz
8 Synergistic Processing Elements at 3.2 GHz
◦ Optimized for SIMD/Vector processing (100 GFLOPS Total)
◦ 256KB Local Storage - no cache
4x16-byte-wide rings @ 96 bytes per clock cycle
2/25/2009 Stephen Jenks 21
From IBM Cell Broadband Engine Programmer Handbook, 10 May 2006
Cell Programming Model
New IBM Single Source Compiler Supports OpenMP
PPE & SPEs Run Separate Programs
◦ Programmed in C
◦ Explicitly Move Memory
◦ Hard to Debug SPE Code
◦ SPE Good at Arithmetic, not Branching
◦ PPE & SPEs Communicate via Mailboxes & Memory
◦ SPEs Communicate via Signals & Memory
Example SPE Code:
2/25/2009 Stephen Jenks 22
struct dma_list_elem inList[DMA_LIST_SIZE]
    __attribute__ ((aligned (8)));   /* DMA list elements must be 8-byte aligned */
struct dma_list_elem outList[DMA_LIST_SIZE]
    __attribute__ ((aligned (8)));

void SendSignal1(program_data* destPD, int value)
{
    /* Send the value to the destination SPE's Signal Notification 1 register */
    sig1_local.SPU_Sig_Notify_1 = value;
    mfc_sndsig(&(sig1_local.SPU_Sig_Notify_1),
               destPD->sig1_addr, 3 /* tag */,
               0, 0);
    mfc_sync(1<<3);
}
Research Agenda & Future Directions
Programming and Performance Being Pushed Back to Software Developers
◦ Big Problem!
◦ Software Less Well Tested/Developed Than Hardware
◦ Architecture Affects Performance In Strange Ways
◦ Compiler Solutions (IBM SSC) Don’t Always Work Well
Extract Top Performance From Multicore CPUs
Make Multicore Processors Better to Program
Explore & Exploit Asymmetric Computing
2/25/2009 Stephen Jenks 23
Parallel Microprocessor Problems
Memory interface too slow for 1 core/thread
Now multiple threads access memory simultaneously, overwhelming memory interface
Parallel programs can run as slowly as sequential ones!
2/25/2009 Stephen Jenks 24
[Diagram: Then, a single CPU with an L2 cache and system/memory interface to memory; Now, two CPUs sharing one L2 cache and the same system/memory interface]
SPPM: Producer/Consumer Parallelism Using The Cache
2/25/2009 Stephen Jenks 25
[Diagram: in the conventional approach, two threads each do half the work on the data in memory and hit the memory bottleneck; in SPPM, a producer thread and a consumer thread pass over the data in memory and communicate through the cache]
Multicore Architecture Improvements
Cores in Common Multicore Chips Not Well Connected
◦ Communicate Through Cache & Memory
◦ Synchronization Slow (OS-based) or Uses Spin Waits
Register-Based Synchronization
◦ Shared Registers Between Cores
◦ Halt Processor While Waiting To Save Power
Preemptive Communications (Prepushing)
◦ Reduces Latency over Demand-Based Fetches
◦ Cuts Cache Coherence Traffic/Activity
Software Controlled Eviction
◦ Manages Shared Caches Using Explicit Operations
◦ Move Data From Private Cache to Shared Cache Before It Is Needed
Synchronization Engine
◦ Hardware-based multicore synchronization operations
2/25/2009 Stephen Jenks 26
Asymmetric Multicore Processors
Intel & AMD Likely to Add GPUs to CPUs in Multicore Processors
Connectivity Better than PCIe
Cache Access Possible
◦ Shared Memory Likely
◦ Programming Slightly Better
Multiple CPU Classes Possible
◦ Fast & Powerful for Some Apps
◦ Simple & Power Efficient for Most
Thoughts:
◦ Not many multithreaded apps, so don't need many CPUs
◦ Application accelerators speed up multimedia, etc.
◦ Probably Cost Effective to Integrate CPU & GPU
2/25/2009 Stephen Jenks 27
[Diagram (same as the earlier hybrid design): possible hybrid AMD multicore design with an Athlon 64 CPU and an ATI GPU connected through a crossbar (XBAR) to a HyperTransport link and memory controller]
FPGA Application Accelerators
Cray and SGI Add FPGA Application Accelerators to High End Parallel Machines
◦ FPGA Code Created Through VHDL or Other Compiler
◦ Can Speed Up Some Applications
My Thoughts
◦ FPGA Great for Prototyping
◦ General & Flexible
◦ But Have Slow Clock Speed
◦ So Will Be Quickly Surpassed By GPUs & Special Purpose CPUs
2/25/2009 Stephen Jenks 28
Intel Larrabee
2/25/2009 Stephen Jenks 29
[Diagram: many simple x86 cores, each paired with a slice of coherent L2 cache, connected by an interprocessor ring network to memory and I/O interfaces]
Many simple, fast, low power, in-order x86 cores
Larrabee: A Many-Core x86 Architecture for Visual Computing
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
OpenCL
New standard pushed by Apple & others (all major players)
Data Parallel
◦ N-Dimensional computational domain
◦ Work-groups – similar to CUDA block
◦ Work-items – similar to CUDA threads
Task Parallel
◦ Similar to threads, but single work-item
Supports GPU, multicore, Cell, etc.
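For flavor, a minimal OpenCL C kernel (an illustration, not from the slides): each work-item computes one element, and get_global_id plays the role of blockIdx * blockDim + threadIdx in CUDA.

__kernel void vmul(__global const float *b, __global const float *c,
                   __global float *a, const unsigned int n)
{
    size_t i = get_global_id(0);   /* this work-item's index in the N-dimensional domain */
    if (i < n)
        a[i] = b[i] * c[i];
}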
2/25/2009 Stephen Jenks 30
Summary
Parallelism Big Resurgence
◦ Ubiquitous
◦ Still Difficult (Programming Challenge Worries Intel, Microsoft, & IBM)
New Types Of Parallelism Becoming Widespread
◦ Multicore Processors: But They Suffer From Bottlenecks And Are Not Well Integrated
◦ GPU Processing: Hard to Program, Especially Hard to Get Good Performance
◦ Asymmetric Processor Cores: Hard to Program and Manage, Also Especially Hard to Get Good Performance
Many Architecture, Hardware, and Software Research Challenges
2/25/2009 Stephen Jenks 31
Resources
http://www.ibm.com/developerworks/power/cell - Cell Tools & Docs
http://www.nvidia.com/object/cuda_home.html - CUDA Tools & Docs
http://spds.ece.uci.edu/~sjenks/Pages/ParallelMicros.html - SPPM & Parallel Execution Models
http://web.mac.com/sfj/iWeb/Site/Welcome.html - Old Parallel Programming Blog (Source code for FDTD Examples)
http://newport.eecs.uci.edu/~sjenks - My homepage, where this presentation is posted
2/25/2009 Stephen Jenks 32