Challenges in Binary Translation for Desktop Supercomputing

David Kaeli, Rodrigo Dominguez

Department of Electrical and Computer Engineering, Northeastern University

Boston, MA

Current trends in Many-core Computing

• The CPU industry has elected to jump off the cycle-time scaling bandwagon
– Power/thermal constraints have become a limiting factor
• We now see CPU vendors placing multiple (tens of) cores on a single chip
– Clock speeds have not changed
– The memory wall persists, and multiple cores that assume a shared-memory model place further pressure on this problem
• Software vendors are looking for new parallelization technology
– Multi-core aware operating systems
– Semi-automatic parallelizing compilers

Current trends in Many-core Computing

There has been a renewed interest in parallel computing paradigms and languages

Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)

Heterogeneous systems are becoming a common theme

The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources

Graphics Processors

Graphics Processing Units

More than 64% of Americans played a video game in 2009

High-end: primarily used for 3-D rendering for videogame graphics and movie animation

Mid/low-end: primarily used for computer displays

Manufacturers include NVIDIA, AMD/ATI, IBM-Cell

Very competitive commodities market

GPU Performance

GPUs provide a path for performance growth

Cost and power usage numbers are also impressive

[Figure: GPU performance over time, annotated "Near-exponential growth in performance for GPUs". Source: NVIDIA, 2009]

Comparison of CPU and GPU Hardware Architectures

CPU: Cache heavy, focused on individual thread performance

GPU: ALU heavy, massively parallel, throughput-oriented

CPU/GPU Relationship

[Figure: the CPU (host) connected to a GPU with local DRAM (device)]

A wide range of GPU apps

3D image analysis, Adaptive radiation therapy, Acoustics, Astronomy, Audio, Automobile vision, Bioinformatics, Biological simulation, Broadcast, Cellular automata, Fluid dynamics, Computer vision, Cryptography, CT reconstruction, Data mining, Digital cinema / projections, Electromagnetic simulation, Equity training, Film, Financial, Languages, GIS, Holographic cinema, Machine learning, Mathematics research, Military, Mine planning, Molecular dynamics, MRI reconstruction, Multispectral imaging, N-body simulation, Network processing, Neural network, Oceanographic research, Optical inspection, Particle physics, Protein folding, Quantum chemistry, Ray tracing, Radar, Reservoir simulation, Robotic vision / AI, Robotic surgery, Satellite data analysis, Seismic imaging, Surgery simulation, Surveillance, Ultrasound, Video conferencing, Telescope, Video, Visualization, Wireless, X-Ray

GPU as a General Purpose Computing Platform

Speedups are impressive and ever increasing!

Application | Speedup
Genetic Algorithm | 2600x
Real Time Elimination of Undersampling Artifacts | 2300x
Lattice-Boltzmann Method for Numerical Fluid Mechanics | 1840x
Total Variation Modeling | 1000x
Fast Total Variation for Computer Vision | 1000x
Monte Carlo Simulation of Photon Migration | 1000x
Stochastic Differential Equations | 675x
K-Nearest Neighbor Search | 470x

Source: CUDA Zone at www.nvidia.com/cuda/

GPGPU is becoming mainstream research

Research activities are expanding significantly

[Figure: search results for the keyword "GPGPU" in IEEE and ACM]

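NVIDIA GT200 architecture

[Figure: block diagram of the Streaming Processor Array (ten Texture Processor Clusters, TPCs), one Texture Processor Cluster (Streaming Multiprocessors, SMs, plus a texture unit), and one Streaming Multiprocessor (streaming processors, SPs, and special function units, SFUs). The diagram is annotated with the thread hierarchy: grid of thread blocks; multiple thread blocks, many warps of threads; individual threads.]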

• 240 shader cores
• 1.4B transistors
• Up to 2GB onboard memory
• ~150GB/sec BW
• 1.06 SP TFLOPS
• CUDA and OpenCL support
• Programmable memory spaces
• Tesla S1070 provides 4 GPUs in a 1U unit

AMD/ATI Radeon HD 5870

• Codename "Evergreen"
• 1600 SIMD cores
• L1/L2 memory architecture
• 153GB/sec memory bandwidth
• 2.72 TFLOPS SP
• OpenCL and DirectX11
• Hidden memory microarchitecture
• Provides for vectorized operation

Comparison of CPU and GPU Hardware Architectures

CPU/GPU | Single-precision TFLOPs | Cores | GFLOPs/Watt | $/GFLOP
NVIDIA 285 | 1.06 | 240 | 5.8 | $3.12
NVIDIA 295 | 1.79 | 480 | 6.2 | $3.80
AMD HD 5870 | 2.72 | 1600 | 14.5 | $0.16
AMD HD 4890 | 1.36 | 800 | 7.2 | $0.18
Intel Core i7 965 | 0.051 | 4 | 0.39 | $11.02

Source: NVIDIA, AMD and Intel
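
The two efficiency columns can be turned back into absolute numbers, which makes the gap between devices easier to read. The following is a minimal sketch, assuming that peak GFLOPs is simply TFLOPs × 1000 and that both ratios on the slide were computed against that same peak. The struct and function names are illustrative and do not come from the talk.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// One row of the comparison table (values copied from the slide).
struct Device {
    std::string name;
    double sp_tflops;          // single-precision peak, in TFLOPs
    int cores;
    double gflops_per_watt;
    double dollars_per_gflop;
};

// Peak single-precision throughput in GFLOPs (assumption: TFLOPs * 1000).
double peak_gflops(const Device& d) { return d.sp_tflops * 1000.0; }

// Power implied by the table: GFLOPs / (GFLOPs per Watt) = Watts.
double implied_watts(const Device& d) { return peak_gflops(d) / d.gflops_per_watt; }

// Price implied by the table: GFLOPs * ($ per GFLOP) = $.
double implied_price(const Device& d) { return peak_gflops(d) * d.dollars_per_gflop; }

// The five rows of the slide, in the order they appear.
std::vector<Device> table() {
    return {
        {"NVIDIA 285",        1.06,  240,  5.8,   3.12},
        {"NVIDIA 295",        1.79,  480,  6.2,   3.80},
        {"AMD HD 5870",       2.72,  1600, 14.5,  0.16},
        {"AMD HD 4890",       1.36,  800,  7.2,   0.18},
        {"Intel Core i7 965", 0.051, 4,    0.39,  11.02},
    };
}
```

Under these assumptions the AMD HD 5870 row works out to roughly 188 W and $435, and the Intel row to roughly 131 W.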

AMD vs. NVIDIA

Aspect | AMD | NVIDIA
Hardware architecture | Vector | Scalar
Programming language | Brook+, IL, OpenCL | CUDA, OpenCL
Programming model | SIMD vector | SIMT
Thread hierarchy | Single level | Two level
Memory exposure | Uniform space | Multiple space
Source of horsepower | Vectorization and multiple output | Memory spaces utilization including shared memory
Pros | Easier programming | More flexible programming
Challenges (both) | Harnessing the potential horsepower

Talk Outline

• Introduction to GPUs
• Overview of the tool chains for both CUDA and OpenCL
• Motivation for pursuing this work
– Comparing intermediate representations
– Leveraging/analyzing benefits of Open64 optimization on AMD GPUs
– Comparing challenges with fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)
• Discuss PTX and IL
• Describe new common IR
• Two examples of PTX->IR->IL binary translation
• Discuss status of project and future work

GPU Programming Model

Single Instruction Multiple Threads (SIMT)

Parallelism is implicit

Programs (also called kernels or shaders) are generally small and contain nested loops

Synchronization is handled explicitly

Toolchains

Toolchain = compiler + runtime library

[Figure: software stacks, top to bottom. NVIDIA: C for CUDA and OpenCL, CUDA Runtime, graphics driver, GPU. AMD: Brook+ and OpenCL, CAL Runtime, graphics driver, GPU.]

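CUDA Compiler

[Figure: CUDA compilation flow. Compile time: cudafe separates the C for CUDA source into host and GPU code; Open64 compiles the GPU code to PTX*, and the host compiler builds the exe. Execution time: the runtime hands the PTX to the driver, which produces the GPU binary.]

* PTX is included as data in the host application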

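OpenCL (Dynamic) Compiler

[Figure: OpenCL compilation flow. Compile time: the host compiler builds the exe against the OpenCL library. Execution time: the OpenCL library compiles the OpenCL source with LLVM, and the runtime and driver produce the binary.]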

Objectives of our work

Compare two different IRs from similar massively-threaded architectures

Influence future IR design (an active topic in GPGPU research)

Leverage/analyze benefits of Open64 optimizations

Compare challenges with fundamentally different ISAs: Superscalar/SIMT versus VLIW/SIMT

CUDA Runtime

• Device Management
– cudaSetDevice, cudaGetDevice
• Memory Management
– Allocation: cudaMalloc, cudaFree
– Transfer: cudaMemcpy, cudaMemset
• Execution Control
– Kernel launch: cudaLaunch
– Config: cudaConfigureCall
• Thread Management
– cudaThreadSynchronize

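CUDA Runtime (Vector Add example)

__global__ void vecAdd(int A[], int B[], int C[]) {
    int i = threadIdx.x;                 // one thread per element
    C[i] = A[i] + B[i];
}

int main() {
    int hA[] = {…};                      // input values are elided on the slide
    int hB[] = {…};
    const int N = sizeof(hA) / sizeof(hA[0]);
    int hC[N];
    int *dA, *dB, *dC;                   // device buffers
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));

    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(dA, dB, dC);        // one block of N threads

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}

The kernel launch vecAdd<<<1, N>>>(…) is lowered to three runtime calls: cudaConfigureCall, cudaSetupArgument, cudaLaunch.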

NVIDIA PTX

Low-level IR (close to ISA)

Pseudo-assembly style syntax

Load-Store instruction set

Strongly typed language, e.g. cvt.s32.u16 %r1, %tid.x;

Unlimited virtual registers

Predicate registers

AMD IL

High-level IR

Structured control flow (if-endif, while-end, switch-end)

No predication

32-bit registers (4 components) - vectorization

Common PTX and IL instructions

vectorAdd (PTX)

    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r1, %rh1, %rh2;
    cvt.u32.u16 %r2, %tid.x;
    add.u32 %r3, %r2, %r1;
    ld.param.s32 %r4, [N];
    setp.le.s32 %p1, %r4, %r3;
    @%p1 bra $LabelA;
    cvt.u64.s32 %rd1, %r3;
    mul.lo.u64 %rd2, %rd1, 4;
    ld.param.u64 %rd3, [A];
    add.u64 %rd4, %rd3, %rd2;
    ld.global.f32 %f1, [%rd4+0];
    ld.param.u64 %rd5, [B];
    add.u64 %rd6, %rd5, %rd2;
    ld.global.f32 %f2, [%rd6+0];
    add.f32 %f3, %f1, %f2;
    ld.param.u64 %rd7, [C];
    add.u64 %rd8, %rd7, %rd2;
    st.global.f32 [%rd8+0], %f3;
    $LabelA:
    exit;

• Data movement (mov)
• Memory access (ld, st)
• Arithmetic (mul, add)
• Conversion (cvt)
• Comparison and selection (setp)
• Control flow (bra): uses predication for conditional branch

Common PTX and IL instructions

vectorAdd (IL)

    mov r0, vThreadGrpId.x
    mov r1, cb0[0].x
    imul r2, r0, r1
    mov r3, vTidInGrp.x
    iadd r4, r3, r2
    mov r5, cb1[3]
    ige r6, r4, r5
    if_logicalz r6
    mov r7, r4
    imul r8, r7, l0
    mov r9, cb1[0]
    iadd r10, r9, r8
    uav_raw_load_id(0) r11, r10
    mov r12, cb1[1]
    iadd r13, r12, r8
    uav_raw_load_id(0) r14, r13
    add r15, r11, r14
    mov r16, cb1[2]
    iadd r17, r16, r8
    uav_raw_store_id(0) mem.xyzw, r17, r15
    endif
    end

• Data movement (mov)
• Memory access (uav_raw)
• Arithmetic (imul, iadd)
• No conversion instructions
• Comparison and selection (ige)
• Control flow (if_logicalz): structured statements
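
Both vectorAdd listings (PTX and IL) compute the same thing for each thread: a global index from the block id, block size and thread id, a bounds check against N, then two loads, one add and one store. The sketch below writes that per-thread work as plain host C++ and replaces the parallel launch with two nested loops. It is an illustration of the semantics only; the function names are hypothetical, and the register mapping in the comments is read off the listings rather than taken from any specification.

```cpp
#include <cassert>
#include <vector>

// What a single GPU thread does in both listings, written as plain C++.
//   block_id   corresponds to %ctaid.x / vThreadGrpId.x
//   block_size corresponds to %ntid.x  / cb0[0].x
//   thread_id  corresponds to %tid.x   / vTidInGrp.x
void vector_add_thread(unsigned block_id, unsigned block_size, unsigned thread_id,
                       int n, const float* a, const float* b, float* c) {
    // mul.wide + add.u32 in PTX, imul + iadd in IL
    int idx = static_cast<int>(thread_id + block_id * block_size);
    // setp.le + predicated bra in PTX, ige + if_logicalz in IL
    if (n <= idx) return;
    // two loads, one floating-point add, one store
    c[idx] = a[idx] + b[idx];
}

// Sequential stand-in for a 1-D kernel launch: every (block, thread) pair runs once.
void vector_add_emulated(unsigned num_blocks, unsigned block_size,
                         int n, const float* a, const float* b, float* c) {
    for (unsigned blk = 0; blk < num_blocks; ++blk)
        for (unsigned tid = 0; tid < block_size; ++tid)
            vector_add_thread(blk, block_size, tid, n, a, b, c);
}
```

Launching more threads than elements (for example 2 blocks of 4 threads for 5 elements) is safe because of the bounds check; launching fewer leaves the tail of the output untouched.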

Ocelot Framework*

• Implemented as a CUDA library
• Intercepts library calls
• PTX emulation on the CPU
• Parses PTX into an internal IR
• Analysis: CFG, SSA, data flow, optimizations
• Our work:
– IR for IL programs
– PTX IR -> IL IR translation
– AMD/CAL backend

*Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31–42, New York, NY, USA, 2010. ACM.

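Translation Framework

[Figure: translation pipeline. Inside Ocelot, the PTX from the exe passes through the PTX parser, analysis, translation to IL, and the CAL back-end, which hands the IL to the ATI driver. The figure also carries a "compile-time" label.]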

IL Control Tree

Based on Structural Analysis*

Build DFS spanning tree of the control flow graph and traverse in postorder

Form regions and collapse the nodes in the CFG

Construct the Control Tree in the process

Repeat until only 1 node is left in the CFG

*S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
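
The four steps above can be sketched in a few dozen lines. This is a deliberately reduced version of structural analysis, not the project's implementation: it recognizes only three region shapes (a self loop, a two-node sequence, and an if-then), treats each basic block as an opaque node, and records the control tree as a nested string. All names (Graph, reduce_at, structural_analysis, the Loop/Block/IfThen labels) are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// CFG as adjacency sets: node id -> successor ids. Every node must appear as a key.
using Graph = std::map<int, std::set<int>>;

// Postorder of the DFS spanning tree rooted at n.
static void postorder(const Graph& g, int n, std::set<int>& seen, std::vector<int>& out) {
    if (!seen.insert(n).second) return;
    for (int s : g.at(n)) postorder(g, s, seen, out);
    out.push_back(n);
}

// Predecessors of m in the current, partially collapsed graph.
static std::set<int> preds(const Graph& g, int m) {
    std::set<int> p;
    for (const auto& kv : g)
        if (kv.second.count(m)) p.insert(kv.first);
    return p;
}

// Try to match one region headed by n. On success, collapse it into n.
static bool reduce_at(Graph& g, std::map<int, std::string>& tree, int n, int entry) {
    std::set<int>& s = g[n];
    if (s.count(n)) {                                    // self loop
        tree[n] = "Loop(" + tree[n] + ")";
        s.erase(n);
        return true;
    }
    if (s.size() == 1) {                                 // sequence n -> m
        int m = *s.begin();
        if (m != entry && preds(g, m) == std::set<int>{n}) {
            tree[n] = "Block(" + tree[n] + "," + tree[m] + ")";
            std::set<int> after = g[m];
            g.erase(m);
            s = after;                                   // n inherits m's successors
            return true;
        }
    }
    if (s.size() == 2) {                                 // if-then: n -> {m, k}, m -> k
        int a = *s.begin(), b = *s.rbegin();
        for (int pass = 0; pass < 2; ++pass) {
            int m = pass ? b : a, k = pass ? a : b;
            if (m != entry && preds(g, m) == std::set<int>{n} && g[m] == std::set<int>{k}) {
                tree[n] = "IfThen(" + tree[n] + "," + tree[m] + ")";
                g.erase(m);
                s = std::set<int>{k};
                return true;
            }
        }
    }
    return false;
}

// Returns the control tree as text, or "" if these three patterns cannot
// reduce the CFG to a single node.
std::string structural_analysis(Graph g, int entry) {
    std::map<int, std::string> tree;
    for (const auto& kv : g) tree[kv.first] = "BB" + std::to_string(kv.first);
    for (bool changed = true; changed;) {
        std::set<int> seen;
        std::vector<int> order;
        postorder(g, entry, seen, order);                // rebuilt after every collapse
        changed = false;
        for (int n : order)
            if (reduce_at(g, tree, n, entry)) { changed = true; break; }
    }
    return g.size() == 1 ? tree[entry] : "";
}
```

A full implementation would add further region kinds (if-then-else, natural loops with separate headers, switch) and a fallback for unstructured control flow; this sketch simply reports failure in that case.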

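IL Control Tree

[Figure: example control tree. The Entry node has a WHILE child whose cond is a BB and whose body is an IF; the IF has three BB children (cond, true, false). Interior nodes are abstract nodes representing regions.]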

Example 1 (if-then)

PTX

    mov.u16…
    setp.le.s32 p1, r4, r3
    @p1 bra LabelA
    cvt.u64.s32…
    LabelA:
    exit

Control tree

    Entry
      Block
        BB: mov..
        IF
          cond: BB: setp..
          true: BB: cvt…
        BB: exit

Example 1 (if-then)

Control tree

    Entry
      Block
        BB: mov..
        IF
          cond: BB: setp..
          true: BB: cvt…
        BB: exit

IL

    mov…
    ige r6, r4, r5
    if_logicalz r6
    mov…
    endif
    end

Example 2 (for-loop)

PTX

    mov.u16…
    setp.le.s32 p1, r5, r3
    @p1 bra LabelA
    cvt.u64.s32…
    LabelB:
    …
    setp.lt.s32 p2, r4, r5
    @p2 bra LabelB
    LabelA:
    exit

Control tree

    Entry
      Block
        BB: mov..
        IF
          cond: BB: setp..
          true: Block
            BB: cvt…
            WHILE
              cond: setp …
              body
        BB: exit

Example 2 (for-loop)

Control tree

    Entry
      Block
        BB: mov..
        IF
          cond: BB: setp..
          true: Block
            BB: cvt…
            WHILE
              cond: setp …
              body
        BB: exit

IL

    mov…
    ige r7, r4, r6
    if_logicalz r7
    mov…
    whileloop
    …
    if_logicalz r17
    break
    endif
    endloop
    endif
    end
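
Once the control tree exists, producing structured IL is a recursive walk over it. The sketch below assumes that each basic block already holds translated IL text and that the register named on an IF or WHILE node holds the value tested by if_logicalz, as in the two IL listings of the examples. The Node type, the helper constructors and emit_program are hypothetical names, and the loop is emitted bottom-tested (body first, then a conditional break) to match Example 2.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Node {
    enum Kind { BB, Block, IfThen, While } kind;
    std::vector<std::string> code;  // BB only: already-translated IL instructions
    std::string reg;                // IfThen / While: register tested by if_logicalz
    std::vector<Node> kids;         // Block: children; IfThen: {cond, body}; While: {body}
};

// Small constructors so trees can be written inline.
Node bb(std::vector<std::string> code) { return Node{Node::BB, std::move(code), "", {}}; }
Node block(std::vector<Node> kids) { return Node{Node::Block, {}, "", std::move(kids)}; }
Node if_then(Node cond, std::string reg, Node body) {
    return Node{Node::IfThen, {}, std::move(reg), {std::move(cond), std::move(body)}};
}
Node while_loop(Node body, std::string reg) {
    return Node{Node::While, {}, std::move(reg), {std::move(body)}};
}

// Depth-first emission: each region kind maps to one structured IL construct.
void emit(const Node& n, std::vector<std::string>& out) {
    switch (n.kind) {
    case Node::BB:
        out.insert(out.end(), n.code.begin(), n.code.end());
        break;
    case Node::Block:
        for (const Node& k : n.kids) emit(k, out);
        break;
    case Node::IfThen:
        emit(n.kids[0], out);                       // condition block
        out.push_back("if_logicalz " + n.reg);      // body runs when reg is zero
        emit(n.kids[1], out);
        out.push_back("endif");
        break;
    case Node::While:
        out.push_back("whileloop");
        emit(n.kids[0], out);                       // loop body, including the test
        out.push_back("if_logicalz " + n.reg);      // leave the loop when reg is zero
        out.push_back("break");
        out.push_back("endif");
        out.push_back("endloop");
        break;
    }
}

// Emit a whole kernel: the tree followed by the terminating "end".
std::vector<std::string> emit_program(const Node& root) {
    std::vector<std::string> out;
    emit(root, out);
    out.push_back("end");
    return out;
}
```

Building the tree of Example 2 as a Block containing an IfThen whose body is a Block with a While yields the same nesting of if_logicalz, whileloop, break, endloop and endif as the IL listing above.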

Other BT Challenges

Pointer arithmetic in CUDA needs to be emulated in CAL

Translate Application Binary Interface (ABI), e.g. different calling conventions

Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit

Project Status

The main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)

3 CUDA applications from the SDK running

Code quality comparable to LLVM code generation

Next Steps

Enhance translation of the Control Tree to support other IL constructs (e.g., switch-case)

Implement other GPGPU abstractions (e.g., shared memory, textures, etc.)

Handle PTX predicated instructions (since IL does not support predication directly)
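
One plausible way to handle the last item is to wrap each predicated instruction in a structured if/endif. The sketch below is an assumption about how that could look, not the project's design: it splits a PTX guard such as @%p1 or @!%p1 from the instruction text and brackets the already-translated IL with if_logicalnz (or if_logicalz for a negated guard). It assumes that the predicate has been materialized in an IL register as zero/non-zero, and that if_logicalnz is the non-zero counterpart of the if_logicalz seen in the listings; the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A PTX instruction split into its optional guard and the remaining text.
struct Guarded {
    bool has_guard = false;
    bool negated = false;   // true for "@!p"
    std::string pred;       // predicate register name without '@', '!' or '%'
    std::string rest;       // the instruction after the guard
};

// Split "@%p1 add.s32 ..." or "@!%p1 add.s32 ..." into guard and instruction.
// Unguarded or malformed input is returned unchanged in `rest`.
Guarded split_guard(const std::string& ptx) {
    Guarded g;
    g.rest = ptx;
    if (ptx.empty() || ptx[0] != '@') return g;
    std::size_t sp = ptx.find(' ');
    if (sp == std::string::npos) return g;
    std::size_t i = 1;
    if (ptx[i] == '!') { g.negated = true; ++i; }
    if (ptx[i] == '%') ++i;
    g.has_guard = true;
    g.pred = ptx.substr(i, sp - i);
    g.rest = ptx.substr(sp + 1);
    return g;
}

// Bracket the translated IL body with a structured if/endif.
// pred_reg is the IL register assumed to hold the predicate (0 or non-zero).
std::vector<std::string> wrap_in_if(const Guarded& g, const std::string& pred_reg,
                                    const std::vector<std::string>& il_body) {
    if (!g.has_guard) return il_body;
    std::vector<std::string> out;
    out.push_back((g.negated ? "if_logicalz " : "if_logicalnz ") + pred_reg);
    out.insert(out.end(), il_body.begin(), il_body.end());
    out.push_back("endif");
    return out;
}
```

Wrapping every guarded instruction separately is simple but can produce many tiny if/endif regions; grouping consecutive instructions that share a guard would be a natural refinement.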

Summary and Future Work

GPUs are revolutionizing desktop supercomputing

A number of critical applications have been migrated successfully

CUDA and OpenCL have made these platforms much more accessible for general purpose computing

AMD presently has the highest DP FP performance

CUDA presently produces higher performance code for NVIDIA

We are developing a platform that leverages the best of both worlds
