AMD Accelerated Computing - UFRJ

Accelerated Computing

Roberto Brandão

AMD Latin America

CPU, GPU, OpenCL, DirectCompute

Agenda

X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL

Evolving x86 Processors

AMD architecture: “Istanbul” six-core diagram

[Diagram: six cores (1 to 6), each with its own L2 cache, sharing an L3 cache; a crossbar; a memory controller; HyperTransport links to the PCIe chipsets]

Native six-core processor

Balanced caches

Lower memory latency

Fast full-duplex bus

4P/24-core system example: very good scalability

One memory controller for every processor

Full-duplex HyperTransport links (up to 5.2 GT/s)

Bus optimization: HT Assist (Cache Probe Filtering)

Still the only available 4P system with Direct Connect Architecture

[Diagram: four processors, each with its own memory]

Direct Connect Architecture 1.0

Balanced and Scalable Design to Support up to 6 Cores

[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]

No front side bus

Integrated memory controller

HyperTransport™ technology

NUMA memory architecture

Direct Connect Architecture 2.0

Balanced and Scalable Design to Support up to 16 Cores* per CPU

• 1-hop between processors

• Up to 50% more DIMMs

• Four memory channels

• Up to 33% increase in CPU to CPU communication speed±

[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]

What is next for x86 CPUs

• More processor cores to come (12, 16, 16 double cores)

• More memory channels (improves memory bandwidth per core)

• Improved IPC (8 per cycle is a target)

Top500 list - beyond the petaflop

Datacenters in the USA will spend more than $3 billion on energy in 2009

1997: Garry Kasparov vs. IBM Deep Blue

The World’s Most Powerful GPU = 177x IBM Deep Blue

2011 GPU Architecture AMD Radeon™ HD 6900 Series

Dual graphics engines

New VLIW4 core architecture

Up to 24 SIMD engines

Up to 96 Texture Units

Upgraded render back-ends

Improved anti-aliasing performance

Fast 256-bit GDDR5 memory interface

Up to 5.5 Gbps

New GPU compute features

Designing very efficient GPUs

Full load: 180W; Idle: 27W

[Chart: GFLOPS/W and GFLOPS/mm2 for ATI Radeon™ X1800 XT (Nov-05), ATI Radeon™ X1900 XTX (Jan-06), ATI Radeon™ HD 2900 PRO (Sep-07), ATI Radeon™ HD 3870 (Nov-07), ATI Radeon™ HD 4870 (Jun-08), and ATI Radeon™ HD 5870 (Oct-09). Highlighted values: 14.47 GFLOPS/W and 7.90 GFLOPS/mm2. Other data labels, as extracted: 7.50, 4.56, 4.50, 2.24, 2.21, 0.92, 2.01, 1.06, 1.07, 0.42.]

Old and New in High Performance Computing

Old: Power is free, Transistors are expensive

New: Power expensive, Transistors free

(Can put more transistors on chip than can afford to turn on)

Old: Multiplies are slow, Memory access is fast

New: Multiplies fast, Memory slow

(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)

Old: Increasing Instruction Level Parallelism via compiler innovation

New: Explicit thread and data parallelism must be exploited

GPUs: more than just gaming

[Chart: Processing power – millions of operations per second]

Radeon HD 5970: 2600 (a further chart label reads 2700)

12 Cores: 144

Hexa Core: 72

Quad Core: 48

Dual Core: 24

Single Core: 12

Wii Sports - Golf

Oil exploration platform - 2010

Both use GPUs

DirectX® 11 Multi-Threading

Application, DirectX runtime, and DirectX driver can each run in separate threads

Tasks like loading a texture or compiling a shader can execute in parallel with main rendering thread

[Figure: DirectX® 10 compared with DirectX® 11]

Today’s GPUs focused on

GAMING

ENTERTAINMENT

PRODUCTIVITY

DirectX® 11 Tessellation

Images courtesy of Unigine Corp.

[Figure: No Tessellation (DirectX® 10) compared with Tessellation (DirectX® 11)]

Research companies already using

Oil exploration, Weather forecast, Fluid Dynamics, Nature simulation

AMD Balanced Platform

Delivers optimal performance for a wide range of platform configurations

Other Highly Parallel Workloads

Graphics Workloads

Serial/Task-Parallel Workloads

CPU is excellent for running some algorithms

Ideal place to process if GPU is fully loaded

Great use for additional CPU cores


GPU is ideal for data parallel algorithms like image processing, CAE, etc

Great use for ATI Stream technology

Great use for additional GPUs


ATI Stream Technology is…

Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience

High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency

Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development

Digital Content Creation, Engineering, Sciences, Government, Gaming, Productivity

Improvements already reached consumers

[Chart: Processor utilization, on a scale from 0% to 80%; series: ATI Stream]

Adobe Flash plugin used by YouTube.com

• Better image quality and video smoothness

• Lower processor usage

GPU-accelerated video transcoding

Up to 6x faster when using an AMD graphics card

HD Video to iPod Video

Video Transcoding Sample: No GPU Acceleration

Using four CPU cores

CPU Usage: 100%

GPU Usage: 1%

Time to finish: 1h 52m

Peak power: 145W

Total energy: 0.23 kWh

Energy price: $0.1526

Video Transcoding Sample: ATI GPU Acceleration

Using hundreds of Stream Processors

Values in parentheses are from the run without GPU acceleration.

CPU Usage: 45% (100%)

GPU Usage: 35% (1%)

Time to finish: 26m (1h 52m)

Peak power: 198W (145W)

Total energy: 0.11 kWh (0.23 kWh)

Energy price: $0.07 ($0.15)


FUSION TECHNOLOGY

Today

TeraFLOPS-class GPU

Up to 2 billion transistors

Gaming on multiple monitors

Full HD video and audio

Multi-core CPU

~800 million transistors

Multi-tasking

A new era in performance evolution

Single-Core: single-thread performance over time. Challenge: power consumption, complexity

Multi-Core: performance over time x cores. Challenge: power consumption, software

Heterogeneous computing: performance over time. Pros: performance, power efficient. Cons: software availability

[Charts: each of the three curves carries a “We are here” marker]

A new era in performance evolution

[Diagram labels: Single-Core, Multi-Core, Software Acceleration; CPU, GPU; Gaming, Multimedia; Core efficiency, Low power consumption]

Putting it all together – The Future is Fusion

[Diagram: AMD “Istanbul” six-core processor (cores 1 to 6 with L2 caches, L3 cache, crossbar, memory controller, HyperTransport links to the PCIe chipsets) next to an RV500 GPU core (2006) (memory controller, ring stops, client interfaces)]

[Diagram: AMD “Istanbul” six-core processor next to an RV700 GPU core (2008-2009)]

[Diagram: AMD “Istanbul” six-core processor and RV700 GPU core, each drawn around a crossbar]

2011: welcome to the APU era!

CPU + GPU = APU

“Supercomputing power in a notebook platform whose battery lasts for a full day”

One Design, Fewer Watts, Massive Capability

Dual-Core CPU (117 sq. mm, 25 watts) + Northbridge (66 sq. mm, 13 watts) + Discrete-level DirectX® 11 GPU (59 sq. mm, 8 watts) = “Zacate” AMD Fusion APU (75 sq. mm, 18 watts)

Graphics and Media Processing Efficiency Improvements

2010 IGP-based Platform

[Diagram labels: CPU chip with CPU cores, UNB, and MC; GPU, UVD, and SB functions; DDR3 DIMM memory; PCIe; links of ~7 GB/sec and ~17 GB/sec]

Bandwidth pinch points and latency hold back the GPU capabilities

2011 APU-based Platform

[Diagram labels: APU chip with CPU cores, GPU, UVD, and UNB/MC; DDR3 DIMM memory; PCIe; links of ~27 GB/sec]

3X bandwidth between GPU and memory

Even the same sized GPU is substantially more effective in this configuration

Eliminate latency and power associated with the extra chip crossing

Substantially smaller physical footprint

Graphics requires memory bandwidth to bring full capabilities to life

“Ontario” & “Zacate” Architecture

APU

> 2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)

> C6 and power gating

> Array of SIMD Engines

• DX11 graphics performance

• Industry leading 3D and graphics processing

> 3rd Generation Unified Video Decoder

> H.264, VC1, DivX/Xvid format

> DDR3 800-1066, 2 DIMMs, 64 bit channel

> BGA package

Display and I/O

> Two dedicated digital display interfaces

• Configurable externally as HDMI, DVI, and/or Display Port

• Also supports a single link LVDS for internal panels

> Integrated VGA

> 5x8 PCIe®

> “Hudson” Fusion Controller Hub

Working together

OpenCL

ATI Stream SDK: OpenCL™ For Multicore x86 CPUs and GPUs

The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience

• First complete OpenCL™ development platform

• Certified OpenCL 1.0 compliant by the Khronos Group

• Write code that can scale well on multi-core CPUs and GPUs

• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies

• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support

http://developer.amd.com/



OpenCL™: Game-Changing Development

Enabling Broad Adoption of GP-GPU Capabilities

Industry standard API: Open, multiplatform development platform for heterogeneous architectures

The power of Fusion: Leverages CPUs and GPUs for balanced system approach

Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.

Fast track development: Ratified in December 2008; AMD is the first company to provide a complete OpenCL solution

Momentum: Enormous interest from mainstream developers and application ISVs

More stream-enabled applications across all markets

Open Standards: Maximize Developer Freedom and Addressable Market

Vendor specific: Cross-platform limiters

• Apple Display Connector

• 3dfx Glide

• Nvidia CUDA

• Nvidia Cg

• Rambus

• Unified Display Interface

Vendor neutral: Cross-platform enablers

• Digital Visual Interface

• OpenCL™

• DirectX®

• Certified DP

• JEDEC

• OpenGL®

OpenCL™ and DirectX® are emerging as the two most important standards for heterogeneous (CPU+GPU) compute

Comparing OpenCL™ and DirectX® 11 DirectCompute

How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?

Feature set is similar in both APIs

DirectX® 11 DirectCompute

Easiest path to add compute capabilities to existing DirectX applications

Windows Vista® and Windows® 7 only

OpenCL™

Ideal path for new applications porting to the GPU for the first time

True multiplatform: Windows®, Linux®, MacOS

Natural programming without dealing with a graphics API

Anatomy of OpenCL™

Language Specification

• C-based cross-platform programming interface

• Subset of ISO C99 with language extensions - familiar to developers

• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error

• Online or offline compilation and build of compute kernel executables

• Includes a rich set of built-in functions

Platform Layer API

• A hardware abstraction layer over diverse computational resources

• Query, select and initialize compute devices

• Create compute contexts and work-queues

Runtime API

• Execute compute kernels

• Manage scheduling, compute, and memory resources

OpenCL Example

Scalar

void square(int n, const float *a, float *result)
{
    int i;
    /* One loop iteration per element */
    for (i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}

Data-Parallel

__kernel void dp_square(__global const float *a, __global float *result)
{
    /* Each work-item handles the element at its own global ID */
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}

// dp_square executes over “n” work-items

Summary


X86 PROCESSOR EVOLUTION

THE GPU AS AN ACCELERATOR

ACCELERATED PROCESSING UNITS

INTRODUCTION TO OpenCL

http://developer.amd.com

Thank you!

[email protected]
