AMD Accelerated Computing - UFRJ
Accelerated Computing
Roberto Brandão
AMD Latin America
OpenCL
GPU, CPU
DirectCompute
Agenda
X86 PROCESSOR EVOLUTION
THE GPU AS AN ACCELERATOR
ACCELERATED PROCESSING UNITS
INTRODUCTION TO OpenCL
Evolving x86 Processors
AMD architecture: "Istanbul" six-core diagram
[Diagram labels: L3 Cache; cores 1 to 6, each with an L2 cache; crossbar; memory controller; HyperTransport links; chipsets; PCI-e]
Lower memory latency
Balanced caches
Fast full-duplex bus
Native six-core processor
4P/24-core system example: very good scalability
One memory controller for every processor
Full-duplex HyperTransport links (up to 5.2 GT/s)
Bus optimization: HT Assist (cache probe filtering)
Still the only available 4P system with Direct Connect Architecture
[Diagram: four processors, each with its own memory]
Direct Connect Architecture 1.0
Balanced and Scalable Design to Support up to 6 Cores
[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]
No front side bus
Integrated memory controller
HyperTransport™ technology
NUMA memory architecture
Direct Connect Architecture 2.0
Balanced and Scalable Design to Support up to 16 Cores* per CPU
• 1-hop between processors
• Up to 50% more DIMMs
• Four memory channels
• Up to 33% increase in CPU to CPU communication speed±
[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]
What is next for x86 CPUs
• More processor cores to come (12, 16, 16 double cores)
• More memory channels (improves memory bandwidth per core)
• Improved IPC (8 per cycle is a target)
Top500 list - beyond the petaflop
Datacenters in the USA will spend more than $3 billion on energy in 2009
1997: Garry Kasparov vs. IBM Deep Blue
The World’s Most Powerful GPU = 177x IBM Deep Blue
2011 GPU Architecture: AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
Improved anti-aliasing performance
Fast 256-bit GDDR5 memory interface
Up to 5.5 Gbps
New GPU compute features
Designing very efficient GPUs
Full load: 180W; Idle: 27W
[Chart: GFLOPS/W and GFLOPS/mm2 by GPU generation: ATI Radeon™ X1800 XT (Nov-05), ATI Radeon™ X1900 XTX (Jan-06), ATI Radeon™ HD 2900 PRO (Sep-07), ATI Radeon™ HD 3870 (Nov-07), ATI Radeon™ HD 4870 (Jun-08), ATI Radeon™ HD 5870 (Oct-09). Highlighted data labels: 14.47 GFLOPS/W and 7.90 GFLOPS/mm2. Other data labels as extracted: 7.50, 4.56, 4.50, 2.24, 2.21, 0.92, 2.01, 1.06, 1.07, 0.42]
Old and New in High Performance Computing
Old: Power is free, Transistors are expensive
New: Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)
Old: Multiplies are slow, Memory access is fast
New: Multiplies fast, Memory slow
(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)
Old: Increasing Instruction Level Parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited
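The latency figures above hint at how much arithmetic has to be available per memory access to keep the execution units busy. A minimal sketch in C, using only the slide's round numbers (200 clocks for a DRAM access, 4 clocks for a floating-point multiply); the function name is hypothetical and the figures are illustrative, not measurements:

```c
#include <assert.h>

/* Number of back-to-back FP multiplies that fit in the time of one DRAM
 * access. Both latencies are given in clock cycles. */
int multiplies_per_dram_access(int dram_clocks, int multiply_clocks)
{
    assert(dram_clocks > 0 && multiply_clocks > 0);
    return dram_clocks / multiply_clocks;
}
```

With the slide's values, multiplies_per_dram_access(200, 4) returns 50: one trip to DRAM costs about as much as 50 multiplies, which is why hiding memory latency with many parallel threads matters.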
GPUs: more than just gaming
[Bar chart: Processing power, millions of operations per second. Bars: Single Core, Dual Core, Quad Core, Hexa Core, 12 Cores, Radeon HD 5970. Data labels: 12, 24, 48, 72, 144, 2600; the label 2700 also appears on the chart]
Wii Sports - Golf; Oil exploration platform - 2010
Both use GPUs
DirectX® 11 Multi-Threading
Application, DirectX runtime, and DirectX driver can each run in separate threads
Tasks like loading a texture or compiling a shader can execute in parallel with main rendering thread
[Comparison figure: DirectX® 10 vs. DirectX® 11]
Today’s GPUs focused on
GAMING
ENTERTAINMENT
PRODUCTIVITY
DirectX® 11 Tessellation
Images courtesy of Unigine Corp.
[Comparison images: No Tessellation (DirectX® 10) vs. Tessellation (DirectX® 11)]
Research companies are already using GPUs
Oil exploration, weather forecasting, fluid dynamics, nature simulation
AMD Balanced Platform
Delivers optimal performance for a wide range of platform configurations
Other Highly Parallel Workloads
Graphics Workloads
Serial/Task-Parallel Workloads
CPU is excellent for running some algorithms
Ideal place to process if GPU is fully loaded
Great use for additional CPU cores
GPU is ideal for data parallel algorithms like image processing, CAE, etc
Great use for ATI Stream technology
Great use for additional GPUs
ATI Stream Technology is…
Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience
High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development
Digital Content Creation, Engineering, Sciences, Government, Gaming, Productivity
Improvements already reached consumers
[Chart: Processor utilization, 0% to 80%; series label: ATI Stream]
Adobe Flash plugin used by Youtube.com: better image quality and video smoothness, lower processor usage
GPU-accelerated video transcoding
Up to 6x faster when using an AMD graphics card
HD Video to iPod Video
Video Transcoding Sample: No GPU Acceleration
Using four CPU cores
CPU Usage: 100%
GPU Usage: 1%
Time to finish: 1h 52m
Peak power: 145W
Total energy: 0.23 kWh
Energy cost: $0.1526
Video Transcoding Sample: ATI GPU Acceleration
Using hundreds of Stream Processors
CPU Usage: 45% (100%)
GPU Usage: 35% (1%)
Time to finish: 26m (1h 52m)
Peak power: 198W (145W)
Total energy: 0.11 kWh (0.23)
Energy cost: $0.07 ($0.15)
(Values in parentheses are the results without GPU acceleration.)
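The two transcoding runs can be compared with a little arithmetic. A minimal sketch in C, using only the figures from the slides (1h 52m = 112 minutes vs. 26 minutes, 0.23 kWh vs. 0.11 kWh); the function names are hypothetical:

```c
#include <assert.h>

/* Speedup: how many times faster the accelerated run finishes. */
double speedup(double baseline_minutes, double accelerated_minutes)
{
    assert(baseline_minutes > 0.0 && accelerated_minutes > 0.0);
    return baseline_minutes / accelerated_minutes;
}

/* Fraction of energy saved by the accelerated run (0.0 to 1.0). */
double energy_saved(double baseline_kwh, double accelerated_kwh)
{
    assert(baseline_kwh > 0.0);
    return 1.0 - accelerated_kwh / baseline_kwh;
}
```

For this sample, speedup(112.0, 26.0) is about 4.3 and energy_saved(0.23, 0.11) is about 0.52. Peak power rises from 145W to 198W, yet total energy drops by roughly half because the job finishes much sooner.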
FUSION TECHNOLOGY
Today
TeraFLOPS-class GPU
Up to 2 billion transistors
Multi-monitor gaming
Full HD video and audio
Multi-core CPU
~800 million transistors
Multi-tasking
A new era in performance evolution
Single-Core era (single-thread performance over time). Challenge: power consumption, complexity
Multi-Core era (performance over time x cores). Challenge: power consumption, software
Heterogeneous computing era (performance over time). Pros: performance, power efficient. Cons: software availability
Each curve carries a "We are here" marker
A new era in performance evolution
[Diagram labels: Single-Core, Multi-Core, Software Acceleration, CPU, GPU, Gaming, Multimedia, Low power consumption, Core efficiency]
Putting it all together – The Future is Fusion
[Diagram: AMD "Istanbul" six-core processor (L3 cache; cores 1 to 6, each with an L2 cache; crossbar; memory controller; HyperTransport links; chipsets; PCI-e) shown with the RV500 GPU core (2006) (memory controller, ring stops, client interfaces)]
Putting it all together – The Future is Fusion
[Diagram: AMD "Istanbul" six-core processor (L3 cache; cores 1 to 6, each with an L2 cache; crossbar; memory controller; HyperTransport links; chipsets; PCI-e) shown with the RV700 GPU core (2008-2009)]
Putting it all together – The Future is Fusion
[Diagram: RV700 GPU core and AMD "Istanbul" six-core processor, each shown with a crossbar]
2011: welcome to the APU era!
CPU + GPU = APU
“Supercomputing power in a notebook platform whose battery lasts for a full day”
One Design, Fewer Watts, Massive Capability
Dual-Core CPU (117 sq. mm, 25 watts) + Northbridge (66 sq. mm, 13 watts) + Discrete-level DirectX® 11 GPU (59 sq. mm, 8 watts) = “Zacate” AMD Fusion APU (75 sq. mm, 18 watts)
Graphics and Media Processing Efficiency Improvements
2010 IGP-based Platform
[Diagram labels: CPU chip (CPU cores, UNB, MC); GPU, UVD, SB functions; PCIe; DDR3 DIMM memory; links marked ~7 GB/sec and ~17 GB/sec]
Bandwidth pinch points and latency hold back the GPU capabilities
2011 APU-based Platform
[Diagram labels: APU chip (CPU cores, GPU, UVD, UNB / MC); PCIe; DDR3 DIMM memory; links marked ~27 GB/sec]
3X bandwidth between GPU and memory
Even the same sized GPU is substantially more effective in this configuration
Eliminate latency and power associated with the extra chip crossing
Substantially smaller physical footprint
Graphics requires memory bandwidth to bring full capabilities to life
“Ontario” & “Zacate” Architecture
APU
> 2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)
> C6 and power gating
> Array of SIMD Engines
• DX11 graphics performance
• Industry leading 3D and graphics processing
> 3rd Generation Unified Video Decoder
> H.264, VC1, DivX/Xvid format
> DDR3 800-1066, 2 DIMMs, 64 bit channel
> BGA package
Display and I/O
> Two dedicated digital display interfaces
• Configurable externally as HDMI, DVI, and/or Display Port
• Also supports a single link LVDS for internal panels
> Integrated VGA
> 5x8 PCIe®
> “Hudson” Fusion Controller Hub
Working together
OpenCL
ATI Stream SDK: OpenCL™ For Multicore x86 CPUs and GPUs
The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience
• First complete OpenCL™ development platform
• Certified OpenCL 1.0 compliant by the Khronos Group
• Write code that can scale well on multi-core CPUs and GPUs
• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support
http://developer.amd.com/
OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities
Industry standard API: Open, multiplatform development platform for heterogeneous architectures
The power of Fusion: Leverages CPUs and GPUs for balanced system approach
Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
Fast track development: Ratified in December; AMD is the first company to provide a complete OpenCL solution
Momentum: Enormous interest from mainstream developers and application ISVs
More stream-enabled applications across all markets
Open Standards: Maximize Developer Freedom and Addressable Market
Vendor specific: cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface
Vendor neutral: cross-platform enablers
• Digital Visual Interface
• OpenCL™
• DirectX®
• Certified DP
• JEDEC
• OpenGL®
OpenCL™ and DirectX® are emerging as the two most important standards for heterogeneous (CPU+GPU) compute
Comparing OpenCL™ and DirectX® 11 DirectCompute
How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
Feature set is similar in both APIs
DirectX® 11 DirectCompute
Easiest path to add compute capabilities to existing DirectX applications
Windows Vista® and Windows® 7 only
OpenCL™
Ideal path for new applications porting to the GPU for the first time
True multiplatform: Windows®, Linux®, MacOS
Natural programming without dealing with a graphics API
Anatomy of OpenCL™
Language Specification
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions - familiar to developers
• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
• Online or offline compilation and build of compute kernel executables
• Includes a rich set of built-in functions
Platform Layer API
• A hardware abstraction layer over diverse computational resources
• Query, select and initialize compute devices
• Create compute contexts and work-queues
Runtime API
• Execute compute kernels
• Manage scheduling, compute, and memory resources
OpenCL Example
Scalar
void square(int n, const float *a, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}
Data-Parallel
__kernel void dp_square(__global const float *a, __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}
// dp_square executes over "n" work-items
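To see what running a kernel over "n" work-items means without an OpenCL runtime, the kernel body can be emulated in plain C: a loop plays the role of the runtime and hands each work-item its own global ID. This is only a sketch; dp_square_item and run_work_items are hypothetical names, and a real OpenCL device would run the work-items in parallel rather than one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the data-parallel kernel for ONE work-item. In real OpenCL the
 * id would come from get_global_id(0); here it is passed in explicitly. */
void dp_square_item(size_t id, const float *a, float *result)
{
    result[id] = a[id] * a[id];
}

/* Stand-in for the OpenCL runtime: launches n work-items, one per element.
 * A GPU would execute these concurrently; this loop runs them in sequence. */
void run_work_items(size_t n, const float *a, float *result)
{
    assert(a != NULL && result != NULL);
    for (size_t id = 0; id < n; id++)
        dp_square_item(id, a, result);
}
```

Calling run_work_items(4, a, result) on a four-element array squares each element independently. Because no work-item reads another's output, the order of execution does not matter, which is what lets the hardware spread the work across many stream processors.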
Summary
X86 PROCESSOR EVOLUTION
THE GPU AS AN ACCELERATOR
ACCELERATED PROCESSING UNITS
INTRODUCTION TO OpenCL
http://developer.amd.com
Thank you!