The Future Is Heterogeneous Computing
TRANSCRIPT
-
8/13/2019 The Future Is Heterogeneous Computing
1/26
Page 2 | The Future Is Heterogeneous Computing | Oct 27, 2010
Workload Example: Changing Consumer Behavior
- 20 hours of video uploaded to YouTube every minute
- 50 million+ digital media files added to personal content libraries every day
- Approximately 9 billion video files owned are high-definition
- 1,000 images are uploaded to Facebook every second
Challenges for Next Generation Systems
- The Power Wall: even more broadly constraining in the future!
- Complexity Management (HW and SW): principles for managing exponential growth; parallelism, programmability and efficiency
- Optimized SW for system-level solutions
- System balance: memory technologies and system design; interconnect design
The Power Wall
Easy prediction: power will continue to be the #1 design constraint for computer systems.
Why? Vmin will not continue tracking Moore's Law.
- Integration of system-level components consumes chip power: a well-utilized 100 GB/sec DDR memory interface consumes ~15W for the I/O alone!
- 2nd-order effects of power: thermal, packaging and cooling (node-level and datacenter-level); electrical stability in the face of rising variability
- Thermal Design Points (TDPs) in all market segments continue to drop
- Lightly loaded and idle power characteristics are key parameters in the Operational Expense (OpEx) equation
- The percentage of total world energy consumed by computing devices continues to grow year-on-year
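The memory I/O claim above can be sanity-checked with simple arithmetic. The ~15W and 100 GB/sec figures come from the slide; the per-bit energy is derived from them:

```python
# Energy per bit implied by the slide's figures: a well-utilized
# 100 GB/sec DDR interface burning ~15 W on I/O alone.
bandwidth_bytes_per_s = 100e9      # 100 GB/sec, from the slide
io_power_w = 15.0                  # ~15 W for I/O, from the slide

bits_per_s = bandwidth_bytes_per_s * 8
pj_per_bit = io_power_w / bits_per_s * 1e12
print(f"~{pj_per_bit:.1f} pJ per bit transferred")
```

That works out to roughly 19 pJ per transferred bit, which is why the slide treats memory I/O as a first-class consumer of the power budget.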
Optimized SW for System-level Solutions
Long history of SW optimizations for HW characteristics:
- Optimizing compilers; cache/TLB blocking
- Multi-processor coordination: communication and synchronization
- Non-uniform memory characteristics: process and memory affinity
The scarcity/abundance principle favors increased use of abstractions:
- Abstraction leads to increased productivity but costs performance
- Still allow experts to burrow down into lower-level, on-the-metal details
The System-level Integration Era will demand even more:
- Many-core: user-mode and/or managed-runtime scheduling?
- Heterogeneous many-core: capability-aware scheduling?
- SW productivity versus optimization dichotomy: exposed HW leads to better performance but requires a platform-characteristics-aware programming model
The Memory Wall: getting thicker
There has always been a critical balance between Data Availability and Processing.
Situation | When? | Implication | Industry Solutions
DRAM vs CPU cycle-time gap | Early 1990s | Memory wait time dominates computing | Non-blocking caches; Out-of-Order machines
SW productivity crisis: object-oriented languages; managed runtime environments | Mid 1990s | Larger working sets; more diverse data types | Larger caches; cache hierarchies; elaborate prefetch
Single-thread to CMP focus | 2005 and beyond | Multiple working sets! Virtual machines! More memory accesses | Huge caches; multiple memory controllers; extreme PHYs
New and emerging abstractions: browser-based runtimes; image/video as basic data types; throughput-based designs | 2009 and beyond | Even larger working sets; larger data types | Accelerated Parallel Processing; chip stacking; TBD
Interconnect Challenges
- Coherence domain: knowing when to stop; interesting implications for on-chip interconnect networks
- Industry mantra: "Never bet against Ethernet." But current Ethernet is not well suited for lossless transmission, which is troublesome for storage, messaging and more
- The more subtle and trickier problems: adaptive routing, congestion management, QoS, end-to-end characteristics, and more
- The data centers of tomorrow are going to take great interest in this area
Single-thread Performance
[Figure: trend sketches, each annotated "we are here":
- The IPC Complexity Wall: IPC vs. issue width
- Moore's Law: integration (log scale) vs. time
- The Power Wall: power budget (TDP) vs. time
- The Frequency Wall: frequency vs. time
- Single-thread performance vs. time, flattening due to DFM, variability, reliability and wire delay
- Performance vs. cache size (locality)
Power matters in every segment: server (power = $$), desktop (eliminate fans), mobile (battery).]
[Figure: two charts of speed-up vs. number of CPU cores (1 to 128), one showing the 0% and 100% serial cases, the other adding 10% and 35% serial curves.]
Parallel Programs and Amdahl's Law
Speed-up = 1 / (SW + (1 - SW) / N)
SW: fraction of serial work; N: number of processors
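The curves in the figure follow directly from this formula. A minimal sketch in Python, using serial fractions and core counts that mirror the chart:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Speed-up = 1 / (SW + (1 - SW) / N), per the slide's formula."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even 10% serial work caps the 128-core speed-up near 9x.
for sw in (0.0, 0.10, 0.35, 1.0):
    print(f"{sw:4.0%} serial:",
          [round(amdahl_speedup(sw, n), 1) for n in (2, 8, 32, 128)])
```

Only the perfectly parallel (0% serial) case scales linearly; the 100% serial case never exceeds 1x, no matter how many cores are added.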
Assume a 100W TDP socket:
- 10W for global clocking
- 20W for on-chip network/caches
- 15W for I/O (memory, PCIe, etc.)
This leaves 55W for all the cores: ~850mW per core!
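The per-core arithmetic can be reproduced directly. The core count is not stated on the slide; 64 cores is an assumption, chosen because it is consistent with the ~850 mW figure:

```python
tdp_w = 100.0                          # socket TDP, from the slide
overheads_w = 10.0 + 20.0 + 15.0       # clocking + network/caches + I/O
cores_budget_w = tdp_w - overheads_w   # 55 W left for all the cores
n_cores = 64                           # assumed; not stated on the slide
per_core_mw = cores_budget_w / n_cores * 1000
print(f"{cores_budget_w:.0f} W for cores -> ~{per_core_mw:.0f} mW per core")
```

55 W spread over 64 cores gives roughly 860 mW each, in line with the slide's ~850 mW headline.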
35 Years of Microprocessor Trend Data
[Figure: log-scale trends over time for transistors (thousands), single-thread performance (SpecINT), frequency (MHz), typical power (watts), and number of cores. Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten; dotted-line extrapolations by C. Moore.]
The Power Wall, Again!
Escalating multi-core designs will crash into the power wall, just like single cores did due to escalating frequency.
Why?
- To maintain a reasonable balance, core additions must be accompanied by increases in other resources that consume power (on-chip network, caches, memory and I/O bandwidth, ...), a spiral-upwards effect on power
- The use of multiple cores forces each core to actually slow down; at some point, the power limits will not even allow you to activate all of the cores at the same time
- Small, low-power cores tend to be very weak on single-threaded general-purpose workloads, yet the customer value proposition will continue to demand excellent performance on general-purpose workloads
- The transition to compelling general-purpose parallel workloads will not be a fast one
Three Eras of Processor Performance
Single-Core Era
- Single-thread performance over time (we are here: past the knee of the curve)
- Enabled by: Moore's Law, voltage scaling, microarchitecture
- Constrained by: power, complexity

Multi-Core Era
- Throughput performance over time (# of processors)
- Enabled by: Moore's Law, desire for throughput, 20 years of SMP architecture
- Constrained by: power, parallel SW availability, scalability

Heterogeneous Systems Era
- Targeted application performance over time (data-parallel exploitation)
- Enabled by: Moore's Law, abundant data parallelism, power-efficient GPUs
- Currently constrained by: programming models, communication overheads
AMD x86 64-bit CMP Evolution

Year | 2003 | 2005 | 2007 | 2008 | 2009 | 2010
Product | AMD Opteron | Dual-Core AMD Opteron | Quad-Core AMD Opteron | 45nm Quad-Core AMD Opteron | Six-Core AMD Opteron | AMD Opteron 6100 Series
Mfg. process | 90nm SOI | 90nm SOI | 65nm SOI | 45nm SOI | 45nm SOI | 45nm SOI
CPU core | K8 | K8 | Greyhound | Greyhound+ | Greyhound+ | Greyhound+
L2/L3 | 1MB/0 | 1MB/0 | 512kB/2MB | 512kB/6MB | 512kB/6MB | 512kB/12MB
HyperTransport technology | 3x 1.6GT/s | 3x 1.6GT/s | 3x 2GT/s | 3x 4.0GT/s | 3x 4.8GT/s | 4x 6.4GT/s
Memory | 2x DDR1 300 | 2x DDR1 400 | 2x DDR2 667 | 2x DDR2 800 | 2x DDR2 1066 | 4x DDR3 1333

Max power budget remains consistent.
AMD Opteron 6100 Series Silicon and Package
[Die photo: two dies, each with six cores and L3 cache]
- 12 AMD64 x86 cores
- 18 MB on-chip cache
- 4 memory channels @ 1333 MHz
- 4 HyperTransport links @ 6.4 GT/sec
AMD Radeon HD 5870 GPU Architecture
[Figure: block diagram of the HD 5870 GPU architecture]
GPU Processing Performance Trend
[Figure: peak single-precision GigaFLOPS from Sep 2005 to Jul 2009, rising from a few hundred to over 2700 GFLOPS:
- R520: ATI Radeon X1800, ATI FireGL V7200/V7300/V7350
- R580(+): ATI Radeon X19xx, ATI FireStream
- R600: ATI Radeon HD 2900, ATI FireGL V7600/V8600/V8650 (unified shaders)
- RV670: ATI Radeon HD 3800, ATI FireGL V7700, AMD FireStream 9170 (double-precision floating point, GPGPU via CTM)
- RV770: ATI Radeon HD 4800, ATI FirePro V8700, AMD FireStream 9250/9270 (Stream SDK, CAL+IL/Brook+, 2.5x ALU increase)
- Cypress: ATI Radeon HD 5870 (OpenCL 1.1+, DirectX 11, 2.25x perf.)
Peak single-precision performance; for RV670, RV770 and Cypress, divide by 5 for peak double-precision performance.]
GPU Efficiency
[Figure: GFLOPS/W and GFLOPS/mm2 by generation for the ATI Radeon X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08) and HD 5870 (Oct-09), rising each generation to 14.47 GFLOPS/W and 7.90 GFLOPS/mm2 for the HD 5870.]
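The HD 5870 endpoint of this chart can be reproduced from public Cypress specifications: 1600 stream ALUs at 850 MHz, two FLOPs per ALU per cycle via multiply-add, and a 188 W board power. These numbers come from AMD's product literature, not from the slide itself:

```python
alus = 1600            # Cypress stream processors (public spec)
clock_ghz = 0.850      # engine clock (public spec)
flops_per_alu = 2      # a fused multiply-add counts as two FLOPs
board_power_w = 188.0  # HD 5870 board power (public spec)

peak_gflops = alus * flops_per_alu * clock_ghz   # 2720 GFLOPS single precision
print(f"{peak_gflops:.0f} GFLOPS, {peak_gflops / board_power_w:.2f} GFLOPS/W")
```

The result, 14.47 GFLOPS/W, matches the chart's headline figure for the HD 5870.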
AMD Accelerated Parallel Processing (APP) Technology
Target markets: digital content creation, engineering sciences, government, gaming, productivity.
- Heterogeneous: developers leverage AMD GPUs and CPUs for optimal application performance and user experience
- High performance: massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
- Industry standards: OpenCL enables cross-platform development
Heterogeneous Computing: Next-Generation Software Ecosystem
Software stack, bottom to top:
- Hardware and drivers: AMD Fusion, discrete CPUs/GPUs
- OpenCL and DirectCompute
- Tools: HLL compilers, debuggers, profilers
- Middleware/libraries: video, imaging, math/sciences, physics
- High-level frameworks
- End-user applications
Goals: advanced optimizations and load balancing across CPUs and GPUs, leveraging AMD Fusion performance advantages; drive new features into industry standards; increase ease of application development.
AMD Balanced Platform Advantage
Delivers advanced performance for a wide range of platform configurations.
Serial/task-parallel workloads:
- CPU is excellent for running some algorithms
- Ideal place to process if the GPU is fully loaded
- Great use for additional CPU cores
Graphics and other highly parallel workloads:
- GPU is ideal for data-parallel algorithms like image processing, CAE, etc.
- Great use for AMD Accelerated Parallel Processing (APP) technology
- Great use for additional GPUs
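The placement rules above can be summarized as a tiny dispatch sketch. The function name and the `gpu_busy` flag are illustrative only, not part of any AMD API:

```python
def place_work(kind, gpu_busy=False):
    """Route work per the slide: data-parallel work to the GPU;
    serial/task-parallel work, or overflow from a loaded GPU, to the CPU."""
    if kind == "data_parallel" and not gpu_busy:
        return "gpu"
    return "cpu"

print(place_work("data_parallel"))                  # GPU takes parallel work
print(place_work("data_parallel", gpu_busy=True))   # CPU absorbs the overflow
print(place_work("serial"))                         # CPU runs serial work
```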
Challenges: Extracting Parallelism

Coarse-grain data-parallel code (loop 1M times over 1M pieces of data; maps very well to throughput-oriented data-parallel engines):
  i = 0
  loop: load x(i); fmul; store; i++; cmp i, 1000000; bc loop

Nested data-parallel code (a 2D array representing a very large dataset; lots of conditional data parallelism; benefits from closer coupling between CPU and GPU):
  i = 0; j = 0
  outer: inner: load x(i,j); fmul; store; j++; cmp j, 100000; bc inner
  i++; cmp i, 100000; bc outer

Fine-grain data-parallel code (loop 16 times over 16 pieces of data; maps very well to integrated SIMD dataflow, e.g. SSE):
  i = 0
  loop: load x(i); fmul; store; i++; cmp i, 16; bc loop
A New Era of Processor Performance
[Figure: the three eras plotted against throughput performance (GPU) and programmability (CPU). Homogeneous computing advances through the system-level programmable Single-Core and Multi-Core Eras via microprocessor advancement; GPU advancement moves from graphics-driver-based programs to OpenCL/DX driver-based programs, converging with the CPU in the Heterogeneous Systems Era of heterogeneous computing.]
Now the AMD Fusion Era of Computing Begins
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
This presentation contains forward-looking statements concerning AMD and technology partner product offerings which are made pursuant to the safe harbor provisions of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "strategy," "roadmaps," "projects" and other terms with similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, ATI, the ATI logo, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. Other names are for informational purposes only and may be trademarks of their respective owners.