Redefining the Role of the CPU in the Era of CPU-GPU Integration
Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden, and Dean Tullsen
Computer Science and Engineering, UC San Diego
IEEE Micro, Nov-Dec 2012
AMD Research, August 20th, 2012
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
[Diagram: multicore CPUs serve general-purpose applications; energy-efficient GPUs (GPGPU) serve throughput applications. The historical progression leads to the integrated APU and next-gen APUs, with performance/energy gains from chip integration. The focus of improvements spans improved GPGPU, memory systems, CPU architecture scaling, and easier programming.]
The CPU-GPU Era
Year | AMD APU Product | Components | CPU-only parts
2011 | Llano | Husky (K10) CPU + Northern Islands GPU | Consumer: Phenom/Athlon II; Server: Barcelona ...
2012 | Trinity | Piledriver CPU + Southern Islands GPU | Consumer: Vishera; Server: Delhi/Abu Dhabi ...
2013 | Kaveri | Steamroller CPU + Sea Islands GPU |

APUs have essentially the same CPU cores as the CPU-only parts.
Example CPU-GPU Benchmark: KMeans (implementation from Rodinia)

[Flow: randomly pick centers -> find the closest center for each point -> find new centers -> repeat]
GPU: find the closest center for each point (easy data parallelism over each point)
CPU: find new centers (few centers, with possibly different #points per center)
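For concreteness, here is a minimal CUDA sketch of this split. It is not the Rodinia source; the function names, data layout, and launch configuration are illustrative assumptions.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>

// GPU half: one thread per point finds its closest center (data parallel).
// Launched as, e.g., assign_points<<<(n_points + 255) / 256, 256>>>(...).
__global__ void assign_points(const float *points, const float *centers,
                              int *membership, int n_points, int n_centers,
                              int dim) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_points) return;
    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < n_centers; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[p * dim + d] - centers[c * dim + d];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    membership[p] = best_c;
}

// CPU half: recompute centers. Few centers, with possibly different
// numbers of points per center, so this irregular reduction stays on the CPU.
void update_centers(const float *points, const int *membership,
                    float *centers, int n_points, int n_centers, int dim) {
    std::vector<int> count(n_centers, 0);
    std::fill(centers, centers + n_centers * dim, 0.0f);
    for (int p = 0; p < n_points; ++p) {
        int c = membership[p];
        ++count[c];
        for (int d = 0; d < dim; ++d)
            centers[c * dim + d] += points[p * dim + d];
    }
    for (int c = 0; c < n_centers; ++c)
        if (count[c] > 0)
            for (int d = 0; d < dim; ++d)
                centers[c * dim + d] /= (float)count[c];
}
```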
Properties of KMeans

Metric | CPU Only | With GPU
Time fraction running kernel code | ~50% | ~16% (kernel speedup 5x)
Time spent on the CPU | 100% | ~84%
Perfect ILP (window size 128) | 7.0 | 4.8
"Hard" branches | 2.3% | 4.6%
"Hard" loads | 36.2% | 64.5%
Application speedup on 8-core CPU | 1.5x | 1.0x
CPU performance remains critical: adding the GPU drastically changes the properties of the code left on the CPU.
Aim: understand and evaluate this "new" CPU workload.
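A quick consistency check of the first two rows (simple Amdahl's-law arithmetic on the numbers above): with the kernel at ~50% of CPU-only time and a 5x kernel speedup, one unit of original run time becomes 0.5 + 0.5/5 = 0.6, so the kernel is 0.1/0.6 ≈ 16% of the accelerated run and the CPU portion is 0.5/0.6 ≈ 84%, matching the table.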
The Need to Rethink CPU Design
APUs: the prime example of heterogeneous systems
Heterogeneity: compose cores that each run a subset of the code well
The CPU need not be fully general-purpose; it is sufficient to optimize it for the non-GPU code
Goal: investigate the non-GPU code and use it to guide CPU design
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
Benchmarks
[Diagram: application spectrum between CPU and GPU. Serial and parallel apps run CPU-only; partitioned apps split across both, in three categories: CPU-Heavy, Mixed, and GPU-Heavy.]
Benchmarks
CPU-Heavy (11 apps): important computing apps with no evidence of GPU ports
  SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [serial]
  Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [parallel]
Mixed and GPU-Heavy (11 + 11 apps)
  Rodinia (7 apps)
  SPEC/Parsec applications mapped to GPUs (15 apps)
Mixed

Benchmark | Suite | GPU Kernels | Kernel Speedup
Kmeans | Rodinia | 2 | 5.0
H264 | SPEC | 2 | 12.1
SRAD | Rodinia | 2 | 15.0
Sphinx3 | SPEC | 1 | 17.7
Particlefilter | Rodinia | 2 | 32.0
Blackscholes | Parsec | 1 | 13.7
Swim | SPEC | 3 | 25.3
Milc | SPEC | 18 | 6.0
Hmmer | SPEC | 1 | 19.0
LUD | Rodinia | 1 | 13.5
Streamcluster | Parsec | 1 | 26.0
GPU-Heavy

Benchmark | Suite | GPU Kernels | Kernel Speedup
Bwaves | SPEC | 1 | 18.0
Equake | SPEC | 1 | 5.3
Libquantum | SPEC | 3 | 28.1
Ammp | SPEC | 2 | 6.8
CFD | Rodinia | 5 | 5.5
Mgrid | SPEC | 4 | 34.3
LBM | SPEC | 1 | 31.0
Leukocyte | Rodinia | 3 | 70.0
Art | SPEC | 3 | 6.8
Heartwall | Rodinia | 6 | 7.9
Fluidanimate | Parsec | 6 | 3.9
Methodology
We are interested in the non-GPU portions of CPU-GPU code
Ideal scenario: port all applications to the GPU and use hardware counters
  Infeasible: requires man-hours and domain expertise, and yields platform- and architecture-dependent code
Instead: CPU-GPU partitioning based on expert information
  Publicly available source code (Rodinia)
  Details of the GPU portions from publications and our own implementations (SPEC/Parsec)
Methodology
Microarchitectural simulations
  Marked the GPU portions in the application code (see the sketch below)
  Ran the marked applications through Pin-based microarchitectural simulators (ILP, branches, loads and stores)
Machine measurements
  Used the marked code (CPU criticality)
  Used parallel CPU source code when available (TLP studies)
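As an illustration of how such marking can work, here is a minimal sketch; it is an assumption, not the authors' actual tooling. Empty marker functions delimit the GPU portion, and a Pin tool can locate them by symbol name (e.g., with RTN_FindByName) to attribute the instructions in between.

```cuda
// Hypothetical region markers. They are noinline with an empty volatile asm
// so the calls survive optimization and remain visible to a Pin tool, which
// can toggle per-region instruction counting when it sees them execute.
extern "C" {
__attribute__((noinline)) void gpu_region_begin() { asm volatile(""); }
__attribute__((noinline)) void gpu_region_end()   { asm volatile(""); }
}

// Usage in an application, following the KMeans split shown earlier:
void kmeans_iteration() {
    gpu_region_begin();
    // ... code that the expert partitioning assigns to the GPU ...
    gpu_region_end();
    // ... remaining CPU code: the "new" CPU workload being analyzed ...
}
```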
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
CPU Criticality
[Chart: proportion of total application time (%) spent on the CPU for Mixed and GPU-Heavy apps, shown three ways: CPU-only non-kernel time, with reported kernel speedups, and with conservative kernel speedups.]
Mixed: even though ~80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU.
GPU-Heavy: the CPU still executes 7-14% of the time.
(Averages are weighted by the conservative CPU time.)
Instruction Level Parallelism
Measures the inherent parallelism of the instruction stream
ILP measured assuming perfect memory and perfect branch prediction, within a fixed-size instruction window (see the sketch below)
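A minimal sketch of this style of limit study, under stated assumptions: register-only dependences, at most 256 register ids, dependences tracked only within each window, and an invented trace format. The paper's Pin-based simulator is more detailed.

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// One traced instruction: the registers it reads and the one it writes.
struct Inst {
    std::vector<int> srcs;  // register ids read
    int dst;                // register id written (-1 if none)
};

// With perfect branches and memory, an instruction can issue as soon as its
// producers have finished. ILP = instructions per cycle of dataflow depth,
// computed over consecutive windows of `window` instructions.
double measure_ilp(const std::vector<Inst>& trace, size_t window) {
    int64_t cycles = 0;
    for (size_t base = 0; base < trace.size(); base += window) {
        std::vector<int64_t> ready(256, 0);  // cycle each register is ready
        size_t end = std::min(base + window, trace.size());
        int64_t depth = 0;
        for (size_t i = base; i < end; ++i) {
            int64_t issue = 0;  // limited only by data dependences
            for (int s : trace[i].srcs) issue = std::max(issue, ready[s]);
            if (trace[i].dst >= 0) ready[trace[i].dst] = issue + 1;
            depth = std::max(depth, issue + 1);
        }
        cycles += depth;  // simplification: dependences do not cross windows
    }
    return cycles > 0 ? double(trace.size()) / double(cycles) : 0.0;
}
```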
[Chart: ILP (parallel instructions within the instruction window) for the CPU-Heavy apps at window sizes 128 and 512. Averages: 9.6 at window size 128, 12.7 at window size 512.]
[Chart: ILP for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU, at window sizes 128 and 512. Mixed: 10.3 -> 9.2 (window 128) and 15.3 -> 11.1 (window 512). GPU-Heavy: 14.6 -> 13.7. Overall: 9.9 -> 9.5 (window 128) and 13.7 -> 12.2 (window 512). CPU-Heavy averages repeated for reference: 9.6 (128), 12.7 (512).]
Instruction Level Parallelism
ILP dropped in 17 of 22 applications: on average 4% at window size 128 and 10.9% at window size 512
  Dropped by half for 5 applications; Mixed-app ILP dropped by as much as 27.5%
Common case: independent loops are mapped to the GPU, leaving less regular, dependence-heavy code on the CPU
Occasionally, long dependent chains land on the GPU instead, e.g., Blackscholes (5 of 22 apps are such outliers)
The potential gains from larger instruction windows are going to be degraded
Branches
Branches categorized into 4 categories (classifier sketch below):
  Biased: > 95% in the same direction
  Patterned: > 95% accuracy on a very large local predictor
  Correlated: > 95% accuracy on a very large gshare predictor
  Hard: the remainder
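For reference, a minimal sketch of a gshare-style predictor in the spirit of the "correlated" category above. The slide fixes only the 95% threshold, so the table size, counter width, and update policy here are assumptions.

```cuda
#include <cstdint>
#include <vector>

class Gshare {
    std::vector<uint8_t> table;  // 2-bit saturating counters
    uint64_t history = 0;        // global branch history
    unsigned bits;
public:
    explicit Gshare(unsigned index_bits)
        : table(1u << index_bits, 1), bits(index_bits) {}

    // Predict, then update with the actual outcome; returns whether the
    // prediction was correct so a caller can accumulate per-branch accuracy.
    bool predict_and_update(uint64_t pc, bool taken) {
        size_t idx = (pc ^ history) & ((1u << bits) - 1);
        bool pred = table[idx] >= 2;
        if (taken  && table[idx] < 3) ++table[idx];
        if (!taken && table[idx] > 0) --table[idx];
        history = ((history << 1) | (taken ? 1 : 0)) & ((1u << bits) - 1);
        return pred == taken;
    }
};
// A static branch is "correlated" if this predictor is right > 95% of the
// time for it (and the branch was not already classified biased/patterned).
```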
[Chart: branch distribution for the CPU-Heavy apps, percentage of dynamic branches in each of the four categories. Segment averages: 55.2%, 24.7%, 13.1%, and 7.0%.]
[Chart: branch distribution for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 5.1% -> 9.4% (callout: effect of CPU-Heavy apps) and 11.3% -> 18.6% (callout: effects of data-dependent branches on GPU-Heavy apps).]
Overall: branch predictors tuned for generic CPU execution may not be sufficient.
Loads and Stores
Loads and stores categorized into 4 categories (classifier sketch below):
  Static: > 95% to the same address
  Strided: > 95% accuracy on a very large stride predictor
  Patterned: > 95% accuracy on a very large Markov predictor
  Hard: the remainder
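As with branches, a minimal sketch of the stride classifier: per static load/store, predict last_address + last_stride and track accuracy. Only the 95% cutoff comes from the slide; the per-PC table organization is an assumption.

```cuda
#include <cstdint>
#include <unordered_map>

struct StrideEntry {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    uint64_t hits = 0, total = 0;
};

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by memory-op PC
public:
    void access(uint64_t pc, uint64_t addr) {
        StrideEntry& e = table[pc];
        if (e.total > 0) {  // can only predict from the second access onward
            uint64_t predicted = e.last_addr + e.stride;
            if (predicted == addr) ++e.hits;
            e.stride = int64_t(addr) - int64_t(e.last_addr);
        }
        e.last_addr = addr;
        ++e.total;
    }
    // "Strided" if the prediction was right more than 95% of the time.
    bool is_strided(uint64_t pc) const {
        auto it = table.find(pc);
        return it != table.end() && it->second.total > 1 &&
               it->second.hits * 100 > (it->second.total - 1) * 95;
    }
};
```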
[Chart: distribution of non-trivial loads for the CPU-Heavy apps across the Hard / Patterned / Strided categories. Segment averages: 77.5%, 5.9%, and 16.6%.]
[Chart: distribution of non-trivial stores for the CPU-Heavy apps across the Hard / Patterned / Strided categories. Segment averages: 71.7%, 10.2%, and 18.1%.]
[Chart: distribution of non-trivial loads for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 44.4% -> 61.6% and 47.3% -> 27.0% (callout: effects of kernels with irregular accesses moving to the GPU).]
Overall: stride or next-line predictors will struggle.
[Chart: distribution of non-trivial stores for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 38.6% -> 51.3% and 48.6% -> 34.9%.]
Overall: slightly less pronounced, but similar results as for loads.
[Chart: vector (SSE) instructions as a percentage of dynamic instructions for the CPU-Heavy apps. Average: 7.3%.]
[Chart: SSE instructions as a fraction of dynamic instructions for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. The SSE share drops once kernels move to the GPU: 15.0% -> 8.5% and 16.9% -> 9.6%.]
Vector ISA enhancements target the same regions of code as the GPU.
[Chart: thread-level parallelism for the CPU-Heavy apps: speedup at 8 and 32 cores.]
[Chart: thread-level parallelism for Mixed and GPU-Heavy apps: speedup at 8 and 32 cores, CPU-only vs. with the GPU.]
The abundant parallelism in GPU-Heavy apps disappears (14.0x -> 2.1x): no gain going from 8 cores to 32 cores.
Mixed: gains drop from 4x to 1.4x.
Overall: only a 10% gain going from 8 to 32 cores; 32-core TLP dropped 60%, from 5.5x to 2.2x.
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
CPU Design in the post-GPU Era
Only modest gains from increasing window sizes
Considerably increased pressure on the branch predictor
  In spite of fewer static branches
  Adopt techniques targeting the few difficult branches (L-TAGE, Seznec 2007)
Memory accesses will continue to be a major bottleneck
  Stride or next-line prefetching becomes significantly less relevant
  Lots of literature, but never adopted in real machines (e.g., helper-thread prefetching or mechanisms targeting pointer chains)
SSE rendered significantly less important
  Not every core needs it; cores could share SSE hardware
Extra CPU cores/threads are of little use because of the lack of TLP
CPU Design in the post-GPU Era
(1) A clear case for big cores (with a focus on loads/stores/branches, not ILP) + GPUs
(2) Need to start adopting proposals for few-thread performance
(3) Start by revisiting old techniques from today's perspective
Backup
On Using Unmodified Source Code
The most common memory layout change is AOS -> SOA, which is still just a change in stride value; AOS is well captured by stride/Markov predictors
CPU-only code has even better locality, also well captured by stride/Markov predictors
But the locality-enhanced accesses are exactly the ones that map to the GPU, so using unmodified source has minimal impact on the analyzed CPU code: its accesses remain irregular
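To illustrate the AOS -> SOA point, a hypothetical sketch (the struct and field names are not from the benchmarks): converting AOS to SOA changes the stride of each field's access stream but keeps it a fixed stride, which is why a stride predictor captures both layouts.

```cuda
struct PointAOS { float x, y, z; };  // array of structures

struct PointsSOA {                   // structure of arrays
    float *x; float *y; float *z;
};

// AOS: consecutive iterations touch p[i].x with a fixed 12-byte stride.
float sum_x_aos(const PointAOS *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i].x;
    return s;
}

// SOA: the same field becomes a unit-stride (4-byte) stream, which is what
// lets GPU threads coalesce their loads when this code moves to the GPU.
float sum_x_soa(const PointsSOA &p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p.x[i];
    return s;
}
```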