Redefining the Role of the CPU in the Era of CPU-GPU Integration
Manish Arora, Siddhartha Nath, Subhra Mazumdar, Scott Baden, and Dean Tullsen
Computer Science and Engineering, UC San Diego
IEEE Micro, Nov-Dec 2012
AMD Research, August 20th, 2012
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
[Diagram: multicore CPUs serve general-purpose applications; energy-efficient GPUs (GPGPU) serve throughput applications. The historical progression leads to the integrated APU and next-gen APUs, with performance/energy gains from chip integration. The focus of improvements spans improved GPGPU, memory systems, CPU architecture scaling, and easier programming.]
The CPU-GPU Era
Year | AMD APU Product | Components | CPU-only parts
2011 | Llano | Husky (K10) CPU + Northern Islands GPU | Consumer: Phenom/Athlon II; Server: Barcelona ...
2012 | Trinity | Piledriver CPU + Southern Islands GPU | Consumer: Vishera; Server: Delhi/Abu Dhabi ...
2013 | Kaveri | Steamroller CPU + Sea Islands GPU |

APUs have essentially the same CPU cores as the CPU-only parts.
Example CPU-GPU Benchmark: KMeans (implementation from Rodinia)

[Flow: randomly pick centers -> find the closest center for each point -> find new centers -> repeat]
GPU: find the closest center for each point (easy data parallelism over each point)
CPU: find new centers (few centers, with possibly different #points per center)
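For concreteness, here is a minimal CUDA sketch of this split. It is not the Rodinia source; the function names, data layout, and launch configuration are illustrative assumptions.

```cuda
#include <algorithm>
#include <cfloat>
#include <vector>

// GPU half: one thread per point finds its closest center (data parallel).
// Launched as, e.g., assign_points<<<(n_points + 255) / 256, 256>>>(...).
__global__ void assign_points(const float *points, const float *centers,
                              int *membership, int n_points, int n_centers,
                              int dim) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_points) return;
    float best = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < n_centers; ++c) {
        float dist = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[p * dim + d] - centers[c * dim + d];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    membership[p] = best_c;
}

// CPU half: recompute centers. Few centers, with possibly different
// numbers of points per center, so this irregular reduction stays on the CPU.
void update_centers(const float *points, const int *membership,
                    float *centers, int n_points, int n_centers, int dim) {
    std::vector<int> count(n_centers, 0);
    std::fill(centers, centers + n_centers * dim, 0.0f);
    for (int p = 0; p < n_points; ++p) {
        int c = membership[p];
        ++count[c];
        for (int d = 0; d < dim; ++d)
            centers[c * dim + d] += points[p * dim + d];
    }
    for (int c = 0; c < n_centers; ++c)
        if (count[c] > 0)
            for (int d = 0; d < dim; ++d)
                centers[c * dim + d] /= (float)count[c];
}
```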
Properties of KMeans

Metric | CPU Only | With GPU
Time fraction running kernel code | ~50% | ~16% (kernel speedup 5x)
Time spent on the CPU | 100% | ~84%
Perfect ILP (window size 128) | 7.0 | 4.8
"Hard" branches | 2.3% | 4.6%
"Hard" loads | 36.2% | 64.5%
Application speedup on 8-core CPU | 1.5x | 1.0x
CPU performance remains critical: adding the GPU drastically changes the properties of the code left on the CPU.
Aim: understand and evaluate this "new" CPU workload.
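A quick consistency check of the first two rows (simple Amdahl's-law arithmetic on the numbers above): with the kernel at ~50% of CPU-only time and a 5x kernel speedup, one unit of original run time becomes 0.5 + 0.5/5 = 0.6, so the kernel is 0.1/0.6 ≈ 16% of the accelerated run and the CPU portion is 0.5/0.6 ≈ 84%, matching the table.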
The Need to Rethink CPU Design
APUs: the prime example of heterogeneous systems
Heterogeneity: compose cores that each run a subset of the code well
The CPU need not be fully general-purpose; it is sufficient to optimize it for the non-GPU code
Goal: investigate the non-GPU code and use it to guide CPU design
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
Benchmarks
[Diagram: application spectrum between CPU and GPU. Serial and parallel apps run CPU-only; partitioned apps split across both, in three categories: CPU-Heavy, Mixed, and GPU-Heavy.]
Benchmarks
CPU-Heavy (11 apps): important computing apps with no evidence of GPU ports
  SPEC: Parser, Bzip, Gobmk, MCF, Sjeng, GemsFDTD [serial]
  Parsec: Povray, Tonto, Facesim, Freqmine, Canneal [parallel]
Mixed and GPU-Heavy (11 + 11 apps)
  Rodinia (7 apps)
  SPEC/Parsec applications mapped to GPUs (15 apps)
Mixed

Benchmark | Suite | GPU Kernels | Kernel Speedup
Kmeans | Rodinia | 2 | 5.0
H264 | SPEC | 2 | 12.1
SRAD | Rodinia | 2 | 15.0
Sphinx3 | SPEC | 1 | 17.7
Particlefilter | Rodinia | 2 | 32.0
Blackscholes | Parsec | 1 | 13.7
Swim | SPEC | 3 | 25.3
Milc | SPEC | 18 | 6.0
Hmmer | SPEC | 1 | 19.0
LUD | Rodinia | 1 | 13.5
Streamcluster | Parsec | 1 | 26.0
GPU-Heavy

Benchmark | Suite | GPU Kernels | Kernel Speedup
Bwaves | SPEC | 1 | 18.0
Equake | SPEC | 1 | 5.3
Libquantum | SPEC | 3 | 28.1
Ammp | SPEC | 2 | 6.8
CFD | Rodinia | 5 | 5.5
Mgrid | SPEC | 4 | 34.3
LBM | SPEC | 1 | 31.0
Leukocyte | Rodinia | 3 | 70.0
Art | SPEC | 3 | 6.8
Heartwall | Rodinia | 6 | 7.9
Fluidanimate | Parsec | 6 | 3.9
Methodology
We are interested in the non-GPU portions of CPU-GPU code
Ideal scenario: port all applications to the GPU and use hardware counters
  Infeasible: requires man-hours and domain expertise, and yields platform- and architecture-dependent code
Instead: CPU-GPU partitioning based on expert information
  Publicly available source code (Rodinia)
  Details of the GPU portions from publications and our own implementations (SPEC/Parsec)
Methodology
Microarchitectural simulations
  Marked the GPU portions in the application code (see the sketch below)
  Ran the marked applications through Pin-based microarchitectural simulators (ILP, branches, loads and stores)
Machine measurements
  Used the marked code (CPU criticality)
  Used parallel CPU source code when available (TLP studies)
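As an illustration of how such marking can work, here is a minimal sketch; it is an assumption, not the authors' actual tooling. Empty marker functions delimit the GPU portion, and a Pin tool can locate them by symbol name (e.g., with RTN_FindByName) to attribute the instructions in between.

```cuda
// Hypothetical region markers. They are noinline with an empty volatile asm
// so the calls survive optimization and remain visible to a Pin tool, which
// can toggle per-region instruction counting when it sees them execute.
extern "C" {
__attribute__((noinline)) void gpu_region_begin() { asm volatile(""); }
__attribute__((noinline)) void gpu_region_end()   { asm volatile(""); }
}

// Usage in an application, following the KMeans split shown earlier:
void kmeans_iteration() {
    gpu_region_begin();
    // ... code that the expert partitioning assigns to the GPU ...
    gpu_region_end();
    // ... remaining CPU code: the "new" CPU workload being analyzed ...
}
```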
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
CPU Criticality
[Chart: proportion of total application time (%) spent on the CPU for Mixed and GPU-Heavy apps, shown three ways: CPU-only non-kernel time, with reported kernel speedups, and with conservative kernel speedups.]
Mixed: even though ~80% of the code is mapped to the GPU, the CPU is still the bottleneck; more time is spent on the CPU than on the GPU.
GPU-Heavy: the CPU still executes 7-14% of the time.
(Averages are weighted by the conservative CPU time.)
Instruction Level Parallelism
Measures the inherent parallelism of the instruction stream
ILP measured assuming perfect memory and perfect branch prediction, within a fixed-size instruction window (see the sketch below)
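A minimal sketch of this style of limit study, under stated assumptions: register-only dependences, at most 256 register ids, dependences tracked only within each window, and an invented trace format. The paper's Pin-based simulator is more detailed.

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// One traced instruction: the registers it reads and the one it writes.
struct Inst {
    std::vector<int> srcs;  // register ids read
    int dst;                // register id written (-1 if none)
};

// With perfect branches and memory, an instruction can issue as soon as its
// producers have finished. ILP = instructions per cycle of dataflow depth,
// computed over consecutive windows of `window` instructions.
double measure_ilp(const std::vector<Inst>& trace, size_t window) {
    int64_t cycles = 0;
    for (size_t base = 0; base < trace.size(); base += window) {
        std::vector<int64_t> ready(256, 0);  // cycle each register is ready
        size_t end = std::min(base + window, trace.size());
        int64_t depth = 0;
        for (size_t i = base; i < end; ++i) {
            int64_t issue = 0;  // limited only by data dependences
            for (int s : trace[i].srcs) issue = std::max(issue, ready[s]);
            if (trace[i].dst >= 0) ready[trace[i].dst] = issue + 1;
            depth = std::max(depth, issue + 1);
        }
        cycles += depth;  // simplification: dependences do not cross windows
    }
    return cycles > 0 ? double(trace.size()) / double(cycles) : 0.0;
}
```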
[Chart: ILP (parallel instructions within the instruction window) for the CPU-Heavy apps at window sizes 128 and 512. Averages: 9.6 at window size 128, 12.7 at window size 512.]
[Chart: ILP for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU, at window sizes 128 and 512. Mixed: 10.3 -> 9.2 (window 128) and 15.3 -> 11.1 (window 512). GPU-Heavy: 14.6 -> 13.7. Overall: 9.9 -> 9.5 (window 128) and 13.7 -> 12.2 (window 512). CPU-Heavy averages repeated for reference: 9.6 (128), 12.7 (512).]
Instruction Level Parallelism
ILP dropped in 17 of 22 applications: on average 4% at window size 128 and 10.9% at window size 512
  Dropped by half for 5 applications; Mixed-app ILP dropped by as much as 27.5%
Common case: independent loops are mapped to the GPU, leaving less regular, dependence-heavy code on the CPU
Occasionally, long dependent chains land on the GPU instead, e.g., Blackscholes (5 of 22 apps are such outliers)
The potential gains from larger instruction windows are going to be degraded
Branches
Branches categorized into 4 categories (classifier sketch below):
  Biased: > 95% in the same direction
  Patterned: > 95% accuracy on a very large local predictor
  Correlated: > 95% accuracy on a very large gshare predictor
  Hard: the remainder
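For reference, a minimal sketch of a gshare-style predictor in the spirit of the "correlated" category above. The slide fixes only the 95% threshold, so the table size, counter width, and update policy here are assumptions.

```cuda
#include <cstdint>
#include <vector>

class Gshare {
    std::vector<uint8_t> table;  // 2-bit saturating counters
    uint64_t history = 0;        // global branch history
    unsigned bits;
public:
    explicit Gshare(unsigned index_bits)
        : table(1u << index_bits, 1), bits(index_bits) {}

    // Predict, then update with the actual outcome; returns whether the
    // prediction was correct so a caller can accumulate per-branch accuracy.
    bool predict_and_update(uint64_t pc, bool taken) {
        size_t idx = (pc ^ history) & ((1u << bits) - 1);
        bool pred = table[idx] >= 2;
        if (taken  && table[idx] < 3) ++table[idx];
        if (!taken && table[idx] > 0) --table[idx];
        history = ((history << 1) | (taken ? 1 : 0)) & ((1u << bits) - 1);
        return pred == taken;
    }
};
// A static branch is "correlated" if this predictor is right > 95% of the
// time for it (and the branch was not already classified biased/patterned).
```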
[Chart: branch distribution for the CPU-Heavy apps, percentage of dynamic branches in each of the four categories. Segment averages: 55.2%, 24.7%, 13.1%, and 7.0%.]
[Chart: branch distribution for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 5.1% -> 9.4% (callout: effect of CPU-Heavy apps) and 11.3% -> 18.6% (callout: effects of data-dependent branches on GPU-Heavy apps).]
Overall: branch predictors tuned for generic CPU execution may not be sufficient.
Loads and Stores
Loads and stores categorized into 4 categories (classifier sketch below):
  Static: > 95% to the same address
  Strided: > 95% accuracy on a very large stride predictor
  Patterned: > 95% accuracy on a very large Markov predictor
  Hard: the remainder
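As with branches, a minimal sketch of the stride classifier: per static load/store, predict last_address + last_stride and track accuracy. Only the 95% cutoff comes from the slide; the per-PC table organization is an assumption.

```cuda
#include <cstdint>
#include <unordered_map>

struct StrideEntry {
    uint64_t last_addr = 0;
    int64_t  stride    = 0;
    uint64_t hits = 0, total = 0;
};

class StridePredictor {
    std::unordered_map<uint64_t, StrideEntry> table;  // keyed by memory-op PC
public:
    void access(uint64_t pc, uint64_t addr) {
        StrideEntry& e = table[pc];
        if (e.total > 0) {  // can only predict from the second access onward
            uint64_t predicted = e.last_addr + e.stride;
            if (predicted == addr) ++e.hits;
            e.stride = int64_t(addr) - int64_t(e.last_addr);
        }
        e.last_addr = addr;
        ++e.total;
    }
    // "Strided" if the prediction was right more than 95% of the time.
    bool is_strided(uint64_t pc) const {
        auto it = table.find(pc);
        return it != table.end() && it->second.total > 1 &&
               it->second.hits * 100 > (it->second.total - 1) * 95;
    }
};
```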
[Chart: distribution of non-trivial loads for the CPU-Heavy apps across the Hard / Patterned / Strided categories. Segment averages: 77.5%, 5.9%, and 16.6%.]
[Chart: distribution of non-trivial stores for the CPU-Heavy apps across the Hard / Patterned / Strided categories. Segment averages: 71.7%, 10.2%, and 18.1%.]
[Chart: distribution of non-trivial loads for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 44.4% -> 61.6% and 47.3% -> 27.0% (callout: effects of kernels with irregular accesses moving to the GPU).]
Overall: stride or next-line predictors will struggle.
[Chart: distribution of non-trivial stores for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. Annotated averages: 38.6% -> 51.3% and 48.6% -> 34.9%.]
Overall: slightly less pronounced, but similar results as for loads.
[Chart: vector (SSE) instructions as a percentage of dynamic instructions for the CPU-Heavy apps. Average: 7.3%.]
[Chart: SSE instructions as a fraction of dynamic instructions for Mixed and GPU-Heavy apps, CPU-only vs. with the GPU. The SSE share drops once kernels move to the GPU: 15.0% -> 8.5% and 16.9% -> 9.6%.]
Vector ISA enhancements target the same regions of code as the GPU.
[Chart: thread-level parallelism for the CPU-Heavy apps: speedup at 8 and 32 cores.]
[Chart: thread-level parallelism for Mixed and GPU-Heavy apps: speedup at 8 and 32 cores, CPU-only vs. with the GPU.]
The abundant parallelism in GPU-Heavy apps disappears (14.0x -> 2.1x): no gain going from 8 cores to 32 cores.
Mixed: gains drop from 4x to 1.4x.
Overall: only a 10% gain going from 8 to 32 cores; 32-core TLP dropped 60%, from 5.5x to 2.2x.
Overview
Motivation
Benchmarks and Methodology
Analysis: CPU Criticality, ILP, Branches, Loads and Stores, Vector Instructions, TLP
Impact on CPU Design
CPU Design in the post-GPU Era
Only modest gains from increasing window sizes
Considerably increased pressure on the branch predictor
  In spite of fewer static branches
  Adopt techniques targeting the few difficult branches (L-TAGE, Seznec 2007)
Memory accesses will continue to be a major bottleneck
  Stride or next-line prefetching becomes significantly less relevant
  Lots of literature, but never adopted in real machines (e.g., helper-thread prefetching or mechanisms targeting pointer chains)
SSE rendered significantly less important
  Not every core needs it; cores could share SSE hardware
Extra CPU cores/threads are of little use because of the lack of TLP
CPU Design in the post-GPU Era
(1) A clear case for big cores (with a focus on loads/stores/branches, not ILP) + GPUs
(2) Need to start adopting proposals for few-thread performance
(3) Start by revisiting old techniques from today's perspective
Backup
On Using Unmodified Source Code
The most common memory layout change is AOS -> SOA, which is still just a change in stride value; AOS is well captured by stride/Markov predictors
CPU-only code has even better locality, also well captured by stride/Markov predictors
But the locality-enhanced accesses are exactly the ones that map to the GPU, so using unmodified source has minimal impact on the analyzed CPU code: its accesses remain irregular
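To illustrate the AOS -> SOA point, a hypothetical sketch (the struct and field names are not from the benchmarks): converting AOS to SOA changes the stride of each field's access stream but keeps it a fixed stride, which is why a stride predictor captures both layouts.

```cuda
struct PointAOS { float x, y, z; };  // array of structures

struct PointsSOA {                   // structure of arrays
    float *x; float *y; float *z;
};

// AOS: consecutive iterations touch p[i].x with a fixed 12-byte stride.
float sum_x_aos(const PointAOS *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i].x;
    return s;
}

// SOA: the same field becomes a unit-stride (4-byte) stream, which is what
// lets GPU threads coalesce their loads when this code moves to the GPU.
float sum_x_soa(const PointsSOA &p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p.x[i];
    return s;
}
```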