
Uncovering the Multicore Processor Bottlenecks

Server Design Summit

Shay Gal-On
Director of Technology, EEMBC

Agenda
• Introduction
• Basic concepts
• Sample results and analysis

Who is EEMBC?

Industry standards consortium focused on benchmarks for the embedded market.

Formed in 1997, and includes most embedded silicon and tools vendors.

Provides standards for automotive, networking, office automation, consumer devices, telecom, Java, multicore and more.

CoreMark – Multicore Scalability

[Figure: CoreMark score vs. number of cores]

Cores      1       4        8        12        16
CoreMark   2365    9448.1   18859    28328.6   37765.5

Information provided by Cavium for CN58XX
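As a rough check on how close to linear the CoreMark numbers above are, the sketch below divides each score by the single-core score and by the core count. The scores and core counts are the ones shown on the slide; the function name `scaling_table` is only illustrative and is not part of CoreMark.

```python
# CoreMark scores reported on the slide for the Cavium CN58XX.
CORES = [1, 4, 8, 12, 16]
SCORES = [2365.0, 9448.1, 18859.0, 28328.6, 37765.5]


def scaling_table(cores, scores):
    """Return (cores, speedup, efficiency) tuples relative to the first entry."""
    base = scores[0] / cores[0]  # per-core score of the baseline run
    rows = []
    for n, score in zip(cores, scores):
        speedup = score / scores[0]
        efficiency = score / (base * n)  # 1.0 means perfectly linear scaling
        rows.append((n, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for n, speedup, eff in scaling_table(CORES, SCORES):
        print(f"{n:2d} cores: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
```

On these numbers the efficiency stays above 99% through 16 cores, which is why CoreMark alone says little about platform bottlenecks.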

Multicore Scalability: IP Forwarding

[Figure: IP forwarding throughput in millions of packets per second (axis 0 to 45) for CN5230-600 (4 cores), CN5640-800 (8 cores), CN5650-800 (12 cores) and CN5860-800 (16 cores).]

• Information provided by Cavium

MultiBench
• A suite of benchmarks from EEMBC, targeted at multicore in general.
• Help decide how best to use a system.
• Help select the best processor and/or system for the job.

Why MultiBench?

[Figure: "If cores were cars" analogy]

Workloads and Work Items
• Multiple algorithms
• Multiple datasets
• Decomposition

[Diagram: a Workload containing Work Item A0, Work Item A1 and Work Item B0, with concurrency within an item.]

Work Items and Workers

The threads working together on the same work item are referred to as workers.
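To make the terminology concrete, here is a minimal sketch of a workload made of work items, each processed by several workers. It is not the MultiBench API: the function names, the toy summing kernel and the reading of "A0" as "algorithm A, dataset 0" are assumptions for illustration. Python threads are used only to show the structure; they do not speed up CPU-bound work in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(data, num_workers):
    """Process one work item by splitting it across `num_workers` workers.

    Each worker (a thread here) sums one slice of the same item, which
    mirrors "concurrency within an item".
    """
    chunk = max(1, (len(data) + num_workers - 1) // num_workers)  # ceiling
    slices = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, slices))
    return sum(partial_sums)


def run_workload(work_items, num_workers):
    """A workload is a collection of work items; run them one after another."""
    return {name: process_item(data, num_workers)
            for name, data in work_items.items()}


if __name__ == "__main__":
    workload = {
        "A0": list(range(1000)),  # assumed: algorithm A, dataset 0
        "A1": list(range(2000)),  # assumed: algorithm A, dataset 1
        "B0": list(range(500)),   # assumed: algorithm B (same toy kernel here)
    }
    print(run_workload(workload, num_workers=4))
```

Running several work items at the same time, instead of one after another, would correspond to the "multiple streams" configurations discussed later.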

Workload Characteristics
• Important to understand inherent characteristics of a workload.
• Determine which workloads are most relevant for you.
• Valuable information along with the algorithm description to analyze performance results.

Classification with 8 characteristics

[Figure: one chart per workload, each plotting the same 8 characteristics: store, control, reg<=8, RL8, RL4K, WL4K, WL32K, Blocks 32..64.]

Workloads shown: 4M-check-reassembly-tcp, 4M-check-reassembly-tcp-cmykw2-rotatew2, 4M-tcp-mixed, 4M-check-reassembly-tcp-x264w2, ipres-4M, 4M-check-reassembly, iDCT-4M, md5-4M, 4M-cmykw2, 4M-cmykw2-rotatew2, 4M-x264w2, rotate-color1Mp, 4M-check, rotate-4Ms1, rotate-4Ms64, rotate-34kX128w1.

• Correlation-based feature subset selection + genetic analysis.
• 8 data points for 80% accuracy in performance prediction.

Tying it together
• Take a couple of workloads and analyze results on a few platforms, using characteristics to draw conclusions
• rotate-4Ms1 (One image at a time)
• rotate-4Ms1w1 (Multiple images in parallel)
• Same kernel with different run rules.

90deg image rotation

The platforms
• 3 core processor, 2 HW threads / core
• Soft core, tested on FPGA
• 8 core processor, 4 HW threads / core
• Many-core processor (> 8)
• GCC on all platforms.
• Same OS type (Linux) on all platforms.
• Same ISA.
• Load balance left to the OS to decide.

3-Core Image Rotation Speedup

Here we are using parallelism to speed up processing of one image.

[Figure: Rotate 4Ms1, multiple workers (1004K, 32K L1). Speedup (0% to 200%) vs. active contexts or workers (1 to 12). Series: No Cache, 512K L2.]

Analysis for 3 Core?
• Overall performance benefit for full configuration is 2.7x vs 2.1x. However, with 3 workers active, a system with L2 is almost twice as efficient as the one without. Not bad for a memory intensive workload.
• Use L2? 2 or 3 cores? Depends on the headroom you need for other applications…

Performance Results - Workers (“many core” device)
• Best performance at 5 cores active
• Likely due to sync and/or cache coherency effects

[Figure: rotate-4Ms1, 16 Core MIPS ISA Licensee. Speedup (0% to 300%) vs. workers (1 to 31).]

Performance Results - Streams (“many core” device)
• Best performance at 3 cores active
• Likely due to contention for memory.

[Figure: rotate-4Ms1, 16 Core MIPS ISA Licensee. Speedup (-40% to 120%) vs. concurrent streams (1 to 31).]

Analysis – Many Core Device?

Assuming a part of our target application shares similar characteristics with this kernel, we can speed up processing of a single stream by allocating ~4 cores per stream, and can efficiently process 2-3 streams at a time.

Platform Bottlenecks?
• Cache coherence and synchronization issues above 4 workers exposed for this type of workload (memory intensive).
• Memory contention exposed for multiple streams with that type of access
• 30% memory instructions * 3 streams saturate the memory, and above that memory contention kills performance.
• Splurge for the many-core version? What will you run on the other cores?
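The "30% memory instructions * 3 streams" estimate can be read as simple arithmetic. The sketch below is a toy model, not the method used to produce the results: it assumes every stream issues instructions at the same rate and that the memory system can serve about one access per instruction slot, so demand is the memory-instruction fraction times the number of streams. Both assumptions, the `capacity` parameter and the function names are ours.

```python
def memory_demand(mem_fraction, streams):
    """Aggregate memory accesses per instruction slot (toy model).

    Assumes every stream issues one instruction per slot and that
    `mem_fraction` of those instructions access memory.
    """
    return mem_fraction * streams


def max_streams_before_saturation(mem_fraction, capacity=1.0):
    """Largest stream count whose demand stays within `capacity`.

    `capacity` is an assumed memory throughput in accesses per slot.
    """
    if mem_fraction <= 0:
        raise ValueError("mem_fraction must be positive")
    streams = 0
    while memory_demand(mem_fraction, streams + 1) <= capacity:
        streams += 1
    return streams


if __name__ == "__main__":
    for s in range(1, 6):
        print(s, "streams -> demand", round(memory_demand(0.30, s), 2))
    print("streams before saturation:", max_streams_before_saturation(0.30))
```

Under these assumptions three streams use about 90% of the assumed capacity and a fourth exceeds it, which is consistent with the drop seen beyond 3 streams.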

8 core with 4 hardware threads / core
• Hardware threads enable 4x speedup.

8 core with 4 hardware threads / core
• Multiple streams scale even more (5.5x)
• Take care not to oversubscribe

[Figure: rotate-4Ms1w1, 8 cores, 4 HW threads/core. Speedup (0% to 600%) vs. number of contexts (1 to 31).]

IP Reassembly
• IP-reassembly workload over 4M, one platform actually drops in performance!
• Is it architecture or software that makes scaling difficult?

[Figure, 3 Core: packet reassembly (ipres-4M), 3 cores, 2 HW threads/core. Speedup (0% to 50%) vs. number of workers/contexts (1 to 12).]

[Figure, Many Core: packet reassembly (ipres-4M), 16 cores. Speedup (0% to 60%) vs. number of workers/contexts (1 to 30).]

[Figure, Different ISA: packet reassembly (ipres-4M), 4 cores, non MIPS. Speedup (-60% to 0%) vs. number of workers/contexts (1 to 7).]

Summary
• Use your multiple cores wisely!
• Understanding the capabilities of your platform is a key to your ability to utilize them, as much as understanding your code.
• Join EEMBC to use state of the art benchmarks or help define the next generation.
• More at www.eembc.org

Questions?

Let us look at MD5 (A different workload in the suite)
• Control – extremely low (mostly int ops)
• Memory access pattern – sequential
• Memory ops – 20%
• Typical for a computationally intensive workload.
• Same platforms as before

Speedup – 3 Core
• >3x for multiple streams (250% increase in performance)!
• 60% speedup for a single stream.

[Figure: MD5 Hash Workload (1004K, 512K L2). Speedup (0% to 300%), x-axis 1 to 12. Series: Multiple Workers, Multiple Streams.]
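The MD5 results quote both a factor (>3x) and a percentage (250% increase), so it helps to keep the two units straight. The helpers below are a small sketch that assumes the chart axes show percentage increase over the single-context baseline, as the "250% increase" wording suggests; the function names are illustrative.

```python
def increase_to_factor(percent_increase):
    """Convert a percentage increase over baseline to a speedup factor."""
    return 1.0 + percent_increase / 100.0


def factor_to_increase(factor):
    """Convert a speedup factor (e.g. 3.5 for 3.5x) to a percentage increase."""
    return (factor - 1.0) * 100.0


if __name__ == "__main__":
    print(increase_to_factor(250))  # multiple streams on 3 cores
    print(increase_to_factor(60))   # single stream on 3 cores
```

With this reading, a 250% increase corresponds to 3.5x, and the 60% single-stream speedup corresponds to 1.6x.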

More than 3x on 3 cores?

[Figure: Virtual CPU (VPE) effects, 1004K. Speedup (0 to 4) vs. cores active (1 to 3). Series: x264-4Mqw1:1VPE, x264-4Mqw1:2VPE, md5-4M:1VPE, md5-4M:2VPE, rotate-4Ms64w1:1VPE, rotate-4Ms64w1:2VPE.]

• Virtual CPU (thread) able to squeeze more performance for very little additional silicon.
• Only one of the 30 benchmarks in the suite did not gain performance from utilizing HW thread technology.

Performance Results (“many core”)
• Synchronization overhead comes into effect!
• Memory contention affirmed

[Figure: MD5 Hash Workload, 16 Cores MIPS ISA Licensee. Speedup (0% to 300%), x-axis 1 to 32. Series: md5-4M, md5-4Mw1.]

8 core with 4 threads / core
• Higher compute load makes hardware threads shine with 9x speedup on an 8 core system.
• Even single stream performance scales up to 5x.

Backup - Architect

Suite Analyzed
• A standard subset of MultiBench.
• All workloads limited to 4M working set size per context activated.
• 1 Context – 4M needed.
• 4 Contexts – 16M will be needed.
• Standardized run rules and marks capturing performance and scalability of a platform.
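The per-context working set rule above is a simple multiplication. The sketch below assumes "4M" means 4 MB and that every active context gets its own working set; the constant and function names are illustrative.

```python
WORKING_SET_PER_CONTEXT_MB = 4  # assumed: "4M" per activated context is 4 MB


def total_working_set_mb(contexts, per_context_mb=WORKING_SET_PER_CONTEXT_MB):
    """Memory needed when each active context gets its own working set."""
    if contexts < 1:
        raise ValueError("at least one context must be active")
    return contexts * per_context_mb


if __name__ == "__main__":
    for contexts in (1, 4, 32):
        print(contexts, "contexts ->", total_working_set_mb(contexts), "MB")
```

For example, fully loading the 8 core platform with 4 hardware threads per core (32 contexts) would need 128 MB under this rule.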

What information?
• ILP
• Dynamic and static instruction distribution
• Memory profile (static and dynamic)
• Cache effects
• Predictability
• Synchronization events
• … more available and analyzed as the industry adds new tools

Why MultiBench?
• Multicore is everywhere
• Current metrics misleading (rate, DMIPS, etc)
• Judging performance potential is much more complex (as if benchmarking was not complex enough).
• Hence our focus on benchmarking embedded multicore solutions.
• Need workloads close to real life

Important Workload Characteristics
• Memory: 35% of the instructions are memory operations, so memory activity is moderate and any memory bottlenecks will be multicore related.
• Control: extremely predictable, so any performance bottlenecks are not related to pipeline bubbles.
• Strides: read access is sequential or nearly so, while write access has a stride of ~4K. Combined with high cache reuse and the nature of the algorithm, this leads to cache coherency traffic.
• Sync: once per ~4K of data.
• Other? For this workload, the other characteristics do not provide additional insights.
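The stride pattern described above follows directly from how a 90 degree rotation maps pixels. The sketch below is a plain reference rotation, not the MultiBench kernel; the flat row-major layout, one byte per pixel and the 4096-row example are assumptions chosen to illustrate a ~4K write stride.

```python
def rotate_90_cw(src, width, height):
    """Rotate a row-major image 90 degrees clockwise.

    `src` is a flat sequence of width*height pixels. The result is a flat
    list for an image that is `height` wide and `width` tall.
    """
    dst = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            # Reads walk src sequentially; each write lands `height`
            # elements after the previous one in dst.
            dst[x * height + (height - 1 - y)] = src[y * width + x]
    return dst


def write_stride(height, bytes_per_pixel=1):
    """Byte distance between consecutive writes within one source row."""
    return height * bytes_per_pixel


if __name__ == "__main__":
    # 3 wide, 2 tall: rows [1, 2, 3] and [4, 5, 6]
    print(rotate_90_cw([1, 2, 3, 4, 5, 6], width=3, height=2))
    print(write_stride(height=4096))  # assumed image height
```

When several workers rotate different parts of the same image, their strided writes can fall into neighbouring regions of the destination, which is one plausible source of the coherency traffic noted above.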
