Uncovering the Multicore Processor Bottlenecks
Server Design Summit
Shay Gal-On, Director of Technology, EEMBC
Agenda
• Introduction
• Basic concepts
• Sample results and analysis
Who is EEMBC?
Industry standards consortium focused on benchmarks for the embedded market.
Formed in 1997, and includes most embedded silicon and tools vendors.
Provides standards for automotive, networking, office automation, consumer devices, telecom, Java, multicore and more.
CoreMark – Multicore Scalability
[Figure: CoreMark score vs. number of cores. Information provided by Cavium for CN58XX.]
Cores      1       4        8       12        16
CoreMark   2365    9448.1   18859   28328.6   37765.5
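The scores on this slide can be turned into speedup and per-core efficiency with a few lines. The sketch below is only an illustration: it assumes the five scores pair up in order with the core counts 1, 4, 8, 12 and 16, and the function name `scaling` is made up for this example.

```python
# CoreMark scores by core count, as listed on the slide (Cavium CN58XX).
# Assumption: the scores pair up in order with 1, 4, 8, 12 and 16 cores.
SCORES = {1: 2365.0, 4: 9448.1, 8: 18859.0, 12: 28328.6, 16: 37765.5}


def scaling(scores):
    """Return {cores: (speedup, efficiency)} relative to the 1-core score."""
    base = scores[1]
    result = {}
    for cores, score in sorted(scores.items()):
        speedup = score / base
        # Efficiency is the fraction of ideal linear scaling achieved.
        result[cores] = (speedup, speedup / cores)
    return result


if __name__ == "__main__":
    for cores, (speedup, efficiency) in scaling(SCORES).items():
        print(f"{cores:2d} cores: {speedup:5.2f}x speedup, "
              f"{efficiency:.1%} efficiency")
```

Under that pairing, every configuration stays above 99% of linear scaling, which is what makes CoreMark a useful contrast to the memory-bound workloads later in the talk.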
Multicore Scalability: IP Forwarding
[Figure: millions of packets per second (axis 0 to 45) for CN5230-600 (4 cores), CN5640-800 (8 cores), CN5650-800 (12 cores) and CN5860-800 (16 cores).]
• Information provided by Cavium
MultiBench
• A suite of benchmarks from EEMBC, targeted at multicore in general.
• Help decide how best to use a system.
• Help select the best processor and/or system for the job.
Why MultiBench?
If cores were cars
Workloads and Work Items
• Multiple algorithms
• Multiple datasets
• Decomposition
[Figure: a workload made up of work items A0, A1 and B0, with concurrency within an item.]
Work Items and Workers
Threads working on the same item are referred to as workers.
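To make the vocabulary concrete, here is a minimal sketch of a workload with several work items, each processed by its own group of workers. It is not the MultiBench harness or its API; `process_slice`, `run_item` and `run_workload` are hypothetical names, and the checksum kernel is a stand-in. In CPython, threads show the structure rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def process_slice(data, start, stop):
    """Hypothetical kernel: checksum one slice of a work item."""
    return sum(data[start:stop]) & 0xFFFFFFFF


def run_item(data, workers):
    """Process ONE work item with `workers` threads, one slice each."""
    if not data:
        return 0
    step = -(-len(data) // workers)  # ceiling division
    bounds = [(i, min(i + step, len(data)))
              for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda b: process_slice(data, *b), bounds))
    return sum(parts) & 0xFFFFFFFF


def run_workload(items, concurrent_items, workers_per_item):
    """Process several work items at once, each with its own workers."""
    with ThreadPoolExecutor(max_workers=concurrent_items) as pool:
        return list(pool.map(lambda d: run_item(d, workers_per_item), items))
```

With `concurrent_items=1` and several workers, parallelism is used inside a single item; with several concurrent items and one worker each, it is used across items. The results later in the talk compare these two modes.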
Workload Characteristics
• Important to understand inherent characteristics of a workload.
• Determine which workloads are most relevant for you.
• Valuable information along with the algorithm description to analyze performance results.
Classification with 8 characteristics
[Figure: one chart per workload, each with the axes store, control, reg<=8, RL8, RL4K, WL4K, WL32K and Blocks 32..64. Panels: 4M-check-reassembly-tcp, 4M-check-reassembly-tcp-cmykw2-rotatew2, 4M-tcp-mixed, 4M-check-reassembly-tcp-x264w2, ipres-4M, 4M-check-reassembly, iDCT-4M, md5-4M, 4M-cmykw2, 4M-cmykw2-rotatew2, 4M-x264w2, rotate-color1Mp, 4M-check, rotate-4Ms1, rotate-4Ms64, rotate-34kX128w1.]
Correlation-based feature subset selection + genetic analysis. 8 data points for 80% accuracy in performance prediction.
Tying it together
Take a couple of workloads and analyze results on a few platforms, using characteristics to draw conclusions.
• rotate-4Ms1 (one image at a time)
• rotate-4Ms1w1 (multiple images in parallel)
• Same kernel with different run rules.
90deg image rotation
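For readers who have not seen the kernel, a 90deg rotation that can be split among workers might look like the sketch below. This is not the MultiBench source. The clockwise direction, the flat row-major buffer and the band-per-worker split are all assumptions made for illustration, and the bands run one after another here to keep the example simple.

```python
def rotate90_band(src, dst, width, height, row_start, row_stop):
    """Rotate source rows [row_start, row_stop) 90deg clockwise into dst.

    src is a flat row-major buffer of width*height pixels; dst is a flat
    row-major buffer for the rotated image (height wide, width tall).
    """
    for y in range(row_start, row_stop):
        for x in range(width):
            # Source pixel (x, y) lands at column (height - 1 - y), row x.
            dst[x * height + (height - 1 - y)] = src[y * width + x]


def rotate90(src, width, height, workers=1):
    """Rotate a whole image, one band of source rows per (assumed) worker.

    Bands write disjoint parts of dst, so each band could be handed to a
    separate thread; they are executed sequentially in this sketch.
    """
    dst = [0] * (width * height)
    band = max(1, -(-height // workers))  # ceiling division, at least 1
    for row_start in range(0, height, band):
        row_stop = min(row_start + band, height)
        rotate90_band(src, dst, width, height, row_start, row_stop)
    return dst
```

In these terms, a rotate-4Ms1 style run gives several workers bands of the same image, while a rotate-4Ms1w1 style run gives each context a whole image of its own.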
The platforms
• 3 core processor, 2 HW threads / core
• Soft core, tested on FPGA
• 8 core processor, 4 HW threads / core
• Many-core processor (> 8)
• GCC on all platforms.
• Same OS type (Linux) on all platforms.
• Same ISA.
• Load balance left to the OS to decide.
3-Core Image Rotation Speedup
Here we are using parallelism to speed up processing of one image.
[Figure: "Rotate 4Ms1, multiple workers", 1004K, 32K L1. Speedup (0% to 200%) vs. active contexts or workers (1 to 12). Series: No Cache, 512K L2.]
Analysis for 3 Core?
• Overall performance benefit for full configuration is 2.7x vs 2.1x. However, with 3 workers active, a system with L2 is almost twice as efficient as the one without. Not bad for a memory intensive workload.
• Use L2? 2 or 3 cores? Depends on the headroom you need for other applications…
Performance Results - Workers ("many core" device)
• Best performance at 5 cores active
• Likely due to sync and/or cache coherency effects
[Figure: rotate-4Ms1, 16 Core MIPS ISA Licensee. Speedup (0% to 300%) vs. workers (1 to 31).]
Performance Results - Streams ("many core" device)
• Best performance at 3 cores active
• Likely due to contention for memory.
[Figure: rotate-4Ms1, 16 Core MIPS ISA Licensee. Speedup (-40% to 120%) vs. concurrent streams (1 to 31).]
Analysis – Many Core Device?
Assuming a part of our target application shares similar characteristics with this kernel, we can speed up processing of a single stream by allocating ~4 cores per stream, and can efficiently process 2-3 streams at a time.
Platform Bottlenecks?
• Cache coherence and synchronization issues above 4 workers exposed for this type of workload (memory intensive).
• Memory contention exposed for multiple streams with that type of access
• 30% memory instructions * 3 streams saturate the memory, and above that memory contention kills performance.
• Splurge for the many-core version? What will you run on the other cores?
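The "30% memory instructions * 3 streams" remark is back-of-the-envelope arithmetic, and it can be written down as such. The sketch below is a deliberately crude model and not a measurement: it assumes memory demand simply adds up across streams and that a total of 1.0 means the memory system is saturated. Both assumptions, and the function names, are illustrative only.

```python
import math


def memory_demand(streams, mem_fraction=0.30):
    """Summed memory-instruction share across streams.

    Assumption: demand adds linearly and 1.0 stands for saturation.
    """
    return streams * mem_fraction


def max_streams(mem_fraction=0.30, capacity=1.0):
    """Largest stream count whose summed demand stays within capacity."""
    return math.floor(capacity / mem_fraction)


if __name__ == "__main__":
    for n in range(1, 5):
        print(f"{n} streams: demand {memory_demand(n):.2f}")
    print("streams before saturation:", max_streams())
```

With a 30% share, three streams reach about 0.9 of the assumed capacity and a fourth exceeds it, which is the shape of the argument made on the slide.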
8 core with 4 hardware threads / core
• Hardware threads enable 4x speedup.
• Multiple streams scale even more (5.5x)
• Take care not to oversubscribe
[Figure: rotate-4Ms1w1, 8 cores, 4 HW threads/core. Speedup (0% to 600%) vs. number of contexts (1 to 31).]
IP Reassembly
• IP-reassembly workload over 4M, one platform actually drops in performance!
• Is it architecture or software that makes scaling difficult?
[Figure, 3 Core: packet reassembly (ipres-4M), 3 cores, 2 HW threads/core. Speedup (0% to 50%) vs. number of workers/contexts (1 to 12).]
[Figure, Many Core: packet reassembly (ipres-4M), 16 cores. Speedup (0% to 60%) vs. number of workers/contexts (1 to 30).]
[Figure, Different ISA: packet reassembly (ipres-4M), 4 cores, non MIPS. Speedup (-60% to 0%) vs. number of workers/contexts (1 to 7).]
Summary
• Use your multiple cores wisely!
• Understanding the capabilities of your platform is a key to your ability to utilize them, as much as understanding your code.
• Join EEMBC to use state of the art benchmarks or help define the next generation.
• More at www.eembc.org
Questions?
Let us look at MD5 (a different workload in the suite)
• Control – extremely low (mostly int ops)
• Memory access pattern – sequential
• Memory ops – 20%
• Typical for a computationally intensive workload.
• Same platforms as before
Speedup – 3 Core
• >3x for multiple streams (250% increase in performance)!
• 60% speedup for a single stream.
[Figure: MD5 Hash Workload, 1004K, 512K L2. Speedup (0% to 300%) vs. active contexts or workers (1 to 12). Series: Multiple Workers, Multiple Streams.]
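The slides mix two ways of quoting speedup: as a factor ("3x") and as a percentage increase over the baseline (the chart axes). A small conversion helper, with names chosen only for this example, shows how the two relate.

```python
def percent_to_factor(percent_increase):
    """Convert a percentage increase over baseline to a speedup factor."""
    return 1.0 + percent_increase / 100.0


def factor_to_percent(factor):
    """Convert a speedup factor to a percentage increase over baseline."""
    return (factor - 1.0) * 100.0


if __name__ == "__main__":
    print(percent_to_factor(250))  # multiple streams on the 3-core part
    print(percent_to_factor(60))   # single stream on the 3-core part
```

A 250% increase corresponds to a 3.5x factor, consistent with ">3x", and a 60% speedup corresponds to 1.6x.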
More than 3x on 3 cores?
[Figure: Virtual CPU (VPE) effects, 1004K. Speedup (0 to 4) vs. cores active (1 to 3). Series: x264-4Mqw1:1VPE, x264-4Mqw1:2VPE, md5-4M:1VPE, md5-4M:2VPE, rotate-4Ms64w1:1VPE, rotate-4Ms64w1:2VPE.]
• Virtual CPU (thread) able to squeeze more performance for very little additional silicon.
• Only one of the 30 benchmarks in the suite did not gain performance from utilizing HW thread technology.
Performance Results ("many core")
• Synchronization overhead comes into effect!
• Memory contention affirmed
[Figure: MD5 Hash Workload, 16 Cores MIPS ISA Licensee. Speedup (0% to 300%) vs. active contexts or workers (1 to 32). Series: md5-4M, md5-4Mw1.]
8 core with 4 threads / core
• Higher compute load makes hardware threads shine with 9x speedup on an 8 core system.
• Even single stream performance scales up to 5x.
Backup - Architect
Suite Analyzed
• A standard subset of MultiBench.
• All workloads limited to 4M working set size per context activated.
• 1 Context – 4M needed.
• 4 Contexts – 16M will be needed.
• Standardized run rules and marks capturing performance and scalability of a platform.
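The working-set rule above scales linearly with the number of active contexts, which matters when sizing memory for the larger platforms. A minimal sketch, keeping the slide's unit "M" as-is (presumably megabytes, which is an assumption) and using a made-up function name:

```python
WORKING_SET_PER_CONTEXT_M = 4  # from the slide: 4M per activated context


def working_set_m(contexts):
    """Total working set, in the slide's "M" units, for active contexts."""
    if contexts < 1:
        raise ValueError("at least one context must be active")
    return contexts * WORKING_SET_PER_CONTEXT_M


if __name__ == "__main__":
    for contexts in (1, 4, 32):
        print(f"{contexts} contexts: {working_set_m(contexts)}M")
```

For example, running all 32 contexts of the 8 core, 4 HW threads / core platform would need 128M under this rule.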
What information?
• ILP
• Dynamic and static instruction distribution
• Memory profile (static and dynamic)
• Cache effects
• Predictability
• Synchronization events
• … more available and analyzed as the industry adds new tools
Why MultiBench?
• Multicore is everywhere
• Current metrics misleading (rate, DMIPS, etc.)
• Judging performance potential is much more complex (as if benchmarking was not complex enough).
• Hence our focus on benchmarking embedded multicore solutions.
• Need workloads close to real life
Important Workload Characteristics
• Memory: 35% of the instructions are memory → moderate memory activity → any memory bottlenecks will be multicore related.
• Control: extremely predictable → any performance bottlenecks are not related to pipeline bubbles.
• Strides: read access is sequential or nearly so, while write access has a stride of ~4K. Combined with high cache reuse and the nature of the algorithm → cache coherency traffic.
• Sync: once per ~4K of data.
• Other? For this workload, the other characteristics do not provide additional insights.
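The stride pattern described above (sequential reads, writes ~4K apart) is what a 90deg rotation of a row-major image produces. The sketch below computes the byte offsets touched while rotating one source row. The image dimensions (1024 x 4096, 1 byte per pixel, 4M in total) and the clockwise direction are hypothetical values picked to match the ~4K figure, not values taken from the benchmark.

```python
def access_offsets(width, height, bytes_per_pixel, y, count):
    """Byte offsets of the first `count` reads and writes for source row y.

    Assumes a 90deg clockwise rotation of a flat row-major image.
    """
    reads = [(y * width + x) * bytes_per_pixel for x in range(count)]
    writes = [(x * height + (height - 1 - y)) * bytes_per_pixel
              for x in range(count)]
    return reads, writes


def strides(offsets):
    """Differences between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


if __name__ == "__main__":
    # Hypothetical image: 1024 x 4096 pixels, 1 byte per pixel.
    reads, writes = access_offsets(1024, 4096, 1, y=0, count=4)
    print("read strides: ", strides(reads))
    print("write strides:", strides(writes))
```

With these dimensions the reads advance one byte at a time while every write lands 4096 bytes after the previous one, so consecutive writes touch different cache lines and, with several workers, lines that other cores may also hold.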