TRANSCRIPT
© 2017 Synopsys, Inc. 1
High Performance, Energy Efficient Implementation of
ARM® Processors
Alan Gibbons, Principal Consultant, Synopsys Inc.
September 13th, 2017
[Figure: the power gap. Source: ITRS System Driver Chapter 2010 Updates]
• User experience demands:
• High levels of device functionality, HD graphics, fast response, long battery life
• Can continue to extend CPU processing power to satisfy the performance requirements
• Cannot continue to extend energy source to match – battery technology is not evolving fast enough
• Solution:
• Need to become much smarter in how we spend our power/energy budget
• Must maximize our entitled performance within an energy (or power) budget
• Energy efficiency is a system problem and needs a system solution, so industry collaboration is also essential
The Power Gap for Application Processors
Year   Frequency   Process
2010   1.0GHz      40nm
2013   2.0+GHz     28nm+
2016   3.0+GHz     16nm
2017   3.5+GHz     7nm
• ARM & Synopsys have been collaborating for over 20 years
– Rapid, optimized implementation and verification of synthesizable ARM IP and sub-systems
– Low power methodology and the development of energy efficient ARM based sub-systems
– High performance verification, emulation and prototyping
• Why collaborate?
– To deliver a better user experience and better QoR
– Get early versions of IP, libraries and tools working together from the outset
– De-risk the implementation for our mutual customers
Collaboration Enables Early Adoption of ARM’s Latest IP
• Development of optimized implementations
– ARM Cortex-A75 CPU
– ARM Cortex-A55 CPU
– DynamIQ™ Shared Unit (DSU)
• Tuned implementations
– Optimizing performance, power and area
– TSMC 16FF+ technology
– ARM libraries and Synopsys EDA tools
• Available now
– RIs are available for download from SolvNet
– New QuickStart Implementation Kits (QIKs)
– Additional support via Synopsys Design Services
Latest Collaboration – ARM DynamIQ™ CPU Sub-System
Cortex-A75 Implementation – Major Challenges
Power: Analyze Library for Balanced Power/Performance/TAT
• OCV
• Sequential flops
• Multi-Vt & gate-length
• Multibit
Performance: Manage Crosstalk and Optimization for Best Frequency
• Concurrent Clock & Data (CCD)
• Global route-based optimization
• Crosstalk optimization
• PrimeTime delay calc
Area: Determine an Optimum Floorplan
• Placement controls
• Macro placement
• Data flow analysis
• Bounds
Performance and Power Managed Concurrently
Reduce Power / Meet Timing
IC Compiler II
• place_opt CCD
• New global route-based optimization
• CCS receiver cap modeling
• PrimeTime delay calc in route_opt
• Redundant via insertion
• Incremental timing-driven multibit register banking and de-banking
• Clock gating optimization
• Low power placement
• High effort leakage flow
DC Graphical
• Timing-driven multibit register banking and de-banking
• Physical-aware clock gating
• Low power placement
• Enhanced physical guidance (eSPG)
• Enhanced layer-aware optimization
• Placement pre-clustering
PrimeTime ECO
• Path-based analysis (PBA)
• Clock skew ECO
• Physical-aware ECO
• Leakage-aware timing ECO
Cortex-A75 CPU Power Optimization Flow
• Deliver energy efficient performance
– Not simply “high performance, low power”; rather, highest entitled performance within a power/energy budget
– Optimal point(s) on a power vs. performance curve
• Key considerations
– Vt class availability
– Multibit (MB) banking/de-banking
– Leakage vs. timing vs. dynamic optimization
– Leave headroom (both timing and power) for ECO
• Library impacts all these decisions
Meet Power Target
• MB banking/de-banking: 1bit, 2bit, 4bit
• VT selection: across 12 Vt/channel options
• Flop selection: QL (leakage) vs. Q (std) vs. QA (area)
• SI TNS reduction: very congested, clock NDRs
• fix_eco_power to meet leakage target, expect 15-20% reduction
Tools: IC Compiler II, DC Graphical, PrimeTime ECO
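The idea of choosing an optimal point on a power vs. performance curve can be sketched in a few lines of Python. This is an illustration only, not part of the reference flow: the run names, frequencies, power figures and the 800 mW budget are all made-up values, and `pareto_front` and `best_within_budget` are hypothetical helper names.

```python
from typing import NamedTuple, Optional


class OperatingPoint(NamedTuple):
    name: str        # hypothetical label for one implementation run
    fmax_ghz: float  # achieved frequency (made-up value)
    power_mw: float  # total power at that frequency (made-up value)


def pareto_front(points):
    """Keep the points that no other point matches or beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q.fmax_ghz >= p.fmax_ghz and q.power_mw <= p.power_mw
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.power_mw)


def best_within_budget(points, budget_mw) -> Optional[OperatingPoint]:
    """Return the highest-frequency point that fits the power budget, or None."""
    feasible = [p for p in points if p.power_mw <= budget_mw]
    return max(feasible, key=lambda p: p.fmax_ghz, default=None)


# Assumed results from four trial runs (illustrative numbers only).
RUNS = [
    OperatingPoint("run_a", 2.0, 600.0),
    OperatingPoint("run_b", 2.4, 750.0),
    OperatingPoint("run_c", 2.3, 800.0),  # slower and hungrier than run_b
    OperatingPoint("run_d", 2.8, 1000.0),
]

FRONT = pareto_front(RUNS)
CHOICE = best_within_budget(FRONT, budget_mw=800.0)
print([p.name for p in FRONT], CHOICE.name)
```

Here the "entitled performance" is the fastest point on the front that still fits the budget; tightening the budget moves the choice down the curve rather than rejecting the design outright.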
Vt Distribution Through Cortex-A75 Full Flow
• In DC, datapath delay is prioritized; faster cells are used
• Pessimism reduces through the flow and CTS brings in useful skew; cells are swapped for power
• ECO brings in 4 new SVT classes and does positive slack recovery
[Chart: Vt class/channel distribution (ULVT, LVT, SVT, each at c16, c18, c20, c24) at DC Graphical, ICC II route_opt and PrimeTime ECO]
Vt Class/Channel Mix Changes As Implementation Progresses
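Positive slack recovery can be pictured as trading spare timing slack for lower leakage by swapping cells to a slower Vt class. The sketch below is a deliberately simplified model and not the algorithm used by any Synopsys tool: the delay and leakage numbers are invented, each cell is treated as if it sat alone on its own path, and `recover_leakage` is a hypothetical helper.

```python
# Vt classes from fastest/leakiest to slowest/least leaky.
VT_ORDER = ["ULVT", "LVT", "SVT"]

# Assumed per-cell delay (ps) and leakage (arbitrary units); not library data.
DELAY_PS = {"ULVT": 10.0, "LVT": 13.0, "SVT": 17.0}
LEAKAGE = {"ULVT": 9.0, "LVT": 3.0, "SVT": 1.0}


def recover_leakage(cells):
    """Swap each cell to the slowest Vt that keeps its slack non-negative.

    cells: list of dicts with 'name', 'vt' and 'slack_ps'.
    Simplification: each cell's slack is independent of every other cell.
    """
    result = []
    for cell in cells:
        idx = VT_ORDER.index(cell["vt"])
        slack = cell["slack_ps"]
        while idx + 1 < len(VT_ORDER):
            extra = DELAY_PS[VT_ORDER[idx + 1]] - DELAY_PS[VT_ORDER[idx]]
            if slack - extra < 0:
                break  # the next swap would create a timing violation
            slack -= extra
            idx += 1
        result.append({"name": cell["name"], "vt": VT_ORDER[idx], "slack_ps": slack})
    return result


def total_leakage(cells):
    return sum(LEAKAGE[c["vt"]] for c in cells)


CELLS = [
    {"name": "u1", "vt": "ULVT", "slack_ps": 8.0},
    {"name": "u2", "vt": "ULVT", "slack_ps": 5.0},
    {"name": "u3", "vt": "ULVT", "slack_ps": -4.0},  # already violating: untouched
    {"name": "u4", "vt": "LVT", "slack_ps": 4.0},
]

RECOVERED = recover_leakage(CELLS)
print(total_leakage(CELLS), total_leakage(RECOVERED))
```

In a real design, cells share paths, so slack consumed by one swap is no longer available to its neighbours; a timing engine has to re-evaluate slack after each change.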
Floorplanning
• Floorplanning is, and always has been, a key element in ARM CPU implementation
• Module and macro placement are critical for hitting aggressive QoR targets
• Design topology, power switch mesh and power supply network are also key considerations
• Macro placement, bounds and blockages impact both timing and power
[Figure: floorplans of Cortex-A57, Cortex-A72 and Cortex-A73, annotated with bounds, blockage and magnet placement. Sources: SV SNUG 2016, SV SNUG 2015]
QoR Challenges in Placement
• Design topology
– Analysis of critical module timing on Cortex-A75 CPU
• Critical paths to & from CORE sub-module
– CORE connects heavily to DSIDE and DENGINE
– Critical paths seen throughout the flow (challenging to fix downstream)
• CORE being “pushed” out of center of core area and near IOs
• Created FMAX-limited paths due to long-path buffering across block
[Figure: floorplan showing DENGINE, DSIDE, ISIDE and CORE module placement]
Cortex-A75 CPU Floorplan Changes
• Move RAMs to guide module placement
• Move CORE module back into the center
Floorplan changes that allowed CORE to float to the center and close to DSIDE
[Figure: two floorplans showing CORE, DSIDE, ISIDE and DENGINE module placement]
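One rough way to see why pulling CORE back toward the center helps is to compare a connectivity-weighted Manhattan distance between module centers for the two placements. The sketch below is only a toy model: the coordinates and connection weights are invented for illustration (the slides state only that CORE connects heavily to DSIDE and DENGINE), and `weighted_wirelength` is a hypothetical helper.

```python
def weighted_wirelength(centers, connections):
    """Sum of weight * Manhattan distance between connected module centers."""
    total = 0.0
    for (a, b), weight in connections.items():
        (ax, ay), (bx, by) = centers[a], centers[b]
        total += weight * (abs(ax - bx) + abs(ay - by))
    return total


# Assumed relative connection strengths (arbitrary units).
CONNECTIONS = {
    ("CORE", "DSIDE"): 5,
    ("CORE", "DENGINE"): 4,
    ("CORE", "ISIDE"): 2,
}

# Assumed module centers on an arbitrary grid.
BEFORE = {
    "CORE": (1.0, 1.0),  # pushed toward the edge, near the IOs
    "DSIDE": (6.0, 4.0),
    "DENGINE": (3.0, 6.0),
    "ISIDE": (6.0, 1.0),
}
AFTER = dict(BEFORE, CORE=(4.0, 4.0))  # CORE back near the center

print(weighted_wirelength(BEFORE, CONNECTIONS))
print(weighted_wirelength(AFTER, CONNECTIONS))
```

A lower weighted distance suggests fewer long, heavily buffered paths between CORE and its main neighbours, which is the effect the floorplan change was after; actual results depend on the placer and the real netlist.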
Crosstalk: An Ongoing Challenge on ARM Cores
• Lower geometry processes always have a crosstalk component
• ARM CPUs have traditional SI prevention
– Clock NDRs
– Congestion-aware placement
– Logic and density controls
• The Cortex-A73 flow used NDRs to dramatically reduce crosstalk
• We have used all these techniques plus more on the Cortex-A75 CPU
[Figure source: SV SNUG 2016]
Early Results - Cortex-A75 CPU
[Chart: First Full Flow Trial. TNS, WNS and Leakage (%) at the place_opt, clock_opt, route_auto and route_opt stages]
• When starting the Cortex-A75 CPU flow, we used best practices from the Cortex-A73 RI
– Clock NDRs
– Crosstalk threshold noise ratio of 20%
– Congestion optimization for placer settings
– 80ps max_transition limit
• Early results: crosstalk still an issue
– Large TNS increase at route_auto stage
– Leakage increase at route_opt stage
More attention needed to address crosstalk
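For readers less familiar with the two timing metrics tracked here, the sketch below shows how WNS and TNS are commonly derived from per-endpoint slacks, and how crosstalk delta delay added at routing can increase TNS even when WNS does not move. The slack and delta values are invented, and the convention that WNS is reported as 0 when no endpoint violates is an assumption; tools differ on this.

```python
def wns(slacks):
    """Worst negative slack: the most negative endpoint slack, 0.0 if none violate."""
    return min(min(slacks, default=0.0), 0.0)


def tns(slacks):
    """Total negative slack: the sum of all negative endpoint slacks."""
    return sum(s for s in slacks if s < 0)


# Assumed endpoint slacks before routing, in ps (illustrative only).
PRE_ROUTE = [12.0, -5.0, -20.0, 3.0, -1.5]

# Assumed crosstalk delta delay seen by each endpoint after routing, in ps.
DELTA_DELAY = [2.0, 4.0, 0.0, 6.0, 0.5]

POST_ROUTE = [s - d for s, d in zip(PRE_ROUTE, DELTA_DELAY)]

print(wns(PRE_ROUTE), tns(PRE_ROUTE))
print(wns(POST_ROUTE), tns(POST_ROUTE))
```

In this toy case the worst path is unchanged, but several paths that were passing or nearly passing now violate, so TNS grows; this is one reason TNS is a useful indicator of how widespread a crosstalk problem is.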
Crosstalk Mitigation – Cortex-A75 CPU
New optimization solutions with better control of placement in both core area and macro channels resulted in dramatic FMAX & power improvements
[Figure: before and after plots; scale: decreasing net crosstalk delta delay]
Concurrent Clock and Data Optimization - CCD
place_opt with data only: higher area & power
place_opt w/ useful skew: lower area & power
[Figure: useful skew example. Labels: -100ps, 200ps; 50ps, 50ps; Delay 150ps; CLK 90ps, 10ps; Apply useful skew; Datapath area/power recovery; Size down/swap LVT; Size up clock buf]

Cortex-A75 CPU              WNS    TNS    Leakage Power
place_opt + route_opt       -23    -95    100%
place_opt CCD + route_opt   -20    -68    98%

Cortex-A75 CPU              WNS    TNS    Leakage Power
route_opt (Baseline)        -112   -134   100%
Power CCD + route_opt       -99    -126   99%
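The arithmetic behind useful skew is simple enough to sketch. Delaying the clock of a flop gives extra time to paths that end at it and takes the same amount from paths that start at it. The example assumes the slide's numbers read as follows: a -100ps slack into the flop and a 200ps slack out of it become 50ps each once the clock is delayed by 150ps. That reading, and the helper names, are assumptions; hold timing is ignored.

```python
def slacks_after_skew(slack_into_ps, slack_out_of_ps, skew_ps):
    """Setup slacks on both sides of a flop after delaying its clock by skew_ps."""
    return slack_into_ps + skew_ps, slack_out_of_ps - skew_ps


def balancing_skew(slack_into_ps, slack_out_of_ps):
    """Clock delay that equalizes the setup slack on both sides of the flop."""
    return (slack_out_of_ps - slack_into_ps) / 2.0


# Assumed starting point, taken from the figure labels: -100ps in, 200ps out.
SKEW = balancing_skew(-100.0, 200.0)
BALANCED = slacks_after_skew(-100.0, 200.0, SKEW)
print(SKEW, BALANCED)
```

Without skew, the violating path would have to be fixed in the datapath with larger or lower-Vt cells, which costs area and power; borrowing time through the clock removes the violation and leaves room to size down or swap cells instead.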
DC Graphical + IC Compiler II + PrimeTime ECO
Results on Cortex-A75 CPU
[Chart: Cortex-A75 CPU PPA across implementation stages (DCG, place_opt, clock_opt, route_opt, Signoff ECO). Series: TNS (ns), FMAX (%), Leakage (%). Axes: TNS (ns); % of final FMAX and leakage]
Summary
Manage power and performance concurrently – power is not an afterthought
Analyze library for balanced power, performance and turnaround time
Determine an optimum floorplan – module and memory placement critical
Manage crosstalk as well as concurrent optimization of clock and data
Latest RIs Available Now
Synopsys Reference Implementations (RIs) for Cortex-A75/-A55 are ready
• CPU and DSU flows
• TSMC 16nm FFC process
• ARM POP™ IP – core optimized standard cells & fast cache RAMs
• Complete implementation and static verification flows
Contact your Synopsys AC for additional information
RIs available on SolvNet (solvnet.synopsys.com/ARM-RI)