45nm next generation intel® core™ microarchitecture (penryn) · 2013-07-28 · 45nm next...
TRANSCRIPT
45nm Next Generation
Intel® Core™
Microarchitecture (Penryn)
HOT CHIPS 2007
Varghese GeorgeVarghese George
Principal Engineer, Intel Corp
2
Legal DisclaimerToday’s presentation may contain forward-looking statements. All statements made that are not historical facts are subjectto a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent EarningsRelease and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors thatcould cause actual results to differ.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTEDBY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCHPRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIEDWARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIESRELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, lifesaving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arisingfrom future changes to them.
The Intel® Core™ Microarchitecture, Intel® Pentium, Intel® Pentium II, Intel® Pentium III, Intel® Pentium 4, Intel® PentiumPro, Intel® Pentium D, Intel® Pentium M , Itanium®, Xeon® may contain design defects or errors known as errata whichmay cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your productorder.
This document contains information on products in the design phase of development. Do not finalize a design with thisinformation. Revised information will be published when the product is available. Verify with your local sales office that youhave the latest datasheet before finalizing a design.
All dates specified are target dates, are provided for planning purposes only and are subject to change.
All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposesonly, and are subject to change without notice.
Intel and the Intel logo is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States andother countries.
*Other names and brands are the property of their respective owners.
Copyright © 2007, Intel Corporation
3
Outline
Next Generation 45nm Process Technology benefits
Overview of Penryn Architecture and New Features
Highlights of New Microarchitecture changes
New Power Management Features in Penryn
� Deep Power Down Technology
� Enhanced Dynamic Acceleration Technology
� CC3 in Servers
Summary
4
Next generation 45 nm Hi-K metal gate process
technology: Continuing Moore’s Law65 nm
Intel® Core™ 2 Duo143 mm2
45 nmPenryn family die (dual core)
107 mm2
4-wide Dynamic ExecutionAdvanced smart cache: 4M L2Smart Memory accessAdvanced digital media boostIntelligent power capability
Same good stuff plus:- 6M L2 cache- 47 new SSE4 instructions- Micro-arch enhancements for further IPC perf- Deep Power Down Technology*- Enhanced Dynamic Acceleration Technology*
Penryn advantages with this next generation process:
� Lower transistor switching power and leakage current => reduced idle power
=> longer battery life, quieter systems
� Up to 2x improvement in transistor density: Significant area scalability to
support additional capabilities within the existing power envelope
� Higher core and bus frequency for more performance in the same power
envelope* Mobile only features
5
Enhanced Intel® Core™ Microarchitecture
Intel Core MicroarchitectureIntel Core Microarchitecture New with the Penryn FamilyNew with the Penryn Family
Improved Store Forwarding
Higher bus speeds
Intel SSE4 instructions
Super Shuffle Engine
Deep Power Down Technology
Enhanced Intel Dynamic Acceleration Tech
Fast Radix-16 Divider
Faster OS Primitive Support
Enhanced Intel Virtualization Technology
Intel® Wide Dynamic Execution
Intel® Advanced Smart Cache
Intel® Smart Memory Access
Intel® Advanced Digital Media Boost
Intel® Intelligent Power Capability
Larger Caches: up to 12MB
24 Way Set Associativity
Increased Performance and Energy Efficiency Across Applications
Covered briefly in this presentation Focus of this presentation
6
IntelIntel®®
SSESSEIntelIntel®®
SSE4SSE4
IntelIntel®®
SSE2SSE2
1999 2000
IntelIntel®®
SSE3SSE3
2004
IntelIntel®®
SSSE3SSSE3
2006
200770
instructions
�Single-PrecisionVectors,
�streamingoperations
144instructions
�Double-precisionVectors
�128-bitvectorinteger
13instructions
�ComplexArithmetic
32instructions
� Decode47 instructions
� Video Accelerators
� Graphics building blocks
� Coprocessor Accelerators
�� Penryn supports 47 new instructionsPenryn supports 47 new instructions
�� Largest set of new instructions introduced since 2000Largest set of new instructions introduced since 2000
�� This continues a trend set by SSE, SSE2, SSE3, and SSSE3This continues a trend set by SSE, SSE2, SSE3, and SSSE3
�� Addresses some popular requests for key functionalityAddresses some popular requests for key functionality�� Features to improve the compilerFeatures to improve the compiler’’s ability to efficiently vectorize codes ability to efficiently vectorize code
�� Video Encode acceleration functionsVideo Encode acceleration functions
�� Floating-point dot product operation (3D content)Floating-point dot product operation (3D content)
�� Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)
New Intel® SSE4 Instructions
7
Super Shuffle Engine
. . .
Input data (2 x 16 bytes)
. . .
Output data (16 bytes)
src1
src2
logic op
� Shuffle operations required for SSE
data formatting operations
� Unpacking
� Packing
� Align concatenated sources
� Wide shifts
� Insertion and extraction
� Setup for horizontal arithmetic
functions
� Penryn super shuffle engine
performs 128 bit operation in a
single cycle
� No software changes required
Super
Shuffle
engine
. . .logic op
. . .
B0 B1 B2 B15. . .B3 B4 B14
B0 B1 B2 B15. . .B3 B14B4
B0 B1 B2 B15. . .B3 B14B4
dest
Doubles shuffle throughput
8
New Radix-16 divider architecture
� Leading edge divider performance
� >50% speed-up over previous generations on average
� Innovative radix-16 based architecture
� Double pumped: processes 2x2 bits per cycle
� Hybrid 68-bit CSA/CPA based for faster pipelined
quotient select
� Optimized square root
Pre-processing
hybrid 68badder
QSL
hybrid 68badder
QSL
Post-processing
Quotient/remainder
Dividend Divisor
<<2
<<2
EP
square
root
DP
square
root
SP
square
root
EP
divide
DP
divide
SP divide32-bit
int
divide
Clo
cks
Penryn Previous generation
Maximum Processor Instruction Latency(lower is better)
9
Streaming Load Instruction� 16 byte Aligned load instruction on WC (write combining) memory
� Improves read bandwidth from WC mem by reading cache-line size quantities
GFX
High BW Writes
But … Low BW Reads
Graphics Mem
GFX
DD
R3
DD
R3
Fast Streaming Reads!
800 MB/s (partial reads) 800 MB/s (partial reads)
Up to 6.4 GB/s (cache line reads)Up to 6.4 GB/s (cache line reads)
Wit
h S
tream
ing
Lo
ad
sExis
tin
gP
latf
orm
s
45nm Next
Generation
Core ™
Core ™
Streaming Load is ~8X faster reading from WC Memory
Temp Buffer inside
L1 Data Cache
A B C D
L2 Cache
Bus
4th Access3rd Access
2nd Access1st Access
(Data usage in processor core)
DD
R3
DD
R3
10
Penryn builds on the capabilities in the Core™ microarchitecture
� Fine grained clock gating during runtime: Applied all across the chip
� Sleep transistors: In large arrays such as L2 cache to reduce leakage power
� C-states: Various low power states during CPU idle
� P-states: Multiple performance states for run-time energy efficiency
� Thermal Monitor: Self-managed thermal control based on multiple thermal sensors
� Digital Thermal Sensor: For software access to CPU temperature
� Intelligent Voltage Regulator control: To optimize VR efficiency during low loads
� PECI interface: For side-band access to temperature for fan control
Key New Mobile Power/Thermal Features in Penryn
� Deep Power Down Technology (DPD): A radically new and innovative
idle power management state
� Enhanced Dynamic Acceleration Technology (EDAT): A deterministic
and (TDP) power-neutral performance boost for Single Threaded apps
Power Management Overview
11
Deep Power Down Technology Overview
Vcc: << Retention
Core clk: OFF
PLL: OFF
Caches: OFF
State in DPDState in DPDFlexible C-States to Select Idle Power Level vs. Responsiveness
Core voltage*
Core clock
PLL
L1 caches
L2 cache
Wakeup time*
Idle power*
C0 C1 C3 C4
off
Active state
off
f lushedflushed
off
off
f lushedflushed
off
partial flushpartial flush
active
* Rough approximation
Deep Power Down
technology state
off
off
off
off
New
fo
r Penry
n
Deep Power Down
technology state
off
off
off
off
New
fo
r Penry
n
Concept: When idle, required CPU state is saved in a power-preserved domain onchip and then power is shut off
Exit: Upon a break event, Power/Clocks are turned on, State is restored andexecution resumes seamlessly from where it left off transparent to OS / applications
� Can occur few 100s of times per second (between keystrokes)
DPD enables reaching lower limit of CPU idle power of 0 W
� Lowest Leakage CPUstate you can get to!
� Agnostic to min Vccstate retention issues
Key benefits of DPDKey benefits of DPD
12
Penryn DPD Implementation Overview
8KB SRAMs
4 Major parts:
STATE STORAGE:
� 8KB (per core)
� Powered from I/O Vcc (VccP)
� ECC protected
STATE DEFINITION:
� What to include?
� Criteria: “Software seamless”
� Inclusions:� All Architectural state
� Most micro-architectural state
� Exclusions:� Temp registers used by ucode
� Some others on a case by case basis
VccP
VccP
MICROCODE:
� Does state save and restore
� Manages core synchronization
� Supports Auto-demote policy
PM UNIT (“hardware”):
� Coordinates cores at hardware level
� Manages the DPD Power up sequence
� Manages entry/exit protocol with platform
13
Deep Power Down Technology Entry/Exit
OS: Decides
to idle the
processors
OS: execute
MWAIT
instruction
CPU: shrink L2;
save uarch
state
CPU: signal
chipset to
enter DPD
Deep power
down state
Interrupt
break
event
CPU: Internal
RESET,
PLL relock etc
CPU: restore
Arch/uarch state,
Expand L2
CPU: continue
execution to
next instr.
Chipset
VRM
I/O
requests
FSB I/FVID control
Vcc
VccP
C/S: stops
CPU clk, blk I/O
CPU: VRM dn
� S/W instruction initiates processor DPD entry
� CPU does rest of sequencing with chipset/platform
� Protocol with chipset to block snoops due to DMAtraffic (no CPU wakeup required) while in DPD state
� Exit initiated by an break event in platform (interrupt)through the chipset
� CPU sequences through rest of exit sequence –including driving VID to VRM, internal hardwarereset, state restore and execution resumption
C/S: signals
CPU wakeup,
VRM ramp, bclk
Small amout
of logic is left
powered
up by VccP
during DPD
VccP
14
Interrupt Rate Sensitivity of Average Power
Deeper C-states have non-trivial energy cost for transitions
Today, OS makes C-state decisions based on latency, not energy cost
� Latency could be ok, but energy cost could be significantly different
Too frequent transitions into deep C-states could result in net energy loss
� Worse battery life
Penryn implements a “Auto-
demote” policy in the CPU to
address this issue� Heuristics to detects when energy
cost of DPD state entry and exit is
potentially > savings in DPD state
� Demotes such cases to shallower
C-state (C4), although the OS
requested DPD
� Silicon results prove that this was a
good decision!
Limit
Avera
ge P
ow
er
Example of Average Power MM02
1ms0.5ms 3ms2ms 5ms4ms 15ms10ms 50ms20ms
Interrupt Rate
with Deep Power Down technologywith C4
Deep Power Down crossover point: ~3-4ms
Source: Intel
15
DPD Results (Average Power)
MM05 – BAPCO* Mobile Mark* 2005 benchmark
� 27% to 44% (based on the leakage of the part) reduction on Average Power as
measured by Mobile Mark – Office Productivity benchmark due to DPD feature
� Significant improvement compared to previous generation (Merom)
� Measured Exit latency for DPD state: ~ 150 - 200 us => In expected range
ActualResults
measured inPenryn
Silicon – Upto 44%
reduction inAveragePower
**Config: Data from a sample distribution of parts with various leakage
measured on tester, under respective product POR Vcc, temperature etc
Source: Intel Corp, June 2007**
Average Power (MM05* - Office Productivity)
Leakage Current
Av
era
ge
Po
we
r
Penryn with DPD Disabled
Merom (Core (tm) 2 Duo processor)
Penryn with DPD Enabled
27%
44%
16
Enhanced Dynamic Acceleration Technology (EDAT)
Concept: In multi-core CPUs, use the power headroom of idle core to boostperformance of the non-idle core
How it works:
� When one core enters an idle power C-state (CC3 or deeper) AND
� OS requests a higher performance state on the running core,
� The non-idle core is boosted up to a higher voltage, higher frequency (EDAT freq)
� Overall chip Power still remains within the specified Thermal Design Power (TDP)
Min Vcc
TDP Top Freq.
CORE 0 CORE 1
Fre
qu
en
cy
2 cores active: Guaranteed
(marked) frequency
Min Vcc
TDP Top Freq.
EDAT Bin
CORE 0 CORE 1
Single core active: EDAT freq
Min Vcc
TDP Top Freq.
Core #2 at
C3-DPD
Leakage
P = P0 + P1 = TDP SpecP = P0 + P1 <= P (TDP Spec)
Fre
qu
en
cy
Applicability: EDAT isapplicable mostly tothermally constrainedplatforms such asmobile where theguaranteed frequencyis not max Vcc limited
EDAT provides a significant and predictable single-threaded performance boost
17
Max F/V
Max F/V
EDAT Implementation Overview
F/V
Clipping
LogicEDAT F/V
Guar. F/V
OS P-state Request “P[0]”
EDAT
Disable
EDAT
Control
Logic
EDAT Logic
Core
CC-state
Hysteresis
Mechanism
Microarchitecture� Entry based on OS request AND other core idle
� Idle core defined as “CC3” or deeper C-state
� EDAT Freq pre-programmed in chip based onpower, reliability and other constraints
� Exit EDAT mode when Idle core wakes up
Hysteresis mechanism� Allows short durations where 2 cores active
� Reduces perf loss for low activity wakeups
� Implemented using a few counters
� Voltage Regulator needs to provide for this
� Benefits most at high timer tick rates
OS interface� EDAT Freq exposed as {Guaranteed + epsilon}
P-state in ACPI table: P[0]
� OS requests P[0] state if perf demand exists
� EDAT logic grants it if power headroom exists
18
EDAT Results (Performance)
EDAT Performance on SPEC CPU2000 (Estimated)
0.96
0.98
1.00
1.02
1.04
1.06
1.08
Baseline: EDAT OFF EDAT ON; Low
Interrupt rate
EDAT ON; High
Interrupt rate
EDAT ON with
Hysteresis: High
Interrupt rate
Re
lati
ve
Pe
rfo
rma
nc
e
SPECint_base2000 (estimated)
SPECfp_base2000 (estimated)
� 5% (on SPECfp_base2000 (est*)) and 7% (on SPECint_base2000 (est*)) performance
gains due to EDAT within the same TDP power envelope!**
� High timer tick rates would have reduced the gains by a few % due to frequent wake up of
idle core and resulting exits from EDAT mode
� Intelligent hysteresis mechanism in Penryn recovers most of that loss
*Estimated SPEC metrics based on measurements on preproduction
platforms. SPEC, SPECint and SPECfp are trademarks of SPEC. For
more information on these benchmarks, see www.spec.org
**Configuration used for measurement: Matanzas CRB board, Crestline chipset DDR2-667
memory, Penryn A1 silicon @ 2.4 GHz, FSB 800, OS: Windows XP SP2
Source: Intel Corp, as of June 2007*
ActualResults
measured inPenryn
Silicon – Upto 7% boost
in ST perfdue to EDAT
19
New PM feature for Servers: CC3 State
L2 Cache
Core 0
Core 1L2 Lookup
cross-coresnoop
Core0 in C1ECore0 in CC3
L2 Cache
Core 0
Core 1
L2 Cache
Core 0
Core 1
L2 Cache
Core 0
Core 1
L2 Cache
Core 0
Core 1
Chipset Memory
DRAM access
L2snoop
Coresnoop
Coresnoop
ACTIVECores IDLE Cores (CC1)
L2snoop
Coresnoop
Coresnoop
IDLE Cores (CC1)
L2snoop
IDLE Cores (CC1)IDLE Cores (CC3) IDLE Cores (CC3) IDLE Cores (CC3)
Coresnoop
Coresnoop
Concept:
� Snoops burn ~30% of active core power
� By avoiding snoops into idle cores, we save power
How it works:
� Today, idle cores are put into Core C1 (CC1), which is a snoopable state
� In Penryn, idle cores can be put into Core C3 (CC3), which is a non-snoopable state.
� First level caches are flushed into the L2 cache before putting cores into CC3
� This prevents cross core snoops and therefore the additional power burnt for snoops
CC3 is a simple but effective optimization for power savings
20
CC3 Results (Power savings)
*Configuration used for measurement: Bridgeport2, DP Clovertown 2.33 GHz, 1333 MHz
Bus, 4x2GB FB DIMM 667 MHz, OS: Linux 2.6.18.1; Workload: Specweb99; DBS OFF;
Measured by mapping CC3 as CC1 in BIOS
** Note that the power savings due to this feature will be system
configuration, OS and workload dependent.
Power savings due to CC3 in Web Server workloads
-18.0%
-16.0%
-14.0%
-12.0%
-10.0%
-8.0%
-6.0%
-4.0%
-2.0%
0.0%
2.0%
0.0% 20.0% 40.0% 60.0% 80.0% 100.0%
% CPU utilization
% Perf loss
% CPU pwr reduction
Source: Intel Corp, as of June 2007*
� Up to ~16%** CPU power reduction measured on Web Server workload with
negligible performance loss with CC3 feature enabled
� Similar power reductions seen on SPEC CPU2000 (estimated by measurement on
preproduction silicon) (single user mode), other idle power benchmarks for Server
Up to 16% CPU
powerreduction
measured onweb server
workloads with
CC3 feature
21
Moore’s Law is alive and well with Penryn;Innovative features with tangible benefits being
introduced for Performance and Power
Summary
New generation Intel® Core™ microarchitecture is leading
the way with 45 nm process technology
New microarchitecture enhancements for higher
performance and improved energy efficiency
New SSE4 instructions for improved video, imaging, and 3D
content performance
New Power management features for improved battery life,
performance and energy efficiency
22
Acknowledgements
To the entire Penryn team across Folsom, Santa Clara,
Costa Rica and Israel and the Technology Development
Group for developing and delivering this product to market
23
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across differentprocessor families. See http://www.intel.com/products/processor_number for details.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel
products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers
should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information
on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/index.htm
Intel may make changes to specifications, release dates and product descriptions at any time, without notice. Intel, Pentium and the Intel logo aretrademarks or registered trademarks of the Intel Corporation or its subsidiaries in the United States and other countries.
Legal Notices and Important InformationRegarding the performance and power measurements in this presentation
*Other names and brands may be claimed as the property of others