45nm next generation intel® core™ microarchitecture (penryn) · 2013-07-28 · 45nm next...

45nm Next Generation

Intel® Core™

Microarchitecture (Penryn)

HOT CHIPS 2007

Varghese GeorgeVarghese George

Principal Engineer, Intel Corp

2

Legal DisclaimerToday’s presentation may contain forward-looking statements. All statements made that are not historical facts are subjectto a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent EarningsRelease and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors thatcould cause actual results to differ.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE,EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTEDBY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCHPRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIEDWARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIESRELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT,COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, lifesaving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined."Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arisingfrom future changes to them.

The Intel® Core™ Microarchitecture, Intel® Pentium, Intel® Pentium II, Intel® Pentium III, Intel® Pentium 4, Intel® PentiumPro, Intel® Pentium D, Intel® Pentium M , Itanium®, Xeon® may contain design defects or errors known as errata whichmay cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your productorder.

This document contains information on products in the design phase of development. Do not finalize a design with thisinformation. Revised information will be published when the product is available. Verify with your local sales office that youhave the latest datasheet before finalizing a design.

All dates specified are target dates, are provided for planning purposes only and are subject to change.

All products, dates, and figures specified are preliminary based on current expectations, provided for planning purposesonly, and are subject to change without notice.

Intel and the Intel logo is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States andother countries.

*Other names and brands are the property of their respective owners.

Copyright © 2007, Intel Corporation

3

Outline

Next Generation 45nm Process Technology benefits

Overview of Penryn Architecture and New Features

Highlights of New Microarchitecture changes

New Power Management Features in Penryn

� Deep Power Down Technology

� Enhanced Dynamic Acceleration Technology

� CC3 in Servers

Summary

4

Next generation 45 nm Hi-K metal gate process

technology: Continuing Moore’s Law65 nm

Intel® Core™ 2 Duo143 mm2

45 nmPenryn family die (dual core)

107 mm2

4-wide Dynamic ExecutionAdvanced smart cache: 4M L2Smart Memory accessAdvanced digital media boostIntelligent power capability

Same good stuff plus:- 6M L2 cache- 47 new SSE4 instructions- Micro-arch enhancements for further IPC perf- Deep Power Down Technology*- Enhanced Dynamic Acceleration Technology*

Penryn advantages with this next generation process:

� Lower transistor switching power and leakage current => reduced idle power

=> longer battery life, quieter systems

� Up to 2x improvement in transistor density: Significant area scalability to

support additional capabilities within the existing power envelope

� Higher core and bus frequency for more performance in the same power

envelope* Mobile only features

5

Enhanced Intel® Core™ Microarchitecture

Intel Core MicroarchitectureIntel Core Microarchitecture New with the Penryn FamilyNew with the Penryn Family

Improved Store Forwarding

Higher bus speeds

Intel SSE4 instructions

Super Shuffle Engine

Deep Power Down Technology

Enhanced Intel Dynamic Acceleration Tech

Fast Radix-16 Divider

Faster OS Primitive Support

Enhanced Intel Virtualization Technology

Intel® Wide Dynamic Execution

Intel® Advanced Smart Cache

Intel® Smart Memory Access

Intel® Advanced Digital Media Boost

Intel® Intelligent Power Capability

Larger Caches: up to 12MB

24 Way Set Associativity

Increased Performance and Energy Efficiency Across Applications

Covered briefly in this presentation Focus of this presentation

6

IntelIntel®®

SSESSEIntelIntel®®

SSE4SSE4

IntelIntel®®

SSE2SSE2

1999 2000

IntelIntel®®

SSE3SSE3

2004

IntelIntel®®

SSSE3SSSE3

2006

200770

instructions

�Single-PrecisionVectors,

�streamingoperations

144instructions

�Double-precisionVectors

�128-bitvectorinteger

13instructions

�ComplexArithmetic

32instructions

� Decode47 instructions

� Video Accelerators

� Graphics building blocks

� Coprocessor Accelerators

�� Penryn supports 47 new instructionsPenryn supports 47 new instructions

�� Largest set of new instructions introduced since 2000Largest set of new instructions introduced since 2000

�� This continues a trend set by SSE, SSE2, SSE3, and SSSE3This continues a trend set by SSE, SSE2, SSE3, and SSSE3

�� Addresses some popular requests for key functionalityAddresses some popular requests for key functionality�� Features to improve the compilerFeatures to improve the compiler’’s ability to efficiently vectorize codes ability to efficiently vectorize code

�� Video Encode acceleration functionsVideo Encode acceleration functions

�� Floating-point dot product operation (3D content)Floating-point dot product operation (3D content)

�� Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)Streaming load for high b/w to WC memory (imaging, GPU-CPU sharing)

New Intel® SSE4 Instructions

7

Super Shuffle Engine

. . .

Input data (2 x 16 bytes)

. . .

Output data (16 bytes)

src1

src2

logic op

� Shuffle operations required for SSE

data formatting operations

� Unpacking

� Packing

� Align concatenated sources

� Wide shifts

� Insertion and extraction

� Setup for horizontal arithmetic

functions

� Penryn super shuffle engine

performs 128 bit operation in a

single cycle

� No software changes required

Super

Shuffle

engine

. . .logic op

. . .

B0 B1 B2 B15. . .B3 B4 B14

B0 B1 B2 B15. . .B3 B14B4

B0 B1 B2 B15. . .B3 B14B4

dest

Doubles shuffle throughput

8

New Radix-16 divider architecture

� Leading edge divider performance

� >50% speed-up over previous generations on average

� Innovative radix-16 based architecture

� Double pumped: processes 2x2 bits per cycle

� Hybrid 68-bit CSA/CPA based for faster pipelined

quotient select

� Optimized square root

Pre-processing

hybrid 68badder

QSL

hybrid 68badder

QSL

Post-processing

Quotient/remainder

Dividend Divisor

<<2

<<2

EP

square

root

DP

square

root

SP

square

root

EP

divide

DP

divide

SP divide32-bit

int

divide

Clo

cks

Penryn Previous generation

Maximum Processor Instruction Latency(lower is better)

9

Streaming Load Instruction� 16 byte Aligned load instruction on WC (write combining) memory

� Improves read bandwidth from WC mem by reading cache-line size quantities

GFX

High BW Writes

But … Low BW Reads

Graphics Mem

GFX

DD

R3

DD

R3

Fast Streaming Reads!

800 MB/s (partial reads) 800 MB/s (partial reads)

Up to 6.4 GB/s (cache line reads)Up to 6.4 GB/s (cache line reads)

Wit

h S

tream

ing

Lo

ad

sExis

tin

gP

latf

orm

s

45nm Next

Generation

Core ™

Core ™

Streaming Load is ~8X faster reading from WC Memory

Temp Buffer inside

L1 Data Cache

A B C D

L2 Cache

Bus

4th Access3rd Access

2nd Access1st Access

(Data usage in processor core)

DD

R3

DD

R3

10

Penryn builds on the capabilities in the Core™ microarchitecture

� Fine grained clock gating during runtime: Applied all across the chip

� Sleep transistors: In large arrays such as L2 cache to reduce leakage power

� C-states: Various low power states during CPU idle

� P-states: Multiple performance states for run-time energy efficiency

� Thermal Monitor: Self-managed thermal control based on multiple thermal sensors

� Digital Thermal Sensor: For software access to CPU temperature

� Intelligent Voltage Regulator control: To optimize VR efficiency during low loads

� PECI interface: For side-band access to temperature for fan control

Key New Mobile Power/Thermal Features in Penryn

� Deep Power Down Technology (DPD): A radically new and innovative

idle power management state

� Enhanced Dynamic Acceleration Technology (EDAT): A deterministic

and (TDP) power-neutral performance boost for Single Threaded apps

Power Management Overview

11

Deep Power Down Technology Overview

Vcc: << Retention

Core clk: OFF

PLL: OFF

Caches: OFF

State in DPDState in DPDFlexible C-States to Select Idle Power Level vs. Responsiveness

Core voltage*

Core clock

PLL

L1 caches

L2 cache

Wakeup time*

Idle power*

C0 C1 C3 C4

off

Active state

off

f lushedflushed

off

off

f lushedflushed

off

partial flushpartial flush

active

* Rough approximation

Deep Power Down

technology state

off

off

off

off

New

fo

r Penry

n

Deep Power Down

technology state

off

off

off

off

New

fo

r Penry

n

Concept: When idle, required CPU state is saved in a power-preserved domain onchip and then power is shut off

Exit: Upon a break event, Power/Clocks are turned on, State is restored andexecution resumes seamlessly from where it left off transparent to OS / applications

� Can occur few 100s of times per second (between keystrokes)

DPD enables reaching lower limit of CPU idle power of 0 W

� Lowest Leakage CPUstate you can get to!

� Agnostic to min Vccstate retention issues

Key benefits of DPDKey benefits of DPD

12

Penryn DPD Implementation Overview

8KB SRAMs

4 Major parts:

STATE STORAGE:

� 8KB (per core)

� Powered from I/O Vcc (VccP)

� ECC protected

STATE DEFINITION:

� What to include?

� Criteria: “Software seamless”

� Inclusions:� All Architectural state

� Most micro-architectural state

� Exclusions:� Temp registers used by ucode

� Some others on a case by case basis

VccP

VccP

MICROCODE:

� Does state save and restore

� Manages core synchronization

� Supports Auto-demote policy

PM UNIT (“hardware”):

� Coordinates cores at hardware level

� Manages the DPD Power up sequence

� Manages entry/exit protocol with platform

13

Deep Power Down Technology Entry/Exit

OS: Decides

to idle the

processors

OS: execute

MWAIT

instruction

CPU: shrink L2;

save uarch

state

CPU: signal

chipset to

enter DPD

Deep power

down state

Interrupt

break

event

CPU: Internal

RESET,

PLL relock etc

CPU: restore

Arch/uarch state,

Expand L2

CPU: continue

execution to

next instr.

Chipset

VRM

I/O

requests

FSB I/FVID control

Vcc

VccP

C/S: stops

CPU clk, blk I/O

CPU: VRM dn

� S/W instruction initiates processor DPD entry

� CPU does rest of sequencing with chipset/platform

� Protocol with chipset to block snoops due to DMAtraffic (no CPU wakeup required) while in DPD state

� Exit initiated by an break event in platform (interrupt)through the chipset

� CPU sequences through rest of exit sequence –including driving VID to VRM, internal hardwarereset, state restore and execution resumption

C/S: signals

CPU wakeup,

VRM ramp, bclk

Small amout

of logic is left

powered

up by VccP

during DPD

VccP

14

Interrupt Rate Sensitivity of Average Power

Deeper C-states have non-trivial energy cost for transitions

Today, OS makes C-state decisions based on latency, not energy cost

� Latency could be ok, but energy cost could be significantly different

Too frequent transitions into deep C-states could result in net energy loss

� Worse battery life

Penryn implements a “Auto-

demote” policy in the CPU to

address this issue� Heuristics to detects when energy

cost of DPD state entry and exit is

potentially > savings in DPD state

� Demotes such cases to shallower

C-state (C4), although the OS

requested DPD

� Silicon results prove that this was a

good decision!

Limit

Avera

ge P

ow

er

Example of Average Power MM02

1ms0.5ms 3ms2ms 5ms4ms 15ms10ms 50ms20ms

Interrupt Rate

with Deep Power Down technologywith C4

Deep Power Down crossover point: ~3-4ms

Source: Intel

15

DPD Results (Average Power)

MM05 – BAPCO* Mobile Mark* 2005 benchmark

� 27% to 44% (based on the leakage of the part) reduction on Average Power as

measured by Mobile Mark – Office Productivity benchmark due to DPD feature

� Significant improvement compared to previous generation (Merom)

� Measured Exit latency for DPD state: ~ 150 - 200 us => In expected range

ActualResults

measured inPenryn

Silicon – Upto 44%

reduction inAveragePower

**Config: Data from a sample distribution of parts with various leakage

measured on tester, under respective product POR Vcc, temperature etc

Source: Intel Corp, June 2007**

Average Power (MM05* - Office Productivity)

Leakage Current

Av

era

ge

Po

we

r

Penryn with DPD Disabled

Merom (Core (tm) 2 Duo processor)

Penryn with DPD Enabled

27%

44%

16

Enhanced Dynamic Acceleration Technology (EDAT)

Concept: In multi-core CPUs, use the power headroom of idle core to boostperformance of the non-idle core

How it works:

� When one core enters an idle power C-state (CC3 or deeper) AND

� OS requests a higher performance state on the running core,

� The non-idle core is boosted up to a higher voltage, higher frequency (EDAT freq)

� Overall chip Power still remains within the specified Thermal Design Power (TDP)

Min Vcc

TDP Top Freq.

CORE 0 CORE 1

Fre

qu

en

cy

2 cores active: Guaranteed

(marked) frequency

Min Vcc

TDP Top Freq.

EDAT Bin

CORE 0 CORE 1

Single core active: EDAT freq

Min Vcc

TDP Top Freq.

Core #2 at

C3-DPD

Leakage

P = P0 + P1 = TDP SpecP = P0 + P1 <= P (TDP Spec)

Fre

qu

en

cy

Applicability: EDAT isapplicable mostly tothermally constrainedplatforms such asmobile where theguaranteed frequencyis not max Vcc limited

EDAT provides a significant and predictable single-threaded performance boost

17

Max F/V

Max F/V

EDAT Implementation Overview

F/V

Clipping

LogicEDAT F/V

Guar. F/V

OS P-state Request “P[0]”

EDAT

Disable

EDAT

Control

Logic

EDAT Logic

Core

CC-state

Hysteresis

Mechanism

Microarchitecture� Entry based on OS request AND other core idle

� Idle core defined as “CC3” or deeper C-state

� EDAT Freq pre-programmed in chip based onpower, reliability and other constraints

� Exit EDAT mode when Idle core wakes up

Hysteresis mechanism� Allows short durations where 2 cores active

� Reduces perf loss for low activity wakeups

� Implemented using a few counters

� Voltage Regulator needs to provide for this

� Benefits most at high timer tick rates

OS interface� EDAT Freq exposed as {Guaranteed + epsilon}

P-state in ACPI table: P[0]

� OS requests P[0] state if perf demand exists

� EDAT logic grants it if power headroom exists

18

EDAT Results (Performance)

EDAT Performance on SPEC CPU2000 (Estimated)

0.96

0.98

1.00

1.02

1.04

1.06

1.08

Baseline: EDAT OFF EDAT ON; Low

Interrupt rate

EDAT ON; High

Interrupt rate

EDAT ON with

Hysteresis: High

Interrupt rate

Re

lati

ve

Pe

rfo

rma

nc

e

SPECint_base2000 (estimated)

SPECfp_base2000 (estimated)

� 5% (on SPECfp_base2000 (est*)) and 7% (on SPECint_base2000 (est*)) performance

gains due to EDAT within the same TDP power envelope!**

� High timer tick rates would have reduced the gains by a few % due to frequent wake up of

idle core and resulting exits from EDAT mode

� Intelligent hysteresis mechanism in Penryn recovers most of that loss

*Estimated SPEC metrics based on measurements on preproduction

platforms. SPEC, SPECint and SPECfp are trademarks of SPEC. For

more information on these benchmarks, see www.spec.org

**Configuration used for measurement: Matanzas CRB board, Crestline chipset DDR2-667

memory, Penryn A1 silicon @ 2.4 GHz, FSB 800, OS: Windows XP SP2

Source: Intel Corp, as of June 2007*

ActualResults

measured inPenryn

Silicon – Upto 7% boost

in ST perfdue to EDAT

19

New PM feature for Servers: CC3 State

L2 Cache

Core 0

Core 1L2 Lookup

cross-coresnoop

Core0 in C1ECore0 in CC3

L2 Cache

Core 0

Core 1

L2 Cache

Core 0

Core 1

L2 Cache

Core 0

Core 1

L2 Cache

Core 0

Core 1

Chipset Memory

DRAM access

L2snoop

Coresnoop

Coresnoop

ACTIVECores IDLE Cores (CC1)

L2snoop

Coresnoop

Coresnoop

IDLE Cores (CC1)

L2snoop

IDLE Cores (CC1)IDLE Cores (CC3) IDLE Cores (CC3) IDLE Cores (CC3)

Coresnoop

Coresnoop

Concept:

� Snoops burn ~30% of active core power

� By avoiding snoops into idle cores, we save power

How it works:

� Today, idle cores are put into Core C1 (CC1), which is a snoopable state

� In Penryn, idle cores can be put into Core C3 (CC3), which is a non-snoopable state.

� First level caches are flushed into the L2 cache before putting cores into CC3

� This prevents cross core snoops and therefore the additional power burnt for snoops

CC3 is a simple but effective optimization for power savings

20

CC3 Results (Power savings)

*Configuration used for measurement: Bridgeport2, DP Clovertown 2.33 GHz, 1333 MHz

Bus, 4x2GB FB DIMM 667 MHz, OS: Linux 2.6.18.1; Workload: Specweb99; DBS OFF;

Measured by mapping CC3 as CC1 in BIOS

** Note that the power savings due to this feature will be system

configuration, OS and workload dependent.

Power savings due to CC3 in Web Server workloads

-18.0%

-16.0%

-14.0%

-12.0%

-10.0%

-8.0%

-6.0%

-4.0%

-2.0%

0.0%

2.0%

0.0% 20.0% 40.0% 60.0% 80.0% 100.0%

% CPU utilization

% Perf loss

% CPU pwr reduction

Source: Intel Corp, as of June 2007*

� Up to ~16%** CPU power reduction measured on Web Server workload with

negligible performance loss with CC3 feature enabled

� Similar power reductions seen on SPEC CPU2000 (estimated by measurement on

preproduction silicon) (single user mode), other idle power benchmarks for Server

Up to 16% CPU

powerreduction

measured onweb server

workloads with

CC3 feature

21

Moore’s Law is alive and well with Penryn;Innovative features with tangible benefits being

introduced for Performance and Power

Summary

New generation Intel® Core™ microarchitecture is leading

the way with 45 nm process technology

New microarchitecture enhancements for higher

performance and improved energy efficiency

New SSE4 instructions for improved video, imaging, and 3D

content performance

New Power management features for improved battery life,

performance and energy efficiency

22

Acknowledgements

To the entire Penryn team across Folsom, Santa Clara,

Costa Rica and Israel and the Technology Development

Group for developing and delivering this product to market

23

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across differentprocessor families. See http://www.intel.com/products/processor_number for details.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel

products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers

should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information

on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/index.htm

Intel may make changes to specifications, release dates and product descriptions at any time, without notice. Intel, Pentium and the Intel logo aretrademarks or registered trademarks of the Intel Corporation or its subsidiaries in the United States and other countries.

Legal Notices and Important InformationRegarding the performance and power measurements in this presentation

*Other names and brands may be claimed as the property of others

45nm next generation intel® core™ microarchitecture (penryn) · 2013-07-28 · 45nm next...

Documents