01 intel processor architecture core

Intel® Core™ Microarchitecture

Intel® Software College


Copyright © 2006, Intel Corporation. All rights reserved.

Intel® Processor Micro-architecture - Core® microarchitecture

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.


Objectives

After completion of this module you will be able to describe

• Components of an IA processor

• Working flow of the instruction pipeline

• Notable features of the architecture






Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

Coding considerations






Agenda

Introduction


Notable features







Industrial Recognition

PC Format, May 2006
“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”

Intel Regains Performance Crown, Anandtech
“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”

Intel Reveals Conroe Architecture, Extremetech
“… And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”

Conroe Benchmarks - Intel Showing Big Strength, HotHardware.com
“… Intel is poised to change the face of the desktop computing landscape…”

Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com
“… the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”

Intel's Next Generation Microarchitecture Unveiled, Real World Tech
“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”






Performance Summary

Intel® Core™ Microarchitecture dramatically boosts Intel platform performance

• Conroe & Woodcrest drive clear Desktop/Server performance leadership

• Merom extends Intel Mobile performance leadership

Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era

• Intel’s 3rd generation dual-core (while the competition is stuck on its 1st generation)

• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient

The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations

Best Processor on the Planet: Energy-Efficient Performance (1)

20% (Merom), 40% (Conroe), 80% (Woodcrest) performance boosts (1)

(1) Based on SPECint*_rate_base2000






Agenda

Introduction

Knowledge preparation

• Architecture vs. Microarchitecture

• CISC vs. RISC

• Performance Measurements

• Pipeline Design

• Power and Energy

• Chip Multi-Processing

Notable features








Architecture and Micro-architecture

What is Computer Architecture?

• Architecture is the set of features which are externally visible:

• Instruction set

• Registers

• Addressing modes

• Bus protocols

Intel Architectures (IA)

• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)

• X87 (Floating Point extension)

• MMX (Multi-Media extension)

• SSE, SSE2, SSE3 (Streaming SIMD Extensions)

• Intel® 64/EM64T (64-bit Integer extension of IA32)

• IA64 (Intel's new 64-bit architecture)

• Itanium/Itanium 2 processor family







Architecture and Micro-architecture (cont.)

What is Micro-architecture?

• Also written µ-architecture or u-architecture

• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)

• Programs run faster → Improved performance

• Reduced power consumption → Extended battery life

• H/W fits into a smaller form factor






Intel® Architecture History

Architecture: Instruction set definition and compatibility

Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)

Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture

Examples: P5, P6, Intel NetBurst®, Banias

Processors: Productized implementation of Microarchitecture

Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M

* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing






Mobile Microarchitecture

Intel® NetBurst®

+ New Innovations

Intel® Core™ Microarchitecture Processors

Intel® Core™ 2 Duo/Quad/Extreme processors






RISC Approach to CPU design

Optimize H/W for common basic operations

• Fixed instruction length

• Shorter Execution Pipeline

• Ease of Instruction Level Parallelism

• Large number of registers

• Fewer memory accesses

• ‘Load/Store’ architecture

• Shorter Execution Pipeline

• Ease of advancing Loads

• Branch Hints

• Reduce pipeline flush events

• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support

• No ‘complex’ H/W instructions

• Handle exceptional conditions in S/W

Examples: MIPS, IBM Power and PowerPC, Sun SPARC

Achieve Maximum performance by right partitioning between H/W and S/W

(RISC = Reduced Instruction Set Computers)






CISC Approach to CPU design

Rich architecture

• Variable length instructions.

• Complex addressing modes.

On-chip HW / SW partitioning required

• H/W keeps executing ‘simple’ stuff

• Complex instructions are ‘emulated’ using u-code routines from ROM

• More instructions treated as ‘simple’ as more H/W is available

COMPATIBILITY has some major advantages:

• Large (and forever increasing) software base

• Code development tools

• Expertise

• H/W - S/W spiral

Example: Intel IA32, Motorola 680X0

Maximize information passed to the HW

(CISC = Complex Instruction Set Computers)






Performance Measurement

Performance is the reciprocal of the “Time of execution”:

Performance = 1 / (Time of Execution) ≈ 1 / (L * CPI * Tc)

Where:

L = Code Length (# of machine instructions)

CPI = Clock cycles Per Instruction

Tc = Clock period (nSecs)

Substitute:

IPC = Instructions Per Cycle = 1/CPI

F = Frequency = 1/Tc

Performance ≈ (IPC * F) / L

Levers: Improve Timing (F), Arch Enhancements (L), Improve ILP (IPC)
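As a minimal sketch, the two forms of the performance equation can be checked against each other numerically. The instruction count, IPC values and clock frequencies below are made-up illustration numbers, not measurements of any processor, and the function names are hypothetical.

```python
def execution_time_ns(code_length, cpi, clock_period_ns):
    """Time of execution = L * CPI * Tc (Tc in nanoseconds)."""
    return code_length * cpi * clock_period_ns


def performance(code_length, ipc, frequency_ghz):
    """Performance ~ IPC * F / L (runs per nanosecond when F is in GHz)."""
    return ipc * frequency_ghz / code_length


L = 1_000_000  # hypothetical instruction count, identical for both designs

# Hypothetical design A: higher clock, lower IPC.
baseline = performance(L, ipc=1.0, frequency_ghz=3.0)
# Hypothetical design B: lower clock, higher IPC.
wider = performance(L, ipc=1.5, frequency_ghz=2.66)

speedup = wider / baseline
print(f"speedup of B over A: {speedup:.2f}x")
```

With these assumed numbers the design with the slower clock still wins, because the IPC gain outweighs the frequency loss.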






Performance Measurement (cont.)

Performance considerations:

• Which Code/Application to run?

• Which OS?

• Which other components in the platform?

• Under which thermal conditions?

• Multithreading? Multiprocessing?

Benchmarks examples

• Industry Standard

• SPEC (SPECint, SPECfp)

• TPC

• Commercial

• SysMark

• MobileMark

• PCMark

• Sandra

• ScienceMark

• Applications

• Video (Windows Media encoder, DivX)

• Audio (Lame MP3)

• Compression (RAR)

• Content creation (3DSM, Photoshop, Premiere)

• Latest Games (Doom III, FarCry, but changes fast)

• Specific industries use specific benchmarks

• Linux compilation, POVRay, LinPack, lmbench






Design Considerations for Different Market Segments

Constraints:

• Thermally and area constrained → Desktop

• Unconstrained → Extreme

• Very area constrained → Value

• Thermally, energy and area constrained → Mobile

• Thermally and energy constrained → Servers

Micro-architecture is the Art of Tradeoffs between:

• Schedule

• Requirements / Standards

• Performance

• Features

• Power / Energy

• Area / Cost






Design Metrics

IPC = Instructions per Cycle

• The more the better

Latency – same as Response Time

• The time interval between

• when any request for data is made and

• when the data transfer completes

• The less the better

Throughput

• The amount of work completed by the system per unit of time.

• The more the better

• ops/sec






CPU Pipeline

Break the work to smaller pieces

• Four basic stages of instruction life

• Fetch - bring instruction to core

• Decode - read operands from register

• Execute - perform the operation

• Writeback - save result to register

• Execution timing of simple instructions

(legend: “op src1,src2 → dst”)

add eax, ebx → eax   F D E W

sub ecx, edx → ecx   F D E W

Increased throughput

• increased number of completed instructions per cycle






Pipeline Design - Explore Parallelism

A new instruction does not always depend on the previous one

• Can start new instruction before previous one is finished

• ...if different stages use different H/W resources

Run instructions in parallel (pipeline)

add eax, ebx → eax   F D E W
sub ecx, edx → ecx     F D E W
or  edi, esi → edi       F D E W

Need to balance pipe stages

• Each stage should take same time for best throughput and utilization

[Figure: pipe stages Fetch | Decode | Exec | WB, shown for one instruction and for overlapping instructions]

Clock cycle is determined by the longest path!
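The throughput argument above can be sketched with a few lines of arithmetic. This is an idealized model that assumes no stalls and one instruction entering the pipe per cycle; the stage delays are hypothetical numbers chosen only to show why unbalanced stages hurt.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: the first instruction needs n_stages cycles,
    every later instruction completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def clock_period(stage_delays_ns):
    """The clock must be slow enough for the longest stage."""
    return max(stage_delays_ns)


# Three instructions (add, sub, or) through Fetch, Decode, Exec, WB.
print(unpipelined_cycles(3, 4))  # 12
print(pipelined_cycles(3, 4))    # 6

# Hypothetical stage delays: balanced vs. one slow Exec stage.
balanced = clock_period([1.0, 1.0, 1.0, 1.0])
unbalanced = clock_period([0.5, 0.5, 2.5, 0.5])
```

Both stage sets add up to 4 ns of work per instruction, yet the unbalanced pipe needs a 2.5 ns clock where the balanced one runs at 1 ns.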






Pipeline Design – Fighting Stalls

Data flow dependency (instructions output/input)

• Solved by bypasses, renaming etc

Control flow dependencies

• Solved by branch prediction

Others (Cache misses, long latency instructions)

• Solved by other dynamic scheduling techniques







Race of CISC vs. RISC

In modern CPUs Advanced µ-Architecture Techniques minimize the advantages of RISC over CISC

• Branch Prediction

• Reduces the effect of extra pipeline stages

• Register Renaming

• Effectively Increase the Number of Registers

• Out Of Order

• Reduce Number of stalls caused by shortage of registers

• Speculative Execution

• Further Reduce Number of stalls

• Power saving features

• Reduce the overhead when not needed.






µop – Intel’s Take on the CISC/RISC Race

(CISC) instructions are translated into one or more (RISC) micro-operations (uops)

• Fixed format

• Wide and simple

• Temp registers

Usually one uop per instruction

A complex instruction can expand into thousands of uops

Stores are divided into two uops: store address (STA) and store data (STD)

Fusion also plays a part here






Power and Energy

Maximum power (TDP):

• → Cooling requirements

• → Cooling solution

• → Computer form factor and acoustic noise

Average power

• → Battery life

• → Electricity bill

General calculation:

• P = frequency * voltage^2 * activity factor * capacitance + leakage

Reducing TDP

• Fewer transistors and wires

• Smaller transistors and wires

• Power features → less activity

• Low leakage transistors

Reducing average power

• Energy efficiency

• Power states

• Lower leakage
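A small numeric sketch of the general calculation above, P = frequency * voltage^2 * activity factor * capacitance + leakage. All parameter values are hypothetical and chosen for round numbers; they do not describe any real processor.

```python
def power_watts(frequency_hz, voltage_v, activity_factor, capacitance_f, leakage_w):
    """P = f * V^2 * activity * C + leakage, following the slide's formula."""
    dynamic = frequency_hz * voltage_v ** 2 * activity_factor * capacitance_f
    return dynamic + leakage_w


# Hypothetical operating point: 2 GHz, 1.0 V, 50% activity, 20 nF, 5 W leakage.
nominal = power_watts(2.0e9, 1.0, 0.5, 20e-9, 5.0)

# Same chip with frequency and voltage both scaled down by 20%.
# Leakage is held constant here as a simplification.
scaled = power_watts(1.6e9, 0.8, 0.5, 20e-9, 5.0)

print(f"nominal: {nominal:.2f} W, scaled: {scaled:.2f} W")
```

Because voltage enters squared, scaling frequency and voltage together cuts the dynamic term roughly with the cube of the scaling factor, which is why power states that lower both are so effective for average power.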






Dual/Multi Core and SMT

Put more than one core per package

Architectural change:

• Software must be multi-threaded or multi-process

• …but backward compatible with multiprocessor systems (MP)

Several ways of implementing it

• All of them being used

[Figure: several package layouts built from Core, LLC and I/O blocks]

SMT: Run two (or more) threads on the same core, simultaneously






Intel Approach

While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread-level parallelism.

Date | Processor | Threads
Q4 2000 | Intel® Pentium® | 1
Q2 2003 | Intel® Pentium® with HT | 2
Q2 2005 | Intel® Pentium® D Processor | 2
Q3 2006 | Intel® Core™ 2 Duo | 2
Q4 2006 | Intel® QX6700* | 4
? | ? | 80

Legend: State, Execution Units, Cache, Bus






An “Acronym Cheat Sheet” of Parallel Computing

CMP: Chip Multi Processor (two or more cores per package)

• Dual Core: two cores in same package

• Quad Core: four cores in same package

DP: Dual Processor (two packages)

MP: Multi Processor (four or more packages)

SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)






Agenda

Introduction


Notable features

• Wide Dynamic Execution

• Smart Memory Access

• Advanced Smart Cache

• Advanced Digital Media Boost

• Intelligent Power Capability








Intel® Core® Micro-architecture Notable Features

Intel® Wide Dynamic Execution

• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
  • Roughly 15% of all instructions are conditional branches
  • Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
• Micro-fusion
  • Merges the load and operation micro-ops into one micro-op
• 64-Bit Support
  • Merom, Conroe, and Woodcrest support EM64T

[Figure: pipeline block diagram. Instruction Fetch and PreDecode → Instruction Queue → Decode (with uCode ROM) → Rename/Alloc → Retirement Unit (ReOrder Buffer) → Schedulers → three execution ports (ALU/Branch/MMX/SSE/FPmove, ALU/FAdd/MMX/SSE/FPmove, ALU/FMul/MMX/SSE/FPmove) plus Load and Store → L1 D-Cache and D-TLB → 2M/4M shared L2 cache → FSB, up to 10.4 Gb/s. Path widths are labelled 4, 4 and 5.]






Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Memory Access

• Improved prefetching

• Memory disambiguation

• Advance load before a possible data dependency (pointer conflict)

• Earlier loads hide memory latencies







Intel® Advanced Smart Cache

• Multi-core optimization

• Shared between the two cores

• Advanced Transfer Cache architecture

• Reduced bus traffic

• Both cores have full access to the entire cache

• Dynamic Cache sizing






Intel® Core® Micro-architecture Notable Features (cont.)

Advantages of Shared Cache

[Figure: CPU1 and CPU2 exchange a cache line over the Front Side Bus (FSB), which also connects them to Memory]

Shipping an L2 cache line: ~half an access to memory






Intel® Core® Micro-architecture Notable Features (cont.)

Advantages of Shared Cache (cont.)

[Figure: CPU1 and CPU2 share one L2 cache holding the cache line; Memory sits behind the Front Side Bus (FSB)]

L2 is shared: no need to ship the cache line







Intel® Advanced Digital Media Boost

• Single Cycle SIMD Operation

• 8 Single Precision Flops/cycle

• 4 Double Precision Flops/cycle

• Wide Operations

• 128-bit packed Add

• 128-bit packed Multiply

• 128-bit packed Load

• 128-bit packed Store

• Support for Intel® EM64T instructions

[Figure: SIMD operation (SSE/SSE2/SSE3/SSSE3) on 128-bit operands, bits 127..0. SOURCE lanes X4..X1 and Y4..Y1 produce DEST lanes X4opY4..X1opY1. Previous microarchitectures need clock cycle 1 and clock cycle 2 to produce the four results; the Core™ µarch produces all four in clock cycle 1.]






Intel® Core® Micro-architecture Notable Features

Intel® Advanced Digital Media Boost

• Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)

• 16 new packed integer instructions

• Targeting video encode/decode

• Significantly improved strings

• REP MOVS and REP STOS

• ~8 bytes / cycle throughput

• mileage may vary






Intel® Core® Micro-architecture Notable Features

Intel® Advanced Digital Media Boost

• Supplemental SSE3 (SSSE3)

Operation | Instructions
Packed Sign | PSIGNB/W/D
Packed Shuffle Bytes | PSHUFB
Packed Multiply High with Round and Scale | PMULHRSW
Multiply and Add Packed Signed/Unsigned Bytes | PMADDUBSW
Packed Align Right | PALIGNR
Packed Absolute Values | PABSB, PABSW, PABSD
Horizontal Addition/Subtraction | PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD







Intelligent Power Capability

• Advanced power gating & Dynamic power coordination

• Multi-point demand-based switching

• Voltage-Frequency switching separation

• Supports transitions to deeper sleep modes

• Event blocking

• Clock partitioning and recovery

• Dynamic Bus Parking

• During periods of high performance execution, many parts of the chip core can be shut off






Agenda

Introduction


Notable features


• Front End

• Out-Of-Order Execution Core

• Memory Sub-system







Intel® Core® Micro-architecture Drill-down

[Figure: block diagram. Front end: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS. Execution core: register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution units (integer, FP, SIMD; 3x), load, store address, store data. Memory: memory order buffer, data cache unit, page miss handler.]






Agenda

Introduction

Knowledge refreshment

Notable features


• Front End









Core® Micro-architecture Front End

Prepares instructions before they are executed

• Instruction Fetch Unit

• Instruction Queue

• Instruction Decode Unit

• Branch Prediction Unit

[Figure: front-end blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]






Instruction Queue

Intel® Core™ Microarchitecture – Front End

Buffer between instruction pre-decode unit and decoder

• up to six predecoded instructions written per cycle

• 18 Instructions contained in IQ

• up to 5 Instructions read from IQ

Potential Loop cache

Loop Stream Detector (LSD) support

• Re-use of decoded instruction

• Potential power saving






Macro - Fusion

Roughly 15% of all instructions are conditional branches.

Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

Not supported in EM64T long mode

Intel® Core™ Microarchitecture – Front End

[Figure: the fused uop “cmpjae eax, [mem], label” passes through the Scheduler to Execution (Branch Eval), which sends flags and target to write back]






Macro-Fusion Absent

• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops

Enabling Example

for (int i=0; i<100000; i++) {
…
}

cmp eax, 100000
jge label

Instruction Queue:

addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label

Cycle 1: dec0 to dec3 decode addps, mulps, movps and cmp
Cycle 2: dec0 decodes jge label






Macro-Fusion Present

• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions

Enabling Example

for (unsigned int i=0; i<100000; i++) {
…
}

cmp eax, 100000
jae label

Cycle 1: dec0 to dec3 decode the queued instructions, with one decoder taking the fused pair “cmpjae eax, 100000, label”
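The two decode diagrams can be approximated with a toy model. It assumes four decoders that each take one instruction (or one fused pair) per cycle and ignores uop counts. The set of jump conditions treated as fusible is an assumption made for illustration, motivated by the slides contrasting the signed (jge) and unsigned (jae) loops; the names below are hypothetical.

```python
# Assumption for this sketch: CMP fuses only with these unsigned/equality jumps.
FUSIBLE_JUMPS = {"ja", "jae", "jb", "jbe", "je", "jne"}


def fuse(mnemonics):
    """Merge each 'cmp' that is directly followed by a fusible jump."""
    fused = []
    i = 0
    while i < len(mnemonics):
        nxt = mnemonics[i + 1] if i + 1 < len(mnemonics) else None
        if mnemonics[i] == "cmp" and nxt in FUSIBLE_JUMPS:
            fused.append("cmp" + nxt)  # e.g. "cmpjae"
            i += 2
        else:
            fused.append(mnemonics[i])
            i += 1
    return fused


def decode_cycles(mnemonics, decoders=4):
    """Cycles needed when each decoder handles one slot per cycle."""
    slots = len(fuse(mnemonics))
    return -(-slots // decoders)  # ceiling division


signed_loop = ["addps", "mulps", "movps", "cmp", "jge"]
unsigned_loop = ["addps", "mulps", "movps", "cmp", "jae"]

print(fuse(unsigned_loop))           # ['addps', 'mulps', 'movps', 'cmpjae']
print(decode_cycles(signed_loop))    # 2
print(decode_cycles(unsigned_loop))  # 1
```

In this model the only difference between the two loops is the type of the loop counter, which selects jge or jae and so decides whether the pair fits into a single decode slot.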







Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline







Instruction Decode / Micro-Fusion (cont.)

u-ops of a Store “movps [EAX+240], xmm0”

• Without micro-fusion, two u-ops: “sta eax+240” and “std xmm0, [eax+240]”
• With micro-fusion, one u-op: “st xmm0, [eax+240]”






Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:

• Indirect Branch Predictor
• Loop Detector

Branch mispredictions reduced by >20%







Agenda

Introduction


Notable features


• Front End









Core® Micro-architecture Execution Core

Accepts decoded u-ops, assigns resources, executes and retires u-ops

• Renamer

• Reservation station (RS)

• Issue ports

• Execution Unit

[Figure: execution-core blocks: register alias table, Reservation Station, execution units (integer, FP, SIMD; 3x), load, store address, store data]






Execution Core Building Blocks

Block | Ports (number)
Memory Sub-system | 2 Load; 3,4 Store
SIMD Integer, SIMD/Integer MUL | 0,1,5
Integer | 0,1,5
Floating Point | 0,1,5

Other blocks shown: Execution Unit, ROB, Renamer, RS

Intel® Core™ Microarchitecture – Execution Core






Issue Ports and Execution Units

6 dispatch ports from RS

• 3 execution ports

• (shared for integer / fp / simd)

• load

• store (address)

• store (data)

128-bit SSE implementation

• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)

• Port 1 has packed add (3 cycles all precisions)







Retirement Unit

ReOrder Buffer (ROB)

• Holds micro-ops in various stages of completion

• Buffers completed micro-ops

• updates the architectural state in order

• manages ordering of exceptions








Agenda

Introduction


Notable features


• Front End









Core® Micro-architecture Memory Sub-System

Memory Ordering Buffer

• Store Address Buffer
  • Stores the address of each store not actually performed
  • Loads compare their address to any store older than themselves
  • If it finds a hole…
• Store Data Buffer
  • Stores data of each store not actually performed
  • If a load hits in the SAB, it forwards the data from here
• Load Buffer
  • Stores address of non-retired loads
  • For snoops and re-dispatch

• One 128-bit load and one 128-bit store per cycle to different memory locations

• Out of order Memory operations






Core® Micro-architecture Memory Sub-System (cont.)

32k D-Cache (8-way, 64 byte line size)

Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache

Cache to cache transfer

• improves producer / consumer style MP

Wider interface to L2

• reduced interference

• processor line fill is 2 cycles

Higher bandwidth from the L2 cache to the core

• ~14 clock latency and 2 clock throughput

Load & Store Access order

1. L1 cache of immediate core

2. L1 cache of the other core

3. L2 cache

4. Memory

[Figure: Core1 and Core2 share a 2 MB L2 cache connected to the Bus]

Intel® Core™ Microarchitecture – Memory Sub-system






Advanced Memory Access / Enhanced Data Pre-fetch Logic

Speculates the next needed data and loads it into cache by HW and/or SW

Main Parking Lot (External Memory)

Valet Parking Area (L2 Cache)


Door(L1 Cache)






Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)

• L1D cache prefetching
  • Data Cache Unit Prefetcher
    • Known as the streaming prefetcher
    • Recognizes ascending access patterns in recently loaded data
    • Prefetches the next line into the processor’s cache
  • Instruction Based Stride Prefetcher
    • Prefetches based upon a load having a regular stride
    • Can prefetch forward or backward 2 Kbytes
    • 1/2 default page size
• L2 cache prefetching: Data Prefetch Logic (DPL)
  • Prefetches data to the 2nd level cache before the DCU requests the data
  • Maintains 2 tables for tracking loads
    • Upstream – 16 entries
    • Downstream – 4 entries
  • Every load is either found in the DPL or generates a new entry
  • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load
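To make the idea of a "regular stride" concrete, here is a toy stride detector. It is only a sketch of the general technique, not the hardware algorithm: the rule that two equal consecutive strides trigger a prefetch is an assumption of this example, and only the 2 Kbyte forward/backward limit is taken from the slide.

```python
def next_prefetch(addresses, max_distance=2048):
    """Return the address a toy stride prefetcher would fetch next, or None.

    addresses: addresses touched by successive executions of one load.
    Assumption: a prefetch is issued once the last two strides are equal,
    non-zero, and no larger than max_distance bytes in either direction.
    """
    if len(addresses) < 3:
        return None
    stride = addresses[-1] - addresses[-2]
    previous_stride = addresses[-2] - addresses[-3]
    if stride == 0 or stride != previous_stride:
        return None
    if abs(stride) > max_distance:
        return None
    return addresses[-1] + stride


forward = next_prefetch([0x1000, 0x1040, 0x1080])    # stride +64
backward = next_prefetch([0x2000, 0x1F00, 0x1E00])   # stride -256
irregular = next_prefetch([0x1000, 0x1040, 0x10C0])  # strides differ
too_far = next_prefetch([0x0000, 0x1000, 0x2000])    # stride 4096 > 2 KB
```

Pointer-chasing code produces the irregular case on almost every access, which is one reason data laid out in consecutive lines is easier for hardware prefetchers to follow.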







Advanced Memory Access / Memory Disambiguation


Memory Disambiguation predictor

• Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible

• increasing the performance of OOO memory pipelines

Disambiguated loads checked at retirement

• Extension to existing coherency mechanism

• Invisible to software and system






Advanced Memory Access / Memory Disambiguation Absent

Load4 must WAIT until previous stores complete

Program order: Store1 Y, Load2 Y, Store3 W, Load4 X

[Figure: Memory holding Data W, Data X, Data Y, Data Z]






Advanced Memory Access / Memory Disambiguation Present

Loads can decouple from stores

Load4 can get its data WITHOUT waiting for stores

Program order: Store1 Y, Load2 Y, Store3 W, Load4 X

[Figure: Memory holding Data W, Data X, Data Y, Data Z]






Advanced Memory Access / Stores Forwarding

If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load


[Figure: Store1 Y writes into the Internal Buffers on its way to Data Y in Memory; Load2 Y is served from the Internal Buffers]
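Store forwarding can be sketched as a small software model of a store buffer. This is an illustration of the concept only: the class name, the youngest-first search, and the choice to drain the buffer on a partial overlap are assumptions of this sketch, and it ignores the size and alignment rules covered on the following slides.

```python
class StoreBuffer:
    """Toy model: stores wait in a buffer before they reach memory."""

    def __init__(self):
        self.pending = []  # (address, data bytes) in program order

    def store(self, address, data):
        self.pending.append((address, bytes(data)))

    def drain(self, memory):
        """Commit all pending stores to memory in program order."""
        for address, data in self.pending:
            memory[address:address + len(data)] = data
        self.pending.clear()

    def load(self, address, size, memory):
        """Return (data, forwarded). Search the youngest store first."""
        end = address + size
        for st_addr, st_data in reversed(self.pending):
            st_end = st_addr + len(st_data)
            if st_addr <= address and end <= st_end:
                offset = address - st_addr
                return st_data[offset:offset + size], True
            if st_addr < end and address < st_end:
                # Partial overlap: no forwarding, wait for stores to commit.
                self.drain(memory)
                break
        return bytes(memory[address:end]), False


memory = bytearray(64)
buffer = StoreBuffer()
buffer.store(16, b"\x01\x02\x03\x04")               # Store1 Y
hit = buffer.load(18, 2, memory)                    # Load2 Y, inside the store
miss = buffer.load(32, 4, memory)                   # unrelated address
partial = buffer.load(18, 4, memory)                # runs past the store
```

The first load is answered from the buffer before the store has reached memory, which is the case the slide describes; the other two fall back to memory.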






Advanced Memory Access / Stores Forwarding: Aligned Store Cases

Aligned loads shown under each store size:

Store | Loads
store 128 bit | 1 × load 128 bit, 2 × load 64 bit, 4 × load 32 bit, 8 × load 16, 16 × ld 8
store 64 bit | 1 × load 64 bit, 2 × load 32 bit, 4 × load 16, 8 × ld 8
store 32 bit | 1 × load 32 bit, 2 × load 16, 4 × ld 8
store 16 | 1 × load 16, 2 × ld 8






Advanced Memory Access / Stores Forwarding: Unaligned Cases

Note that unaligned store forwarding does not occur when the load crosses a cache line boundary

Store | Loads
store 64 bit | 1 × load 64 bit; load 32 bit‡, load 32 bit; load 16‡, 3 × load 16; 8 × ld 8
store 32 bit | load 32 bit‡; load 16‡, load 16; 4 × ld 8
store 16 | load 16‡; 2 × ld 8

Legend: each box is marked either “Store forwarded to load” or “No forwarding”

‡: No forwarding if the load crosses a cache line boundary

Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding






Agenda

Introduction


Notable features








Optimizing for Instruction Fetch and PreDecode

Avoid “Length Changing Prefixes” (LCPs)

• Affects instructions with immediate data or offset

• Operand Size Override (66H)

• Address Size Override (67H) [obsolete]

• LCPs change the length decoding algorithm – increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)

• The REX (EM64T) prefix (4xH) is not an LCP

• The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred






Optimizing for Instruction Queue

Includes a “Loop Stream Detector” (LSD)

• Potentially very high bandwidth instruction streaming

• A number of requirements to make use of the LSD

• Maximum of 18 instructions in up to four 16-byte packets

• No RET instructions (hence, little practical use for CALLs)

• Up to four taken branches allowed

• Most effective at 70+ iterations

• LSD is after PreDecode so there is no added cost for LCPs

• Trade-off LSD with conventional loop unrolling






Optimizing for Decode

Decoder issues up to 4 uOps for renaming/ allocation per clock

• This creates a trade off between more complex instruction uOps versus multiple simple instruction uOps

• For example, a single four uOp instruction is all that can be renamed/allocated in a single clock

• In some cases, multiple simple instructions may be a better choice than a single complex instruction

• Single uOp instructions allow more decoder flexibility

• For example, 4-1-1-1 can be decoded in one clock

• However, 2-2-2-1 takes three clocks to decode
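The 4-1-1-1 versus 2-2-2-1 example can be reproduced with a simplified decoder model. The model assumes one decoder that accepts instructions of up to four uOps and three decoders that accept only single-uOp instructions, with instructions taken strictly in program order; these rules are inferred from the two examples above rather than from a full hardware description, and the function name is hypothetical.

```python
def decode_clocks(uops_per_instruction, simple_decoders=3, complex_limit=4):
    """Count clocks needed to decode a sequence of instructions.

    Each clock, decoder 0 takes the next instruction (up to complex_limit
    uOps); the simple decoders then take following instructions only while
    they are single-uOp instructions.
    """
    clocks = 0
    i = 0
    n = len(uops_per_instruction)
    while i < n:
        if uops_per_instruction[i] > complex_limit:
            raise ValueError("instructions above complex_limit are not modelled")
        clocks += 1
        i += 1  # decoder 0 takes this instruction
        taken = 0
        while i < n and taken < simple_decoders and uops_per_instruction[i] == 1:
            i += 1
            taken += 1
    return clocks


print(decode_clocks([4, 1, 1, 1]))  # 1
print(decode_clocks([2, 2, 2, 1]))  # 3
```

Both sequences contain seven uOps, so the difference comes purely from instruction ordering: each two-uOp instruction has to wait for decoder 0.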






Optimizing for Execution

Up to six uOps can be dispatched per clock

• “Store Data” and “Store Address” dispatch ports are combined on the block diagram

Up to four results can be written back per clock

Single clock latency operations are best

• Differing latency operations can create writeback conflicts

• Separate multiple-clock uOps with several single uOp instructions

• Typical instructions here: ADC/SBB, RMW, CMOVcc

• In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)

When equivalent, PS preferred to PD (LCP)

• For example, MOVAPS over MOVAPD, XORPS over XORPD






Optimizing for Execution (cont.)

Bypass register “access” preferred to register reads

Partial register accesses often lead to stalls

• Register size access that ‘conflicts’ with recent previous register write

• Partial XMM updates subject to dependency delays

• Partial flag stall can occur, too → much higher cost

• Use TEST instruction between shift and conditional to prevent

• Common zeroing instructions (e.g., XOR reg,reg) don’t stall

Avoid bypass between execution domains

• For example: FP (ADDPS) and logical ops (PAND) on XMMn

Vectorization: careful packing/unpacking sequence

• Use MXCSR’s FZ and DAZ controls as appropriate






Optimizing for Memory

Software prefetch instructions

• Can reach beyond a page boundary (including page walk)

• Prefetches only when it completes without an exception

General techniques to help these prefetchers

• Organize data in consecutive lines

• In general, increasing addresses are more easily prefetched






Summary

What has been covered

• Notable features of Core® Micro-architecture

• Wide Dynamic Execution

• Advanced Memory Access

• Advanced Smart Cache

• Advanced Digital Media Boost

• Power Efficient Support

• Core® Micro-architecture components

• Front End

• OOO execution core

• Memory sub-system











Platform

Intel provides most of the silicon on any computer

Classical platform partition

• CPU – Computation

• MCH – high speed IO

• ICH – low speed IO

Graphics speed and memory latencies will require different partition

This presentation focuses on the core microarchitecture

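[Figure: platform block diagram. CPU (two Cores and LLC) connected over the FSB to the MCH, and the MCH over DMI to the ICH. Labels: MEM/DDR, Graphics, PEG, PCIe, Display, Analog, TV out, HD video, ME, Wireless, PCI (IO), SATA, USB, KBRD, others, Legacy & Debug I/O.]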






Intel® 64 = Extending IA-32 to 64 Bit

Added to Intel Xeon™ and Pentium® 4 processors in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on the Intel® Core™ architecture

IA-32 with 64-Bit Extension Technology adds:

• Additional Registers: 8 SSE & 8 General Purpose

• Double Precision (64-bit) Integer Support

• Extended Memory Addressability: 64-Bit Pointers, Registers






Intel® 64 - New Modes of Operation

Mode | OS Req’d | Compile required | Default Addr Size | Default Operand Size | 64-bit IP | RIP Rel. | New Regs | GPR Width
Legacy Mode (IA32 Mode) | Legacy 32-bit or 16-bit OS | No | 32 | 32 | No | No | No | 32
Long Mode: Compatibility Mode | New 64-bit OS | No | 32 | 32 | Yes | No | No | 32
Long Mode: 64-bit Mode | New 64-bit OS | Yes | 64 | 32 | Yes | Yes | Yes | 64

The slide also shows the value 16 in several cells for 16-bit code.






Registers : Extensions and Additions

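[Figure: register file in 64-bit mode.
• General purpose registers: EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP (bits 31..0) extended to RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP (bits 63..0); new registers R8 to R15
• Instruction pointer: EIP extended to RIP
• SSE registers: XMM0 to XMM7 (bits 127..0), plus new XMM8 to XMM15
• x87/MMX registers: bits 79..0]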






Registers : Availability in different modes






64-bit Mode of Operation

Default data size is 32-bits

• Override to 64-bits using new REX prefix

All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable

REX prefixes

• A family of 16 prefixes, encoded 0x40-0x4F

• Allows the use of general purpose registers as 64-bits

• Allows the use of new registers (like r8-r15)

Instructions that set a 32 bit register automatically zero extend the upper 32-bits






REX Prefix

A new instruction-prefix byte used in 64-bit mode

• Specify the new GPRs and SSE registers

• Specify a 64-bit operand size.

• Specify extended control registers (used by system software)

An instruction can have only one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.

The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
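A short sketch of how the REX byte is laid out. The prefix has the bit pattern 0100WRXB: W selects a 64-bit operand size, and R, X and B each extend one register field so that the new registers can be addressed. The helper name `decode_rex` is hypothetical.

```python
def decode_rex(byte):
    """Split a REX prefix byte (0x40-0x4F) into its W, R, X, B flags.

    Returns None when the byte is not in the REX range.
    """
    if not 0x40 <= byte <= 0x4F:
        return None
    return {
        "W": bool(byte & 0x8),  # 64-bit operand size
        "R": bool(byte & 0x4),  # extends the ModRM reg field
        "X": bool(byte & 0x2),  # extends the SIB index field
        "B": bool(byte & 0x1),  # extends the ModRM r/m, SIB base or opcode reg field
    }


# 48 01 D8 encodes "add rax, rbx": the same bytes without 0x48 are "add eax, ebx".
rex_w = decode_rex(0x48)
not_rex = decode_rex(0x90)
rex_count = sum(1 for b in range(256) if decode_rex(b) is not None)
```

The count of valid prefix bytes comes out at 16, matching the "family of 16 prefixes" stated earlier.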






Physical and Linear Addressing

Linear Addressing

• Initial Intel® 64 implementations support 48 bits of virtual addressing.

• Addresses are required to be in canonical form – bits 47 thru 63 must all be 1 or all be 0.

Physical Addressing

• Initial NetBurst™ Intel® 64 implementations support 36 bits; all current processors support at least 40 bits

• Entries in page tables expanded for up to 52 bits of physical address.

81





Intel®64 - Large Memory Considerations

Canonical addressing for 64 bit addresses

• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits

• Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros

• Canonical addresses are a requirement

• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
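The canonical-form rule can be checked with a few lines of C. This is a sketch under the slide's assumption of 48-bit virtual addressing; the function name `is_canonical48` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* With 48-bit virtual addressing, bits 63..47 (17 bits) must be
   all zeros or all ones for the address to be canonical. */
static bool is_canonical48(uint64_t addr)
{
    uint64_t upper = addr >> 47;            /* keep bits 63..47 */
    return upper == 0 || upper == 0x1FFFFu; /* 17 zero bits or 17 one bits */
}
```

The highest canonical address in the lower half is 0x00007FFFFFFFFFFF; the next address, 0x0000800000000000, is not canonical, and the upper half starts again at 0xFFFF800000000000.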


82





Introducing SIMD: Single Instruction Multiple Data

Scalar processing

• traditional mode

• one operation produces one result

• X + Y → X + Y

SIMD processing

• with SSE / SSE2

• one operation produces multiple results

• [x3 x2 x1 x0] + [y3 y2 y1 y0] → [x3+y3 x2+y2 x1+y1 x0+y0]




X86 Register Sets

SSE registers introduced first in Pentium® 3

SSE Registers (128 bits, xmm0 … xmm7)

• Eight 128-bit registers

• Hold data only:

• 4 x single FP numbers

• 2 x double FP numbers

• 128-bit packed integers

• Direct access to the registers

• Use simultaneously with FP / MMX Technology

MMX™ Technology / IA-FP Registers (80/64 bits, st0 … st7 / mm0 … mm7)

• Eight 80/64-bit registers

• Hold data only

• Stack access to FP0..FP7

• Direct access to MM0..MM7

• No MMX™ Technology / FP interoperability

IA-INT Registers (32 bits, eax … edi)

• Fourteen 32-bit registers

• Scalar data & addresses

• Direct access to regs

84





Beginning in 2008: ~50 new instructions in 13 groups

All function in 32-bit and 64-bit modes

Improvements in Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance

New Instructions Added to Intel® Processors (Instruction Set Extensions)

Date     Extension                                  New instructions   Process (nm)
Jan-97   MMX™                                       56                 350
Feb-99   Streaming SIMD Extensions (SSE)            70                 250
Dec-00   Streaming SIMD Extensions 2 (SSE2)         144                180
Feb-04   Streaming SIMD Extensions 3 (SSE3)         13                 90
Jul-06   Supplemental SSE3 (SSSE3)                  32                 65
2008+    Future Intel instruction set extensions    ~50                45
         (Future SSE-4)

85





SSE and SSE-2 Data Types

SSE

• 4x floats

SSE-2

• 16x bytes

• 8x 16-bit shorts

• 4x 32-bit integers

• 2x 64-bit integers

• 1x 128-bit(!) integer

• 2x doubles


2001 PTE Engineering Enabling Conference



SSE-Instructions Set Extensions

Introduced by Pentium® 3 in 1999; now frequently called SSE-1

Only new data type supported: 4x32Bit (Single Precision) floating point data

Some 70 instructions

• Arithmetic, compare, convert operations on SSE SP FP data

• PACKED, UNPACKED

• Data load/store

• Prefetch

• Extension of MMX

• Streaming Store (store without using cache in between)

• …

87





SSE Sample: Branch Removal

R = (A < B) ? C : D //remember: everything packed

A:               0.0      0.0      -3.0     3.0
B:               0.0      1.0      -5.0     5.0
cmplt (A < B):   00000    11111    00000    11111
and with C:      00000    c2       00000    c0
nand with D:     d3       00000    d1       00000
or:              d3       c2       d1       c0
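The compare / and / nand / or sequence of this slide maps directly onto SSE intrinsics. The following is a minimal sketch, assuming an x86 compiler that provides `<xmmintrin.h>`; the helper name `select_lt` is hypothetical.

```c
#include <xmmintrin.h>

/* Packed R = (A < B) ? C : D without any branch. */
static __m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask   = _mm_cmplt_ps(a, b);       /* all ones where a < b, else all zeros */
    __m128 from_c = _mm_and_ps(mask, c);      /* keep c where the mask is set */
    __m128 from_d = _mm_andnot_ps(mask, d);   /* keep d where the mask is clear */
    return _mm_or_ps(from_c, from_d);         /* merge the two halves */
}
```

With the slide's inputs A = (0.0, 0.0, -3.0, 3.0) and B = (0.0, 1.0, -5.0, 5.0), listed from element 3 down to element 0, the result is (d3, c2, d1, c0).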





SSE-2 Instructions Set Extensions

Introduced by Intel® Pentium®4 processor in 2000

Some 140 new instructions

Added double precision floating point data (2x64Bit) and all related instructions including conversion

Again some extensions to MMX

Added all possible combinations of integer data to SSE ( 1x128, 2x64, 4x32, 8x16, 16x8) and related operations




SIMD Single vs. SIMD Double

4 x Single Precision: SSE-1

• SIMD SP FP Operand = 4 Elements: X3 X2 X1 X0 (bits 127..0)

• Element = SP FP Number: S (bit 31), Exponent (bits 30..23), Significand (bits 22..0)

2 x Double Precision: SSE-2

• SIMD DP FP Operand = 2 Elements: X1 X0 (bits 127..0)

• Element = DP FP Number: S (bit 63), Exponent (bits 62..52), Significand (bits 51..0)

90





Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion

SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared

[x1, x0] → [0, 0, (int)x1, (int)x0]

__m128d x;
__m128i ix;
ix = _mm_cvtpd_epi32(x);

SIMD Int → SIMD Double: conversion from two lower ints

[?, ?, ix1, ix0] → [(double)ix1, (double)ix0]

x = _mm_cvtepi32_pd(ix);

91





SSE3: No new Data Types but new Instructions

Usage                        Instructions
SIMD FP using AOS format*    HADDPD, HSUBPD, HADDPS, HSUBPS
Thread Synchronization       MONITOR, MWAIT
Video encoding               LDDQU
Complex arithmetic           ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
FP to integer conversions    FISTTP

* Also benefits Complex and Vectorization

92





Streaming SIMD Extensions 3: 13 new instructions

Three have limited use for application performance improvement

• FISTTP - X87 to integer conversion (requires –longdouble switch)

• MONITOR/MWAIT - thread synchronization

• Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

The other ten have some potential for specific application domains

93





SSE-3 Sample Complex Arithmetic: ADDSUBPS

ADDSUBPS OperandA OperandB

• OperandA (xmm register; 4 data elements)

• a3, a2, a1, a0

• OperandB (xmm register or memory address; 4 data elements)

• b3, b2, b1, b0

• Result (stored in OperandA)

• a3+b3, a2-b2, a1+b1, a0-b0

__m128 _mm_addsub_ps(__m128 a, __m128 b)

OperandA:    a3       a2       a1       a0
OperandB:    b3       b2       b1       b0
Operation:   Add      Sub      Add      Sub
Result:      a3+b3    a2-b2    a1+b1    a0-b0
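The per-element behaviour of ADDSUBPS can be modelled in plain C. This is a scalar sketch for illustration only (the real instruction works on one xmm register at a time); the function name `addsub_ps_model` is hypothetical.

```c
/* Scalar model of ADDSUBPS: odd elements are added, even elements are subtracted.
   Index 0 corresponds to a0/b0, index 3 to a3/b3. */
static void addsub_ps_model(const float a[4], const float b[4], float result[4])
{
    for (int i = 0; i < 4; i++) {
        if (i & 1)
            result[i] = a[i] + b[i];  /* elements 1 and 3: add */
        else
            result[i] = a[i] - b[i];  /* elements 0 and 2: subtract */
    }
}
```

The alternating subtract/add pattern is what makes the instruction useful for complex multiplication, where the real part needs a difference and the imaginary part a sum.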

94





Sample SSSE-3 Inst.: Byte Permute

PSHUFB mm, mm/m64

PSHUFB xmm, xmm/m128

• A complete byte-granularity permutation

• The source operand is used as the control field (variable control)

• The destination operand gets permuted

• Each byte of the source field selects the origin of the corresponding destination byte

• Also includes force-byte-to-zero flag (bit 7)

Example (64-bit form, byte 7 on the left, byte 0 on the right):

dest (before):   0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
src (control):   0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (after):    0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
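The permutation rule can be written as a scalar C model. This is an illustrative sketch, not the hardware implementation: the function name `pshufb_model` is hypothetical, and `n` is assumed to be 8 (mm form) or 16 (xmm form).

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PSHUFB. dest is permuted in place; ctrl is the source
   (control) operand. n must be 8 (MMX form) or 16 (SSE form). */
static void pshufb_model(uint8_t *dest, const uint8_t *ctrl, size_t n)
{
    uint8_t tmp[16];
    for (size_t i = 0; i < n; i++) {
        if (ctrl[i] & 0x80u)
            tmp[i] = 0;                        /* bit 7 set: force byte to zero */
        else
            tmp[i] = dest[ctrl[i] & (n - 1)];  /* low bits select the origin byte */
    }
    for (size_t i = 0; i < n; i++)
        dest[i] = tmp[i];
}
```

Feeding the three example rows of this slide through the model (first row as the destination, second row as the control bytes, array index 0 holding the rightmost byte) reproduces the third row.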





Ways to SSE/SIMD programming

Coding using SSE/SSE2/3/4 assembler instructions

• Very tedious (manual scheduling) – discouraged: Don’t do it!

• E.g.: How do you exploit the benefits of now having 16 instead of 8 SSE registers for Intel® 64 without maintaining two versions?

Intel® compiler’s C/C++ SIMD intrinsics

• No need to take care of register allocation, scheduling etc.

Intel® compiler’s C++ Vector Class Library

• Use this if you are heavily into C++ classes

Vectorizer of Intel® C++ and Fortran Compilers

• Recommended for most cases – easy and efficient

Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)

96




Compiler Based Vectorization - Processor Specific

Linux* switch: Generate Code and Optimize for

• -xP, -axP: Intel® processors with SSE3 capability including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3

• -xN, -axN: Pentium® 4 processors in 32 bit mode, including code generation for MMX, SSE and SSE2 - deprecated switch: use xW instead

• -xK, -axK: Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE

• -xW, -axW: Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2

• -xT, -axT: Intel® processors with MNI capability – Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI

• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE-2

97





Intel® Core® Micro-architecture Notable Features (cont.) New Instructions

Instruction name: Description

• PALIGNR mm, mm/m64, imm8 / PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 (8 in the 64 bit case) bytes from the pair [dst, src] and store them to the dst register.

• PSHUFB mm, mm/m64 / PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.

• PMULHRSW mm, mm/m64 / PMULHRSW xmm, xmm/m128: Signed 16 bit multiply, return high bits.

• PMADDUBSW mm, mm/m64 / PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)

• phsubw/d/sw mm, mm/m64 / phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.

• phaddw/d/sw mm, mm/m64 / phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.

• pabsb/w/d mm, mm/m64 / pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.

• psignb/w/d mm, mm/m64 / psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.

98





Dependencies and Bypasses

“Read-after-Write” Dependency - 1 clock stall, assuming the register file can be written through

add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D D E W

“E to D” Bypass - saves the clock penalty

add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D E W

Long Latency operations

load [ecx+edi] → eax    F D E E E W
add ebx, eax → ebx        F D D D E W

99





Fighting Stalls: Branch Handling

Given the code:

for (i=100, a=0; i>0; i--) a+=B[i];

Compiler would generate

      // eax initialized with zero, edi initialized with 100
loop: load B[edi] → ebx     // read B[i] from memory
      add eax, ebx → eax    // a+=B[i]
      add edi,-1 → edi      // i-=1
      jnz edi, loop
      store eax → a         // store result

100





Fighting Stalls: Branch Handling (cont.)

load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F D E W
store eax → a                 F D E W
xxx                             F D E W
load B[edi] → ebx                 F D E W

Only after the branch Execute stage do we know that the next fetch was wrong

• Need to flush the pipe

• IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)

• ‘Pipe break’ penalty = 2 clocks

• Adding a stage?: IPC = 0.57, ~14% slower!

Prolonging the pipeline achieves higher frequencies, however the pipe break penalty increases! MUST solve the pipe break penalty problem!
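The IPC figures on this slide follow from simple arithmetic, shown here as a C sketch. The function name `loop_ipc` is hypothetical, and the model assumes one instruction per clock plus a fixed pipe break penalty per loop iteration.

```c
/* IPC of a loop body that issues one instruction per clock and then
   loses penalty_clocks to a pipe break at the loop branch. */
static double loop_ipc(int instructions, int penalty_clocks)
{
    return (double)instructions / (double)(instructions + penalty_clocks);
}
```

With 4 instructions and a 2-clock penalty this gives 4/6 ≈ 0.67 (the slide rounds down to 0.66); one more pipeline stage raises the penalty to 3 clocks and gives 4/7 ≈ 0.57, which is about 14% lower.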

101





Fighting Stalls: Branch Handling (cont.)

H/W can ‘learn’ about SW behavior

• Same branch goes same direction in most cases

• Learn branch address and target

• Branch Target Buffer (BTB)

• Predict based on branch history, surrounding branch behavior, loop behavior.

• We are at ~95% correct prediction.

• Looks in BTB while fetching instruction

• Lee&Smith or Yeh&Patt algorithms

New (and correct) pointer calculated in Fetch stage of branch

load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F/P D E W
load B[edi] → ebx             F D E W
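How hardware can "learn" a branch direction is easiest to see with a 2-bit saturating counter per branch, one of the classic history-based schemes. The sketch below is a textbook illustration, not the predictor of any Intel processor; the table size and all names (`predictor`, `predict_taken`, `predictor_update`, `loop_mispredictions`) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PRED_ENTRIES = 16 };  /* hypothetical table size */

typedef struct {
    uint8_t counter[PRED_ENTRIES];  /* 0,1 = predict not taken; 2,3 = predict taken */
} predictor;

static void predictor_init(predictor *p)
{
    for (int i = 0; i < PRED_ENTRIES; i++)
        p->counter[i] = 1;  /* start in the weakly not-taken state */
}

static unsigned pred_index(uint32_t branch_addr)
{
    return branch_addr % PRED_ENTRIES;  /* index the table by branch address */
}

static bool predict_taken(const predictor *p, uint32_t branch_addr)
{
    return p->counter[pred_index(branch_addr)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating at 0 and 3. */
static void predictor_update(predictor *p, uint32_t branch_addr, bool taken)
{
    uint8_t *c = &p->counter[pred_index(branch_addr)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}

/* Run a loop-closing branch: taken on every iteration except the last. */
static int loop_mispredictions(predictor *p, uint32_t branch_addr, int iterations)
{
    int misses = 0;
    for (int i = 0; i < iterations; i++) {
        bool taken = (i != iterations - 1);
        if (predict_taken(p, branch_addr) != taken)
            misses++;
        predictor_update(p, branch_addr, taken);
    }
    return misses;
}
```

For the 100-iteration loop of the earlier slide this model mispredicts only the first and the last iteration on a cold start, and only the loop exit when the same loop runs again.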

102





Advanced Pipeline Techniques

Limitations of the Typical Pipeline Scheme

• IPC is theoretically limited to 1

• In practice IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction) etc.

• Pipeline stages are frequently not balanced

• Cycle Time (Tc) is determined by the longest pipeline stage

Advanced Pipeline Techniques

• Super pipeline

• Super-scalar

103





Advanced Pipeline Techniques (cont.)

Super pipeline: shorter stages allow higher frequency

F1 F2 D1 D2 E1 E2 W1 W2
   F1 F2 D1 D2 E1 E2 W1 W2
      F1 F2 D1 D2 E1 E2 W1 W2

Super-scalar: perform more in a single cycle

F D E W
F D E W
  F D E W
  F D E W

104





Fighting stalls: Out Of Order Execution (OoO)

Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm )

1. Instruction Fetch and Decode.

2. Instruction queue @ Reservation Station.

3. Instruction

• waits in the queue until all input operands are available

• leaves the queue before earlier, older instructions.

4. Instruction Execution

5. Results are queued.

6. Instruction Reorder and Writeback.

Avoids the stall that occurs at this stage in an in-order processor

105





Fighting stalls: Register Renaming

Creates new opportunities for OOO execution

• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies = hazards.

Architectural vs physical registers dispatch

Before renaming:

MULTD F4,F2,F2    reads from F2
ADDD F2,F0,F6     writes to F2

After renaming:

MULTD F4,F2,F2
ADDD F8,F0,F6     (assume F8 is unused)

1. mov eax, [m1]
2. add eax, 2
3. mov [m2], eax
4. mov eax, [m3]
5. add eax, 4
6. mov [m4], eax

4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!
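A register alias table can be sketched in a few lines of C to show why instructions 4 to 6 become independent of 1 to 3. This is a simplified model with hypothetical names (`rename_table`, `rename_src`, `rename_dst`); it assumes an unlimited supply of physical registers and never frees one.

```c
enum { NUM_ARCH_REGS = 8 };  /* eax..edi and friends in this toy model */

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural register -> current physical register */
    int next_phys;           /* next unused physical register (never recycled here) */
} rename_table;

static void rename_init(rename_table *t)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        t->map[i] = i;
    t->next_phys = NUM_ARCH_REGS;
}

/* A source operand reads whatever physical register currently holds the value. */
static int rename_src(const rename_table *t, int arch_reg)
{
    return t->map[arch_reg];
}

/* Every write gets a fresh physical register, which removes WAW and WAR hazards. */
static int rename_dst(rename_table *t, int arch_reg)
{
    t->map[arch_reg] = t->next_phys++;
    return t->map[arch_reg];
}
```

Renaming the six-instruction sequence gives the second `mov eax, [m3]` a different physical destination than the one read by `mov [m2], eax`, so only the true read-after-write chains 1→2→3 and 4→5→6 remain.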

106





Fighting Stalls: Re-Order Buffer (ROB)

Mechanism for renaming and retirement

Table contains instructions in program order

• Instructions are entered in order

• Registers renamed by the entry number

• Once assigned: execution order unimportant

• After execution: entries marked

• An executed entry can be “retired” once all prior instructions have retired. On retirement:

• Update “real registers” with value of renamed regs

• Update memory

• Leave the ROB
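In-order retirement from a re-order buffer can be modelled as a circular queue in which only the oldest entry may leave. The sketch below is an illustration with hypothetical names and a hypothetical size of 8 entries; it tracks only the "executed" mark and omits register and memory updates.

```c
#include <stdbool.h>

enum { ROB_SIZE = 8 };  /* hypothetical size for the example */

typedef struct {
    bool done[ROB_SIZE];  /* entry has finished execution */
    int head;             /* oldest entry still in the ROB */
    int count;            /* number of occupied entries */
} rob;

static void rob_init(rob *r)
{
    r->head = 0;
    r->count = 0;
}

/* Enter an instruction in program order; returns its entry number or -1 if full. */
static int rob_alloc(rob *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int idx = (r->head + r->count) % ROB_SIZE;
    r->done[idx] = false;
    r->count++;
    return idx;
}

/* Execution may complete in any order. */
static void rob_mark_done(rob *r, int idx)
{
    r->done[idx] = true;
}

/* Retire from the head while the oldest entry is done; returns how many retired. */
static int rob_retire(rob *r)
{
    int retired = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```

An instruction that finishes early therefore waits in the ROB until every older instruction has retired.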

107





Fighting Stalls: Reservation Station(s)

Pool(s) of all “not yet executed” instructions

Maintains operand status “ready / not-ready”

Each cycle, executed instructions make more operands “ready”

Instructions whose operands are all “ready” can be “dispatched” for execution

Dispatcher chooses which of the “ready” instructions will be executed next

108





Fighting Stalls: Memory Order Buffer (MOB)

Idea - allow out-of-order execution among memory operations

Problem - memory dependencies cannot be fully resolved statically (memory disambiguation)

Structure similar in concept to ROB

Every access is allocated an entry

Address & data (for stores) are updated when known

Load is checked against all previous stores

109





Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability - Split Busses (core power feature)

Many buses are sized for worst case data

• x86 instruction of 15 bytes

• ALU can write back 128 bits

Improved Energy Efficiency

110





Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability - Split Busses (core power feature)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining C dynamic closer to thinner buses

Improved Energy Efficiency

111





Agenda

Introduction


Notable features

Micro-architecture drill-down

• Front End




112





Intel® Core® Micro-architecture Overview

[Block diagram. Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit. Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement). Memory: 1st Level Cache (Data), 2nd Level Cache, Bus Unit, System Bus.]

113





Intel® Core® Micro-architecture Drill-down

[Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, Reservation Station, integer / FP / SIMD (3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]

114





Example Code to Be Used

…


mulps xmm0, xmm0


cmp EAX, 100000

jge label

…

115









116






Instruction preparation before execution

• Instruction Fetch Unit

• Instruction Queue

• Instruction Decode Unit

• Branch Prediction Unit

117









118






Prefetches instructions that are likely to be executed

Caches frequently-used instructions

Predecodes and Buffers instructions




119





Instruction Fetch Unit (cont.)

I-Cache (Instruction Cache)

• 32 KBytes / 8-way / 64-byte line

• 16 aligned bytes fetched per cycle

ITLB (Instruction Translation Lookaside Buffer)

• 128 4k pages, 8 2M pages

Instruction Prefetcher

• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers

Instruction Pre-decoder

• Instruction Length Decode (predecode)

• Avoid Length Changing Prefixes (LCP); for example, avoid in a loop: MOV dx, 1234h

• The REX (EM64T) prefix (4xH) is not an LCP

Instruction format: Instruction Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate

120










121





Instruction Queue

Buffer between instruction pre-decode unit and decoder

• up to six predecoded instructions written per cycle

• 18 Instructions contained in IQ

• up to 5 Instructions read from IQ

Potential Loop cache

Loop Stream Detector (LSD) support

• Re-use of decoded instruction

• Potential power saving




122










123





Instruction Decode

Decode the instructions into micro-ops

Ready for the execution in OOO core




124





Instruction Decode

Decoders

Features

• Macro-fusion

• Micro-fusion

• Stack Pointer Tracking


125





Instruction Decode / Decoders

Instructions converted to micro-ops (uops)

• 1-uop includes load+op, stores, indirect jump, RET...

4 decoders: 1 “large” and 3 “small”

• All decoders handle “simple” 1-uop instructions

• One large decoder handles instructions up to 4 uops

All decoders work in parallel

• Four(+) instructions / cycle

Micro-Sequencer takes over for long flows (the large decoder handles instructions of 2 to 4 uops; the uCode ROM handles more complex ones)


126





Code Sequence in Front End

Instruction Queue (IQ):

cmp EAX, 100000
jne label
movps [EAX+240], xmm0
mulps xmm0, xmm0
addps xmm0, [EAX+16]

• These instructions took more than one fetch, as they are 22 bytes

• IQ buffers them together

• All instructions are decodable by all decoders

• CMP and adjacent JCC are “fused” into a single uop

• Up to 5 instructions decoded per cycle

Decoders:

Large (dec0):   cmpjne EAX, 100000, label
small (dec1):   sta_std [EAX+240], xmm0
small (dec2):   mulps xmm0, xmm0, xmm0
small (dec3):   load_add xmm0, xmm0, [EAX+16]

127








128





Instruction Decode / Macro - Fusion

Roughly 15% of all instructions are conditional branches.

Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

Not supported in EM64T long mode


[Diagram: cmpjae eax, [mem], label → Scheduler → Execution (Branch Eval) → flags and target to Write back]

129





Instruction Decode / Macro-Fusion Absent

Read four instructions from Instruction Queue

Each instruction gets decoded into separate uops

Enabling Example

for (int i=0; i<100000; i++) {

…

}

cmp eax, 100000

jge label

[Diagram: the Instruction Queue (…, mulps xmm0, xmm0, …, cmp eax, 100000, jge label) feeds decoders dec0 to dec3 over Cycle 1 and Cycle 2; cmp and jge occupy separate decoder slots]


130





Instruction Decode / Macro-Fusion Presented

Read five instructions from Instruction Queue

Send fusable pair to single decoder

Single uop represents two instructions

Enabling Example

for (unsigned int i=0; i<100000; i++) {

…

}

cmp eax, 100000

jae label

[Diagram: the Instruction Queue (…, mulps xmm0, xmm0, …, cmp eax, 100000, jae label) feeds decoders dec0 to dec3 in Cycle 1; the pair is decoded as the single uop cmpjae eax, 100000, label]


131





Instruction Decode / Macro – Fusion (cont.)

Benefits

• Reduces latency

• Increased renaming

• Increased retire bandwidth

• Increased virtual storage

• Power savings

Enabling Greater Performance & Efficiency


132








133





Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline


134





Instruction Decode / Micro-Fusion (cont.)

u-ops of a Store “movps [EAX+240], xmm0”

• sta eax+240

• std xmm0, [eax+240]

Fused into a single u-op: st xmm0, [eax+240]

135








136





Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)

ESP is calculated by dedicated logic

• No explicit Micro-Ops updating ESP

• Micro-Ops saving

• Power saving

[Diagram: PUSH EAX, PUSH EDX, POP EBX pass through Decoder 0, Decoder 1 … Decoder N; the tracked ESP delta (ESPd) takes the values 0, 4 and 8; recovery information is kept alongside]
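The idea of tracking the stack pointer at decode time can be sketched as a running delta that replaces per-instruction ESP updates. This is a conceptual model only, assuming 4-byte pushes and pops; the names `stack_op` and `track_esp_delta` are hypothetical and the real hardware details are not described on the slide.

```c
typedef enum { STACK_PUSH, STACK_POP } stack_op;

/* For each stack instruction, record the offset (relative to ESP at the start
   of the sequence) that its memory access uses. Returns the accumulated delta,
   which is all that must eventually be applied to the real ESP. */
static int track_esp_delta(const stack_op *ops, int n, int *offsets)
{
    int delta = 0;
    for (int i = 0; i < n; i++) {
        if (ops[i] == STACK_PUSH) {
            delta -= 4;          /* push: pre-decrement, then store */
            offsets[i] = delta;
        } else {
            offsets[i] = delta;  /* pop: load, then post-increment */
            delta += 4;
        }
    }
    return delta;
}
```

For the sequence PUSH EAX, PUSH EDX, POP EBX the accesses land at offsets -4, -8 and -8, and a single net adjustment of -4 remains, so no separate ESP-updating micro-op is needed per instruction in this model.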


137










138






Allow executing instructions long before the branch outcome is decided

• Superset of Prescott / Pentium-M features

• One taken branch every other clock

• Branch predictions for 32 bytes at a time, twice the width of the fetch engine




139





Branch Prediction Unit (cont.)

16-entry Return Stack Buffer (RSB)

Front end queuing of BPU lookups

Type of predictions

• Direct Calls and Jumps

• Indirect Calls and Jumps

• Conditional branches


140





Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:

• Indirect Branch Predictor

• Loop Detector

Branch mispredictions reduced by >20%


141









142





Core® Micro-architecture Execution Core


Accepts decoded u-ops, assigns resources, executes and retires u-ops

• Renamer

• Reservation station (RS)

• Issue ports

• Execution Unit

143





Execution Core Building Blocks

Renamer, RS, ROB

Execution Unit (ports)

• Floating Point: 0,1,5

• Integer: 0,1,5

• SIMD Integer, SIMD/Integer MUL: 0,1,5

Memory Sub-system (ports)

• Load: 2

• Store: 3,4


144





Rename and Resources

4 uops renamed / retired per clock

• one taken branch, any # of untaken

• one fxch per cycle

Uops written to RS and ROB

• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS

• RS waits for sources to arrive, allowing OOO execution

• Registers not “in flight” are read from the ROB during RS write

145





Issue Ports and Execution Units

6 dispatch ports from RS

• 3 execution ports (shared for integer / fp / simd)

• load

• store (address)

• store (data)

128-bit SSE implementation

• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)

• Port 1 has packed add (3 cycles, all precisions)

FP data has one additional cycle bypass latency

• Do not mix SSE FP and SSE integer ops on same register

Avoid:

addps xmm0, xmm1
pand xmm0, xmm3
addps xmm2, xmm0

Better:

addps xmm0, xmm1
addps xmm2, xmm0
pand xmm0, xmm3

146





The Out Of Order

• Each uop only takes a single RS entry

• load + add dispatches twice (load, then add)

• mulps dispatches once, when load + add is ready to write back

• sta + std dispatches twice

• sta (address) can fire as early as possible

• std must wait for mulps to write back

• cmpjne dispatches only once (functionality is truly fused)

• no dependency, can fire as early as it wants

147





Dispatching to OOO EXE

Intel® Core™ Microarchitecture – Execution Core

RS:

cmpjne EAX, 100000, label
sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]

Ports:

0   GP (incl FP mul)
1   GP (incl FP add)
2   Load
3   STA
4   STD
5   GP (incl jmp)

148





Advanced Memory Access

L1D: 3 clock latency, 1 clock throughput; L2: 14 clock latency, 2 clock throughput

Miss Latencies

• L1 miss that hits L2: ~10 cycles

• L2 miss, access to memory: ~300 cycles (server/FBD)

• L2 miss, access to memory: ~165 cycles (desktop/DDR2)

• C-step Broadwater is reported to have ~50 ns latency

Cache Bandwidth

• Bandwidth to cache ~8.5 bytes/cycle

Memory Bandwidth

• Desktop ~6 GB/sec/socket (Linux)

• Server ~3.5 GB/sec/socket


149





Optimizing for Intel® Core™Microarchitecture

Use CMP = employ both Cores

• Go to multithreading!

Prefer SSE as much as possible. If you didn’t do it so far, vectorize the code now!!

• Intel Compiler has very good vectorization engine

Align data and data layout (sequential)

• To align use __declspec(align (16)) float a[1000];
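`__declspec(align(16))` is the Microsoft/Intel compiler spelling. As a portable alternative, the same 16-byte alignment can be requested with standard C11 `alignas`; the sketch below assumes a C11 compiler, and the helper `is_aligned16` is a hypothetical name used only to check the result.

```c
#include <stdalign.h>
#include <stdint.h>

/* 16-byte aligned array, suitable for aligned 128-bit SSE loads and stores. */
static alignas(16) float a[1000];

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Every fourth float of `a` then starts a 16-byte block, so a loop that processes four floats per iteration from index 0 only touches aligned addresses.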

150





Optimizing for Intel® Core™Microarchitecture (advanced)

Use Intel VTune™ Performance Analyzer to reveal performance problems

• CPI

• Specific CPU events for Core-arch:

RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL etc. - see VTune help

151





Front End Issue Debugging

Look for Front End optimization only when code is FE bound

• Reservation station (RS) is the front end and allocation target

• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue

• If there are no issues in the FE, the RS should be full above 30% of the time

Front End typical issues:

• Code is too big to fit in the L1:

• When L2_IFETCH.SELF.MESI happens every 10-15 instructions

• Code that could have had a CPI of 1 will be around 2

• 14 cycles penalty for L1 demand miss

• Average instruction size above 6 bytes

• Happens typically with SSE code and more with EM64T

• Can have impact only in case of otherwise excellent CPI

• Code with length changing prefix issues (LCP)

• Penalty of 6 cycles or more

• Look at ILD_STALL VTune event

Front-End should not be the bottleneck.

Focus on Front End issues only if it is the issue.

152





Execution micro architecture

The busiest port may determine the potential execution speed

Single clock latency operations are best

• Different latency operations can create writeback conflicts, creating a bubble in the port

Look at the dependency chains to see the potential parallelism

• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports

• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound

• The ROB has 96 entries

• High RESOURCE_STALLS.ROB_FULL percentage only if

• Code has long latency instructions (L2 misses)

• Other code can be executed while waiting

Execution stage: The key for good performance.

Focus on port utilization and dependency chains

153





Execution micro architecture

The Divider is a big potential stall source

• DIV counts the number of divide operations executed

• IDLE_DURING_DIV counts the cycles with no port issue while the divider is busy

• Try to find some useful work to do in parallel with divide operations

Extra cycle latency for bypass between execution domains

• For example: FP (ADDPS) and logical ops (PAND) on XMMn

• DELAYED_BYPASS.FP

• DELAYED_BYPASS.LOAD

• DELAYED_BYPASS.SIMD

[Diagram: EXE ports 0, 1, 5 (SIMD Integer, Integer / SIMD MUL, Integer / Floating Point); port 2 load, port 3 store (address), port 4 store (data); Data Cache Unit with dtlb, memory ordering, store forwarding]

154





Enhancements and Optimization Opportunities

IP Prefetcher

• Prefetches stride loads associated with the same IP

• Uses a history table

• Use VTune events to identify misses where prefetches were expected

Memory Disambiguation

• Predicts when it is OK to fire a load before preceding stores with unknown address

• Misprediction triggers a pipeline flush and load restart

• Disambiguation is temporarily disabled if it fails frequently

• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address

• If the accesses are not to the same address, a possible reason for disambiguation not working: address collision with other load(s)

155





4k Aliasing

• OOO engine can fire a Load before a preceding Store if it does not collide with the Store’s address

• Address collision serializes execution

• Address checking uses only the last 12 bits (4K)

• False blocking - if the Load’s & Store’s addresses have a 4KB offset

• e.g. accessing large, power-of-two sized arrays in a loop

• Resolve 4K aliasing conflicts by changing memory layout

• VTune event LOAD_BLOCK.OVERLAP_STORE
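Whether two addresses can trigger this false blocking is easy to test in software. The sketch below is a simplified check that compares only the low 12 bits and ignores access sizes and partial overlaps; the function name `may_4k_alias` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a load and a store go to different addresses that nevertheless
   share the same low 12 bits, i.e. they differ by a multiple of 4 KB. */
static bool may_4k_alias(uintptr_t load_addr, uintptr_t store_addr)
{
    if (load_addr == store_addr)
        return false;  /* same address: a real dependency, not a false one */
    return ((load_addr ^ store_addr) & 0xFFFu) == 0;
}
```

A typical way to hit the condition is reading `src[i]` and writing `dst[i]` when the two arrays start a multiple of 4096 bytes apart; padding one array by a cache line changes the low address bits and removes the collision.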

Load block cases

• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched

• Store address unknown - LOAD_BLOCK.STA

• Loads blocked by a preceding store with unknown address

• Store data unknown - LOAD_BLOCK.STD

• Loads blocked by a preceding store with unknown data

• Loads blocked until retirement - LOAD_BLOCK.UNTIL_RETIRE

• This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)

Other Opportunities for Performance Gain in the memory sub-system