01 intel processor architecture core

Intel® Core™ Microarchitecture Intel ® Software College

Upload: sssuhas

Post on 06-May-2015


TRANSCRIPT

Page 1: 01 intel processor architecture core

Intel® Core™ Microarchitecture

Intel® Software College

Page 2: 01 intel processor architecture core


Copyright © 2006, Intel Corporation. All rights reserved.

Intel® Processor Micro-architecture - Core® microarchitecture

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Intel® Software College

Objectives

After completion of this module you will be able to describe

• Components of an IA processor

• Working flow of the instruction pipeline

• Notable features of the architecture

Page 3: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

Coding considerations

Page 4: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

Coding considerations

Page 5: 01 intel processor architecture core


Industrial Recognition

PC Format, May 2006

“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”

Intel Regains Performance Crown, Anandtech

“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”

Intel Reveals Conroe Architecture, Extremetech

“… And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”

Conroe Benchmarks - Intel Showing Big Strength, HotHardware.com

“… Intel is poised to change the face of the desktop computing landscape…”

Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com

“… the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session…”

Intel's Next Generation Microarchitecture Unveiled, Real World Tech

“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”

Page 6: 01 intel processor architecture core


Performance Summary

Intel® Core™ Microarchitecture dramatically boosts Intel platform performance

• Conroe & Woodcrest drive clear Desktop/Server performance leadership

• Merom extends Intel Mobile performance leadership

Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era

• Intel’s 3rd generation dual-core (while competition stuck on 1st generation)

• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient

The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations

Best Processor on the Planet: Energy-Efficient Performance¹

20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts¹

¹ Based on SPECint*_rate_base2000

Page 7: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

• Architecture VS Microarchitecture

• CISC VS RISC

• Performance Measurements

• Pipeline Design

• Power and Energy

• Chip Multi-Processing

Notable features

Micro-architecture tour

Coding considerations

Page 8: 01 intel processor architecture core


Architecture and Micro-architecture

What is Computer Architecture?

• Architecture is the set of features which are externally visible:

• Instruction set

• Registers

• Addressing modes

• Bus protocols

Intel Architectures (IA)

• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)

• X87 (Floating Point extension)

• MMX (Multi-Media extension)

• SSE, SSE2, SSE3 (Streaming SIMD Extensions)

• Intel® 64/EM64T (64-bit Integer extension of IA32)

• IA64 (Intel's new 64-bit architecture)

• Itanium/Itanium 2 processor family


Page 9: 01 intel processor architecture core


Architecture and Micro-architecture (cont.)

What is Micro-architecture?

• Same as µ-Architecture or u-Architecture

• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)

• Programs run faster → Improved Performance

• Reduced Power consumption → Extended Battery life

• H/W fits into Smaller Form Factor

Page 10: 01 intel processor architecture core


Intel® Architecture History

Architecture: Instruction set definition and compatibility

Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)

Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture

Examples: P5, P6, Intel NetBurst®, Banias

Processors: Productized implementation of Microarchitecture

Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M

* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing

Page 11: 01 intel processor architecture core


Mobile Microarchitecture

Intel® NetBurst®

+ New Innovations

Intel® Core™ Microarchitecture Processors

Intel® Core™ 2 Duo/Quad/Extreme processors

Page 12: 01 intel processor architecture core


RISC Approach to CPU design

Optimize H/W for common basic operations

• Fixed instruction length

• Shorter Execution Pipeline

• Ease of Instruction Level Parallelism

• Large number of registers

• Fewer memory accesses

• ‘Load/Store’ architecture

• Shorter Execution Pipeline

• Ease of advancing Loads

• Branch Hints

• Reduce pipeline flush events

• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support

• No ‘complex’ H/W instructions

• Handle exceptional conditions in S/W

Examples: MIPS, IBM Power and PowerPC, Sun SPARC

Achieve Maximum performance by right partitioning between H/W and S/W

(RISC = Reduced Instruction Set Computers)

Page 13: 01 intel processor architecture core


CISC Approach to CPU design

Rich architecture

• Variable length instructions.

• Complex addressing modes.

On-chip HW / SW partitioning required

• H/W keeps executing ‘simple’ stuff

• Complex instructions are ‘emulated’ using u-code routines from ROM

• More instructions treated as ‘simple’ as more H/W is available

COMPATIBILITY has some major advantages:

• Large (and forever increasing) software base

• Code development tools

• Expertise

• H/W - S/W spiral

Example: Intel IA32, Motorola 680X0

Maximize information passed to the HW

(CISC = Complex Instruction Set Computers)

Page 14: 01 intel processor architecture core


Performance Measurement

Performance is the reciprocal of the “Time of execution”:

$$\text{Performance} \approx \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_c}$$

Where:

L = Code Length (# of machine instructions)

CPI = Clock cycles Per Instruction

Tc = Clock period (nSecs)

Substitute:

IPC = Instructions Per Cycle = 1/CPI

F = Frequency = 1/Tc

$$\text{Performance} \approx \frac{IPC \cdot F}{L}$$

Ways to raise performance:

• Improve Timing (raise F)

• Arch Enhancements (reduce L)

• Improve ILP (raise IPC)
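As a rough illustration of how L, CPI, Tc, IPC and F relate, the sketch below evaluates both forms of the performance relation. The function names and the workload numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def execution_time(code_length, cpi, clock_period_s):
    """Time of execution = L * CPI * Tc, in seconds."""
    return code_length * cpi * clock_period_s


def performance(code_length, ipc, freq_hz):
    """Performance ~ IPC * F / L, i.e. the reciprocal of the execution time."""
    return ipc * freq_hz / code_length


# Hypothetical workload: 1e9 instructions, CPI = 0.5 (IPC = 2), 2 GHz clock.
L = 1_000_000_000
t = execution_time(L, cpi=0.5, clock_period_s=1 / 2e9)
p = performance(L, ipc=2.0, freq_hz=2e9)
print(f"time = {t:.3f} s, performance = {p:.1f} runs/s")
```

Doubling IPC, doubling F, or halving L each doubles the result, which is why the slide treats them as three separate levers.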

Page 15: 01 intel processor architecture core


Performance Measurement (cont.)

Performance considerations:

• Which Code/Application to run?

• Which OS?

• Which other components in the platform?

• Under which thermal conditions?

• Multithreading? Multiprocessing?

Benchmark examples

• Industry Standard

• SPEC (SPECint, SPECfp)

• TPC

• Commercial

• SysMark

• MobileMark

• PCMark

• Sandra

• ScienceMark

• Applications

• Video (Windows Media encoder, DivX)

• Audio (Lame MP3)

• Compression (RAR)

• Content creation (3DSM, Photoshop, Premiere)

• Latest Games (Doom III, FarCry, but changes fast)

• Specific industries use specific benchmarks

• Linux compilation, POVRay, LinPack, lmbench

Page 16: 01 intel processor architecture core


Design Considerations for Different Market Segments

Constraints:

• Thermally, area constrained → Desktop

• Unconstrained → Extreme

• Very area constrained → Value

• Thermally, Energy and Area constrained → Mobile

• Thermally, Energy constrained → Servers

Micro-architecture is the Art of Tradeoffs between:

• Schedule

• Requirements / Standards

• Performance

• Features

• Power / Energy

• Area / Cost

Page 17: 01 intel processor architecture core


Design Metrics

IPC = Instructions per Cycle

• The more the better

Latency – same as Response Time

• The time interval between

• when any request for data is made and

• when the data transfer completes

• The less the better

Throughput

• The amount of work completed by the system per unit of time.

• The more the better

• ops/sec

Page 18: 01 intel processor architecture core


CPU Pipeline

Break the work into smaller pieces

• Four basic stages of instruction life

• Fetch - bring instruction to core

• Decode - read operands from register

• Execute - perform the operation

• Writeback - save result to register

• Execution timing of simple instructions

(legend: “op src1,src2 → dst”)

add eax, ebx → eax F D E W

sub ecx, edx → ecx F D E W

Increased throughput

• increased number of completed instructions per cycle

Page 19: 01 intel processor architecture core


Pipeline Design - Explore Parallelism

A new instruction does not always depend on the previous one

• Can start new instruction before previous one is finished

• ...if different stages use different H/W resources

Run instructions in parallel (pipeline)

add eax, ebx → eax   F D E W

sub ecx, edx → ecx     F D E W

or  edi, esi → edi       F D E W

Need to balance pipe stages

• Each stage should take same time for best throughput and utilization

[Diagram: pipeline stages Fetch, Decode, Exec, WB]

Clock cycle is determined by the longest path!

[Diagram: three overlapping copies of the Fetch, Decode, Exec, WB stage sequence]
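The throughput argument on this slide can be sketched with a small idealized model. It assumes a stall-free pipeline in which every stage takes exactly one clock cycle; the function names and the stage delays are hypothetical.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """The first instruction fills the pipe, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def clock_period(stage_delays_ns):
    """The clock cycle is set by the slowest (longest-path) stage."""
    return max(stage_delays_ns)


# Three instructions through the four stages F, D, E, W.
print(unpipelined_cycles(3, 4), pipelined_cycles(3, 4))

# Hypothetical, unbalanced stage delays in nanoseconds.
print(clock_period([1.0, 1.5, 2.0, 1.0]))
```

In this model an unbalanced stage slows every cycle, which is the reason the slide asks for stages of equal duration.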

Page 20: 01 intel processor architecture core


Pipeline Design – Fighting Stalls

Data flow dependency (instructions output/input)

• Solved by bypasses, renaming etc

Control flow dependencies

• Solved by branch prediction

Others (Cache misses, long latency instructions)

• Solved by other dynamic scheduling techniques
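To make the data flow dependency case concrete, here is a minimal sketch that finds read-after-write dependencies in a short instruction list, using the deck's "op src1,src2 → dst" legend. The `Instr` type and `raw_dependencies` function are hypothetical helpers for illustration, not part of any Intel tool.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    src1: str
    src2: str
    dst: str


def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    last_writer = {}
    deps = []
    for idx, ins in enumerate(program):
        # dict.fromkeys removes a duplicated source register while keeping order.
        for src in dict.fromkeys((ins.src1, ins.src2)):
            if src in last_writer:
                deps.append((last_writer[src], idx, src))
        last_writer[ins.dst] = idx
    return deps


independent = [
    Instr("add", "eax", "ebx", "eax"),
    Instr("sub", "ecx", "edx", "ecx"),
    Instr("or", "edi", "esi", "edi"),
]
dependent = [
    Instr("add", "eax", "ebx", "eax"),
    Instr("sub", "ecx", "edx", "ecx"),
    Instr("or", "edi", "eax", "edi"),  # reads eax produced by instruction 0
]
print(raw_dependencies(independent))
print(raw_dependencies(dependent))
```

An empty result means the instructions can overlap freely; each reported pair is a point where the hardware needs a bypass or has to wait.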


Page 21: 01 intel processor architecture core


Race of CISC vs. RISC

In modern CPUs Advanced µ-Architecture Techniques minimize the advantages of RISC over CISC

• Branch Prediction

• Reduces the effect of extra pipeline stages

• Register Renaming

• Effectively Increase the Number of Registers

• Out Of Order

• Reduce Number of stalls caused by shortage of registers

• Speculative Execution

• Further Reduce Number of stalls

• Power saving features

• Reduce the overhead when not needed.

Page 22: 01 intel processor architecture core


µop – Intel’s Take on the CISC/RISC Race

(CISC) Instructions are translated into one or more (RISC) uops (micro-operations)

• Fixed format

• Wide and simple

• Temp registers

Usually one uop per instruction

A complex instruction can expand to thousands of uops

Stores are divided into two uops (STA and STD)

Fusion changes these counts

Page 23: 01 intel processor architecture core


Power and Energy

Maximum power (TDP):

• → Cooling requirements

• → Cooling solution

• → Computer form factor and acoustic noise

Average power

• → Battery life

• → Electricity bill

General calculation:

• P = frequency * voltage^2 * activity factor * capacitance + leakage
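The general calculation above can be tried out numerically. The sketch below is only an illustration: the function names and all electrical values are hypothetical, picked to show how strongly voltage affects the dynamic term.

```python
def dynamic_power(freq_hz, voltage, activity, capacitance_f):
    """Dynamic term: frequency * voltage^2 * activity factor * capacitance (watts)."""
    return freq_hz * voltage ** 2 * activity * capacitance_f


def total_power(freq_hz, voltage, activity, capacitance_f, leakage_w):
    """P = frequency * voltage^2 * activity factor * capacitance + leakage."""
    return dynamic_power(freq_hz, voltage, activity, capacitance_f) + leakage_w


# Hypothetical operating point: 2 GHz, 1.2 V, 20% activity, 100 nF switched capacitance.
base = dynamic_power(2e9, 1.2, 0.2, 1e-7)

# Lowering frequency and voltage together by 20% scales the dynamic term by 0.8^3.
scaled = dynamic_power(0.8 * 2e9, 0.8 * 1.2, 0.2, 1e-7)
print(f"{base:.1f} W -> {scaled:.1f} W (ratio {scaled / base:.3f})")
```

Because voltage enters squared, scaling it down along with frequency cuts dynamic power roughly with the cube of the scaling factor, while leakage is unaffected by this term.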

Reducing TDP

• Fewer transistors and wires

• Smaller transistors and wires

• Power features → less activity

• Low leakage transistors

Reducing average power

• Energy efficiency

• Power states

• Lower leakage

Page 24: 01 intel processor architecture core


Dual/Multi Core and SMT

Put more than one core per package

Architectural change:

• Software must be multi-threaded or multi-process

• …but backward compatible with multiprocessor systems (MP)

Several ways of implementing it

• All of them being used

[Diagram: alternative package layouts built from Core, LLC (last-level cache) and I/O blocks]

SMT: Run two (or more) threads on the same core, simultaneously

Page 25: 01 intel processor architecture core


Intel Approach

While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread-level parallelism.

[Timeline figure; legend: State, Execution Units, Cache, Bus]

• Q4 2000: Intel® Pentium®, 1 thread

• Q2 2003: Intel® Pentium® with HT, 2 threads

• Q2 2005: Intel® Pentium® D Processor, 2 threads

• Q3 2006: Intel® Core 2 Duo®, 2 threads

• Q4 2006: Intel® XQ6700*, 4 threads

• ?: 80 threads

Page 26: 01 intel processor architecture core


An “Acronym Cheat Sheet” of Parallel Computing

CMP: Chip Multi Processor (two or more cores per package)

• Dual Core: two cores in same package

• Quad Core: four cores in same package

DP: Dual Processor (two packages)

MP: Multi Processor (four or more packages)

SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)

Page 27: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

• Wide Dynamic Execution

• Smart Memory Access

• Advanced Smart Cache

• Advanced Digital Media Boost

• Intelligent Power Capability

Micro-architecture tour

Coding considerations

Page 28: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features

Intel® Wide Dynamic Execution

• 14-stage efficient pipeline

• Wider execution path

• Advanced branch prediction

• Macro-fusion

• Roughly ~15% of all instructions are conditional branches

• Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline

• Micro-fusion

• Merges the load and operation micro-ops into one fused micro-op

• 64-Bit Support

• Merom, Conroe, and Woodcrest support EM64T

[Block diagram labels: Instruction Fetch and PreDecode; Instruction Queue; Decode; uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; three execution ports (ALU/Branch, ALU/FAdd, ALU/FMul, each with MMX/SSE/FPmove); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB, up to 10.4 Gb/s; path widths marked 4, 4 and 5]
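The effect of macro-fusion on the micro-op count can be sketched with a deliberately simplified model. It assumes one micro-op per instruction and fuses any `cmp` or `test` that is immediately followed by a conditional jump; real hardware applies more restrictions than this, and the helper names below are hypothetical.

```python
COMPARES = {"cmp", "test"}


def is_conditional_jump(mnemonic):
    """Simplification: any 'j..' mnemonic other than the unconditional 'jmp'."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def count_uops(mnemonics, macro_fusion=True):
    """Count micro-ops, assuming one per instruction and one per fused compare+jump pair."""
    count = 0
    i = 0
    while i < len(mnemonics):
        fusible = (
            macro_fusion
            and mnemonics[i] in COMPARES
            and i + 1 < len(mnemonics)
            and is_conditional_jump(mnemonics[i + 1])
        )
        count += 1
        i += 2 if fusible else 1
    return count


loop_body = ["add", "cmp", "jne", "sub"]
print(count_uops(loop_body, macro_fusion=False), count_uops(loop_body, macro_fusion=True))
```

With conditional branches making up roughly 15% of instructions, as the slide states, each fused pair removes one micro-op from the pipeline.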

Page 29: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Smart Memory Access

• Improved prefetching

• Memory disambiguation

• Advance load before a possible data dependency (pointer conflict)

• Earlier loads hide memory latencies

Page 30: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Smart Cache

• Multi-core optimization

• Shared between the two cores

• Advanced Transfer Cache architecture

• Reduced bus traffic

• Both cores have full access to the entire cache

• Dynamic Cache sizing

Page 31: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Advantages of Shared Cache

[Diagram: CPU1 and CPU2 connected to Memory over the Front Side Bus (FSB); a cache line is shipped between the CPUs]

Shipping L2 cache line: ~half an access to memory

Page 32: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Advantages of Shared Cache (cont.)

[Diagram: CPU1 and CPU2 sharing one L2 cache, connected to Memory over the Front Side Bus (FSB)]

L2 is shared: no need to ship cache line

Page 33: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intel® Advanced Digital Media Boost

• Single Cycle SIMD Operation

• 8 Single Precision Flops/cycle

• 4 Double Precision Flops/cycle

• Wide Operations

• 128-bit packed Add

• 128-bit packed Multiply

• 128-bit packed Load

• 128-bit packed Store

• Support for Intel® EM64T instructions

[Figure: SIMD Operation (SSE/SSE2/SSE3/SSSE3) on 128-bit operands, bits 127 to 0. SOURCE lanes X4, X3, X2, X1 and Y4, Y3, Y2, Y1 produce DEST lanes X4opY4, X3opY3, X2opY2, X1opY1. Previous: an SSE/2/3 op spans CLOCK CYCLE 1 and CLOCK CYCLE 2. Core™ µarch: the whole op completes in CLOCK CYCLE 1.]
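The figure's idea, one operation applied to every lane of a packed register, can be mimicked in a few lines. This is only a sketch: `packed_op` and `cycles_for_128bit_op` are hypothetical helpers, and the 64-bit datapath used for the "previous" case is an assumption made to reproduce the two-cycle behavior shown in the figure.

```python
import operator


def packed_op(x, y, op):
    """Apply op lane by lane, like DEST[i] = X[i] op Y[i] in the figure."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of lanes")
    return [op(a, b) for a, b in zip(x, y)]


def cycles_for_128bit_op(datapath_bits):
    """Cycles needed to push a 128-bit operand through a datapath of the given width."""
    return -(-128 // datapath_bits)  # ceiling division


x = [1, 2, 3, 4]      # X1..X4
y = [10, 20, 30, 40]  # Y1..Y4
print(packed_op(x, y, operator.add))
print(cycles_for_128bit_op(64), cycles_for_128bit_op(128))
```

Under that assumption a full-width 128-bit datapath halves the cycle count per packed operation, which matches the single-cycle SIMD claim on this slide.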

Page 34: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features

Intel® Advanced Digital Media Boost

• Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)

• 16 new packed integer instructions

• Targeting video encode/decode

• Significantly improved strings

• REP MOVS and REP STOS

• ~8 bytes / cycle throughput

• mileage may vary

Page 35: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features

Intel® Advanced Digital Media Boost

• Supplemental SSE-3 (SSSE-3)

Packed SIGN: PSIGNB/W/D

Packed Shuffle Bytes: PSHUFB

Packed Multiply High with Round and Scale: PMULHRSW

Multiply and Add Packed Signed/Unsigned Bytes: PMADDUBSW

Packed Align Right: PALIGNR

Packed Absolute Values: PABSB, PABSW, PABSD

Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD

Page 36: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability

• Advanced power gating & Dynamic power coordination

• Multi-point demand-based switching

• Voltage-Frequency switching separation

• Supports transitions to deeper sleep modes

• Event blocking

• Clock partitioning and recovery

• Dynamic Bus Parking

• During periods of high performance execution, many parts of the chip core can be shut off

Page 37: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 38: 01 intel processor architecture core


Intel® Core® Micro-architecture Drill-down

[Block diagram labels: icache; branch prediction unit; predecode; instruction queue; instruction decode; MS; register alias table; ALLOC; Re-Order Buffer; Reservation Station; integer / FP / SIMD (3x); load; store address; store data; memory order buffer; data cache unit; page miss handler]

Page 39: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 40: 01 intel processor architecture core


Core® Micro-architecture Front End

Prepares instructions before they are executed

• Instruction Fetch Unit

• Instruction Queue

• Instruction Decode Unit

• Branch Prediction Unit

[Diagram: branch prediction unit, icache, predecode, instruction queue, instruction decode, MS]

Page 41: 01 intel processor architecture core


Instruction Queue

Intel® Core™ Microarchitecture – Front End

Buffer between instruction pre-decode unit and decoder

• Up to six predecoded instructions written per cycle

• 18 instructions contained in IQ

• Up to 5 instructions read from IQ

Potential Loop cache

Loop Stream Detector (LSD) support

• Re-use of decoded instruction

• Potential power saving

Page 42: 01 intel processor architecture core


Macro - Fusion

Roughly 15% of all instructions are conditional branches.

Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

Not supported in EM64T long mode

Intel® Core™ Microarchitecture – Front End

[Diagram: the fused micro-op cmpjae eax, [mem], label passes through the Scheduler, Execution and Branch Eval stages; flags and target go to Write back]

Page 43: 01 intel processor architecture core


Macro-Fusion Absent

Read four instructions from Instruction Queue

Each instruction gets decoded into separate uops

Enabling Example

for (int i=0; i<100000; i++) {

}

cmp eax, 100000

jge label

[Diagram: the Instruction Queue holds addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0; cmp eax, 100000; jge label. Decoders dec0 to dec3 take four instructions in Cycle 1; the remaining instruction goes to dec0 in Cycle 2.]

Intel® Core™ Microarchitecture – Front End

Page 44: 01 intel processor architecture core


Macro-Fusion Present

Read five instructions from Instruction Queue

Send fusable pair to single decoder

Single uop represents two instructions

Enabling Example

for (unsigned int i=0; i<100000; i++) {

}

cmp eax, 100000

jae label

[Diagram: the Instruction Queue holds addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0; cmp eax, 100000; jae label. In Cycle 1 decoders dec0 to dec3 take all five instructions, with the compare and branch fused as cmpjae eax, 100000, label.]

Intel® Core™ Microarchitecture – Front End
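The two enabling examples differ only in the signedness of the loop counter. The sketch below restates them as complete C functions; the function names are hypothetical, and whether a compiler actually emits cmp/jge for the first and cmp/jae for the second depends on the compiler and its options, so treat this as an illustration of the source-level choice rather than a guarantee of fusion.

```c
#include <assert.h>
#include <stddef.h>

/* Signed counter: the loop test is typically a signed compare
   (cmp followed by jge or jl), the non-fused case on the slide. */
static long sum_signed(const int *data, int n)
{
    assert(data != NULL || n <= 0);
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Unsigned counter: the loop test is typically an unsigned compare
   (cmp followed by jae or jb), the fusable case on the slide. */
static long sum_unsigned(const int *data, unsigned int n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (unsigned int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Both functions compute the same result; only the branch condition in the generated code is expected to differ.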

Page 45: 01 intel processor architecture core


Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

Intel® Core™ Microarchitecture – Front End

Page 46: 01 intel processor architecture core


Instruction Decode / Micro-Op Fusion (cont.)

u-ops of a store “movps [EAX+240], xmm0”

• sta eax+240 (store address)

• std xmm0, [eax+240] (store data)

Fused into a single u-op: st xmm0, [eax+240]

Intel® Core™ Microarchitecture – Front End

Page 47: 01 intel processor architecture core


Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:

• Indirect Branch Predictor

• Loop Detector

Branch mispredictions reduced by >20%

Intel® Core™ Microarchitecture – Front End

Page 48: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 49: 01 intel processor architecture core


Core® Micro-architecture Execution Core

Accepts decoded u-ops, assigns resources, executes and retires u-ops

• Renamer

• Reservation station (RS)

• Issue ports

• Execution Unit

[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution units (integer, FP, SIMD, 3x), load, store address, store data]

Page 50: 01 intel processor architecture core


Execution Core Building Blocks

Renamer, RS, ROB

Execution Unit, with ports (number):

• Floating Point: 0,1,5

• Integer: 0,1,5

• SIMD Integer: 0,1,5

• SIMD/Integer MUL

Memory Sub-system, with ports (number):

• Load: 2

• Store: 3,4

Intel® Core™ Microarchitecture – Execution Core

Page 51: 01 intel processor architecture core


Issue Ports and Execution Units

6 dispatch ports from RS

• 3 execution ports (shared for integer / fp / simd)

• load

• store (address)

• store (data)

128-bit SSE implementation

• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)

• Port 1 has packed add (3 cycles, all precisions)

Intel® Core™ Microarchitecture – Execution Core

Page 52: 01 intel processor architecture core


Retirement Unit

ReOrder Buffer (ROB)

• Holds micro-ops in various stages of completion

• Buffers completed micro-ops

• updates the architectural state in order

• manages ordering of exceptions

Intel® Core™ Microarchitecture – Execution Core

[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station]

Page 53: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 54: 01 intel processor architecture core


Core® Micro-architecture Memory Sub-System

Memory Ordering Buffer

• Store Address Buffer (SAB)

  • Stores the address of each store not yet performed

  • Loads compare their address to any store older than themselves

  • If it finds a hole…

• Store Data Buffer

  • Stores the data of each store not yet performed

  • If a load hits in the SAB, the data is forwarded from here

• Load Buffer

  • Stores the address of non-retired loads

  • For snoops and re-dispatch

• One 128-bit load and one 128-bit store per cycle to different memory locations

• Out of order Memory operations
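The address compare and data forward described above can be sketched as a toy software model. Everything here is an assumption made for illustration: the PendingStore type, the function name, the exact-address match and the 32-bit data width are hypothetical and do not describe the real buffer organization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one pending store: the address plays the role of
   a Store Address Buffer entry, the data that of a Store Data Buffer entry. */
typedef struct {
    uintptr_t addr;
    uint32_t  data;
} PendingStore;

/* stores[0..count-1] are the stores older than the load, oldest first.
   The load takes its data from the youngest older store to the same
   address; returns false if no pending store matches. */
static bool forward_from_stores(const PendingStore *stores, size_t count,
                                uintptr_t load_addr, uint32_t *out)
{
    assert(out != NULL);
    for (size_t i = count; i > 0; i--) {
        if (stores[i - 1].addr == load_addr) {
            *out = stores[i - 1].data;
            return true;
        }
    }
    return false;
}
```

When the function returns false, the load in this model would read its data from the cache or memory instead.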

Page 55: 01 intel processor architecture core


Core® Micro-architecture Memory Sub-System (cont.)

32k D-Cache (8-way, 64 byte line size)

Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache

Cache to cache transfer

• improves producer / consumer style MP

Wider interface to L2

• reduced interference

• processor line fill is 2 cycles

Higher bandwidth from the L2 cache to the core

• ~14 clock latency and 2 clock throughput

Load & Store Access order

1. L1 cache of immediate core

2. L1 cache of the other core

3. L2 cache

4. Memory

[Diagram: Core1 and Core2 share a 2 MB L2 Cache, which connects to the Bus]

Intel® Core™ Microarchitecture – Memory Sub-system
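The D-cache figures on this slide (32k, 8-way, 64 byte lines) imply 32768 / (8 × 64) = 64 sets. The short sketch below works through that arithmetic; the helper names are hypothetical, and the simple modulo set index is an assumption used for illustration, not a statement about the actual indexing hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of sets = capacity / (associativity * line size). */
static size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}

/* Assumed mapping: consecutive lines go to consecutive sets, wrapping around. */
static size_t cache_set_index(uintptr_t addr, size_t sets, size_t line_bytes)
{
    assert(sets > 0 && line_bytes > 0);
    return (size_t)((addr / line_bytes) % sets);
}
```

Under this model, two addresses that are a multiple of 64 × 64 = 4096 bytes apart fall into the same set.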

Page 56: 01 intel processor architecture core


Advanced Memory Access / Enhanced Data Pre-fetch Logic

Speculates the next needed data and loads it into cache by HW and/or SW

Main Parking Lot (External Memory)

Valet Parking Area (L2 Cache)

Intel® Core™ Microarchitecture – Memory Sub-system

Door (L1 Cache)

Page 57: 01 intel processor architecture core


Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)

• L1D cache prefetching

  • Data Cache Unit Prefetcher

    • Known as the streaming prefetcher

    • Recognizes ascending access patterns in recently loaded data

    • Prefetches the next line into the processor's cache

  • Instruction Based Stride Prefetcher

    • Prefetches based upon a load having a regular stride

    • Can prefetch forward or backward 2 Kbytes

    • 1/2 default page size

• L2 cache prefetching: Data Prefetch Logic (DPL)

  • Prefetches data to the 2nd level cache before the DCU requests the data

  • Maintains 2 tables for tracking loads

    • Upstream – 16 entries

    • Downstream – 4 entries

  • Every load is either found in the DPL or generates a new entry

  • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load

Intel® Core™ Microarchitecture – Memory Sub-system
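The idea of prefetching on a regular stride can be sketched as a toy software model. The StrideEntry type, the function name and the "predict once the same stride has been seen twice" rule are assumptions made for illustration; only the 2 Kbyte reach is taken from the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-load tracking state; zero-initialize before use. */
typedef struct {
    uintptr_t last_addr;  /* address of the previous access            */
    intptr_t  stride;     /* distance between the two previous accesses */
    int       seen;       /* accesses observed so far, capped at 2     */
} StrideEntry;

/* Records one access. Returns true and sets *next when the new access
   repeats the previous stride and that stride is within +/- 2 Kbytes. */
static bool stride_observe(StrideEntry *e, uintptr_t addr, uintptr_t *next)
{
    assert(e != NULL && next != NULL);
    bool predict = false;
    if (e->seen > 0) {
        intptr_t s = (intptr_t)(addr - e->last_addr);
        if (e->seen >= 2 && s == e->stride && s != 0 &&
            s >= -2048 && s <= 2048) {
            *next = addr + (uintptr_t)s;
            predict = true;
        }
        e->stride = s;
    }
    e->last_addr = addr;
    if (e->seen < 2) {
        e->seen++;
    }
    return predict;
}
```

Feeding the model addresses 64 bytes apart produces a prediction on the third access; an irregular jump resets the learned stride.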

Page 58: 01 intel processor architecture core


Advanced Memory Access / Memory Disambiguation

Intel® Core™ Microarchitecture – Memory Sub-system

Memory Disambiguation predictor

• Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible

• increasing the performance of OOO memory pipelines

Disambiguated loads checked at retirement

• Extension to existing coherency mechanism

• Invisible to software and system
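At source level, the situation arises whenever a load follows stores through pointers that may or may not alias. The sketch below mirrors the Store1 Y, Load2 Y, Store3 W, Load4 X sequence used in this deck's illustration; the function name and the constants are hypothetical, and a compiler is free to keep values in registers or reorder independent accesses, so this shows the program-order pattern only.

```c
#include <assert.h>
#include <stddef.h>

/* y, w and x may point to the same or to different locations; the
   hardware cannot know until the addresses have been computed. */
static void store_load_sequence(int *y, int *w, const int *x, int out[2])
{
    assert(y != NULL && w != NULL && x != NULL && out != NULL);
    *y = 10;        /* Store1 Y */
    out[0] = *y;    /* Load2 Y: same address as Store1                     */
    *w = 20;        /* Store3 W */
    out[1] = *x;    /* Load4 X: independent of both stores unless x aliases */
}
```

If x does alias w, the result must reflect Store3, which is why a load that was scheduled early on a wrong prediction has to be detected and redone.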

Page 59: 01 intel processor architecture core


Advanced Memory Access / Memory Disambiguation Absent

Intel® Core™ Microarchitecture – Memory Sub-system

Load4 must WAIT until previous stores complete

[Diagram: Memory holds Data W, Data X, Data Y and Data Z; the operations shown are Store1 Y, Load2 Y, Store3 W and Load4 X]

Page 60: 01 intel processor architecture core


Advanced Memory Access / Memory Disambiguation Present

Intel® Core™ Microarchitecture – Memory Sub-system

Loads can decouple from stores

Load4 can get its data WITHOUT waiting for stores

[Diagram: Memory holds Data W, Data X, Data Y and Data Z; the operations shown are Store1 Y, Load2 Y, Store3 W and Load4 X]

Page 61: 01 intel processor architecture core


Advanced Memory Access / Stores Forwarding

If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load

Intel® Core™ Microarchitecture – Memory Sub-system

[Diagram: Store1 Y places Data Y in the Internal Buffers on its way to Memory; Load2 Y receives Data Y from the Internal Buffers]

Page 62: 01 intel processor architecture core


Advanced Memory Access / Stores Forwarding: Aligned Store Cases

[Diagram: for each aligned store, the aligned loads that fit within it]

• store 128 bit: load 128 bit; 2 × load 64 bit; 4 × load 32 bit; 8 × load 16; 16 × ld 8

• store 64 bit: load 64 bit; 2 × load 32 bit; 4 × load 16; 8 × ld 8

• store 32 bit: load 32 bit; 2 × load 16; 4 × ld 8

• store 16: load 16; 2 × ld 8

Page 63: 01 intel processor architecture core


Advanced Memory Access / Stores Forwarding: Unaligned Cases

Note that unaligned store forward does not occur when the load crosses a cache line boundary

[Diagram: for each unaligned store, the loads that fit within it; loads marked ‡ are the first load of each size]

• store 64 bit: load 64 bit; load 32 bit‡, load 32 bit; load 16‡, 3 × load 16; 8 × ld 8

• store 32 bit: load 32 bit‡; load 16‡, load 16; 4 × ld 8

• store 16: load 16‡; 2 × ld 8

Legend: Store forwarded to load; No forwarding

‡: No forwarding if the load crosses a cache line boundary

Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding

Page 64: 01 intel processor architecture core


Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture tour

Coding considerations

Page 65: 01 intel processor architecture core


Optimizing for Instruction Fetch and PreDecode

Avoid “Length Changing Prefixes” (LCPs)

• Affects instructions with immediate data or offset

• Operand Size Override (66H)

• Address Size Override (67H) [obsolete]

• LCPs change the length decoding algorithm – increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)

• The REX (EM64T) prefix (4xH) is not an LCP

• The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
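In C, 16-bit stores of constants are one place where a compiler may emit an instruction carrying both the 66H prefix and a 16-bit immediate. The sketch below contrasts two source forms that produce the same result; the function names are hypothetical, and which instructions are actually generated depends entirely on the compiler and its tuning options, so the generated code should be inspected rather than assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two separate 16-bit constant stores: a compiler may encode each one
   as a 16-bit move with an immediate operand (66H prefix plus imm16). */
static void set_pair_16(uint16_t out[2])
{
    assert(out != NULL);
    out[0] = 0x1234;
    out[1] = 0x5678;
}

/* Same result copied from a constant table: gives the compiler the
   option of a single wider move without a 16-bit immediate. */
static void set_pair_table(uint16_t out[2])
{
    static const uint16_t pattern[2] = { 0x1234, 0x5678 };
    assert(out != NULL);
    memcpy(out, pattern, sizeof pattern);
}
```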

Page 66: 01 intel processor architecture core


Optimizing for Instruction Queue

Includes a “Loop Stream Detector” (LSD)

• Potentially very high bandwidth instruction streaming

• A number of requirements to make use of the LSD

• Maximum of 18 instructions in up to four 16-byte packets

• No RET instructions (hence, little practical use for CALLs)

• Up to four taken branches allowed

• Most effective at 70+ iterations

• LSD is after PreDecode so there is no added cost for LCPs

• Trade-off LSD with conventional loop unrolling
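A short, call-free inner loop is the kind of code these requirements favor. The sketch below is a hypothetical example of such a loop; whether the compiled loop actually stays within 18 instructions and four 16-byte packets depends on the compiler, its options and any unrolling it applies.

```c
#include <assert.h>
#include <stddef.h>

/* Small loop body: a load, a multiply, an add, an index update and one
   backward branch, with no CALL or RET inside the loop. */
static float dot(const float *a, const float *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Heavy unrolling makes the loop body larger, which is the trade-off against the LSD noted above.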

Page 67: 01 intel processor architecture core


Optimizing for Decode

Decoder issues up to 4 uOps for renaming/ allocation per clock

• This creates a trade off between more complex instruction uOps versus multiple simple instruction uOps

• For example, a single four uOp instruction is all that can be renamed/allocated in a single clock

• In some cases, multiple simple instructions may be a better choice than a single complex instruction

• Single uOp instructions allow more decoder flexibility

• For example, 4-1-1-1 can be decoded in one clock

• However, 2-2-2-1 takes three clocks to decode
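The two examples above follow from a simple rule: one decoder accepts an instruction of up to four uOps, and the other three accept single-uOp instructions only. The toy model below counts decode clocks under exactly that assumption; the function name is hypothetical, and the model ignores everything else in the front end (fetch bandwidth, fusion, microcoded instructions).

```c
#include <assert.h>
#include <stddef.h>

/* uops[i] is the uOp count (1..4) of the i-th instruction in program order.
   Each clock, decoder 0 takes the next instruction; decoders 1 to 3 then
   take following instructions only while each is a single uOp. */
static unsigned decode_clocks(const unsigned *uops, size_t count)
{
    unsigned clocks = 0;
    size_t i = 0;
    while (i < count) {
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;  /* decoder 0 */
        for (int d = 1; d <= 3 && i < count && uops[i] == 1; d++) {
            i++;  /* decoders 1 to 3 */
        }
        clocks++;
    }
    return clocks;
}
```

In this model the sequence 4-1-1-1 decodes in one clock, while 2-2-2-1 needs three because each two-uOp instruction has to wait for decoder 0.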

Page 68: 01 intel processor architecture core


Optimizing for Execution

Up to six uOps can be dispatched per clock

• “Store Data” and “Store Address” dispatch ports are combined on the block diagram

Up to four results can be written back per clock

Single clock latency operations are best

• Differing latency operations can create writeback conflicts

• Separate multiple-clock uOps with several single uOp instructions

• Typical instructions here: ADC/SBB, RMW, CMOVcc

• In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)

When equivalent, PS preferred to PD (LCP)

• For example, MOVAPS over MOVAPD, XORPS over XORPD

Page 69: 01 intel processor architecture core


Optimizing for Execution (cont.)

Bypass register “access” preferred to register reads

Partial register accesses often lead to stalls

• Register size access that ‘conflicts’ with recent previous register write

• Partial XMM updates subject to dependency delays

• Partial flag stall can occur, too (much higher cost)

• Use TEST instruction between shift and conditional to prevent

• Common zeroing instructions (e.g., XOR reg,reg) don’t stall

Avoid bypass between execution domains

• For example: FP (ADDPS) and logical ops (PAND) on XMMn

Vectorization: careful packing/unpacking sequence

• Use MXCSR’s FZ and DAZ controls as appropriate

Page 70: 01 intel processor architecture core


Optimizing for Memory

Software prefetch instructions

• Can reach beyond a page boundary (including page walk)

• Prefetches only when it completes without an exception

General techniques to help these prefetchers

• Organize data in consecutive lines

• In general, increasing addresses are more easily prefetched
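From C, a software prefetch can be requested with the GCC/Clang builtin `__builtin_prefetch`. The sketch below walks an array at increasing addresses and prefetches a fixed distance ahead; the function name and the distance of 16 elements are assumptions for illustration, and a useful distance has to be found by measurement. For a plain sequential walk like this one the hardware prefetchers may already suffice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. */
#define PREFETCH_AHEAD 16

static long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```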

Page 71: 01 intel processor architecture core


Summary

What has been covered

• Notable features of Core® Micro-architecture

• Wide Dynamic Execution

• Advanced Memory Access

• Advanced Smart Cache

• Advanced Digital Media Boost

• Power Efficient Support

• Core® Micro-architecture components

• Front End

• OOO execution core

• Memory sub-system

Page 72: 01 intel processor architecture core


Page 73: 01 intel processor architecture core


Platform

Intel provides most of the silicon on any computer

Classical platform partition

• CPU – Computation

• MCH – high speed IO

• ICH – low speed IO

Graphics speed and memory latencies will require different partition

This presentation focuses on the core microarchitecture

[Platform block diagram. Labels: CPU (Core, Core, LLC), FSB, MCH, DMI, ICH, MEM / DDR, Graphics, PCIe, PEG, Display, Analog, TV out, HD video, ME, PCI (IO), SATA, USB, KBRD, Wireless, Legacy & Debug I/O, others]

Page 74: 01 intel processor architecture core


Intel® 64 = Extending IA-32 to 64 Bit

Added to Intel Xeon™ and Pentium® 4 Processor in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture

With 64-Bit Extension Technology:

• Additional Registers: 8 SSE & 8 General Purpose

• Double Precision (64-bit) Integer Support

• Extended Memory Addressability: 64-Bit Pointers, Registers

Page 75: 01 intel processor architecture core


Intel® 64 - New Modes of Operation

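Legacy Mode (IA32 Mode)

• OS Req'd: Legacy 32-bit or 16-bit OS

• Compile required: No

• Defaults: Addr Size 32 or 16, Operand Size 32 or 16

• GPR Width: 32

• New Features: New Regs No, RIP Rel. No, 64-bit IP No

Long Mode: Compatibility Mode

• OS Req'd: New 64-bit OS

• Compile required: No

• Defaults: Addr Size 32 or 16, Operand Size 32 or 16

• GPR Width: 32

• New Features: New Regs No, RIP Rel. No, 64-bit IP Yes

Long Mode: 64-bit Mode

• OS Req'd: New 64-bit OS

• Compile required: Yes

• Defaults: Addr Size 64, Operand Size 32

• GPR Width: 64

• New Features: New Regs Yes, RIP Rel. Yes, 64-bit IP Yes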

Page 76: 01 intel processor architecture core


Registers : Extensions and Additions

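[Register diagram]

• General purpose registers: RAX, RBX, RCX, RDX, RBP, RSI, RDI and RSP extend EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP from bits 31..0 to bits 63..0

• New general purpose registers: R8 to R15

• SSE registers: XMM0 to XMM7, plus new XMM8 to XMM15 (bits 127..0)

• Instruction pointer: RIP extends EIP

• X87/MMX registers (bits 79..0)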

Page 77: 01 intel processor architecture core


Registers : Availability in different modes

Page 78: 01 intel processor architecture core


64-bit Mode of Operation

Default data size is 32-bits

• Override to 64-bits using new REX prefix

All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable

REX prefixes

• A family of 16 prefixes, encoded 0x40-0x4F

• Allows the use of general purpose registers as 64-bits

• Allows the use of new registers (like r8-r15)

Instructions that set a 32 bit register automatically zero extend the upper 32-bits

Page 79: 01 intel processor architecture core


REX Prefix

A new instruction-prefix byte used in 64-bit mode

• Specify the new GPRs and SSE registers

• Specify a 64-bit operand size.

• Specify extended control registers (used by system software)

An instruction can have only one REX prefix; if used, it must immediately precede the opcode or the two-byte opcode escape prefix.

The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
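As an illustration of how the 16 REX encodings (0x40-0x4F) carry their information, the sketch below splits a REX byte into its four flag bits (layout 0100WRXB). The type and function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a REX prefix byte (bit layout 0100WRXB). */
typedef struct {
    bool w; /* 1 = 64-bit operand size */
    bool r; /* extends the ModRM.reg field */
    bool x; /* extends the SIB.index field */
    bool b; /* extends ModRM.rm, SIB.base, or the opcode register field */
} rex_fields;

/* True if the byte lies in the REX range 0x40-0x4F. */
bool is_rex(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}

/* Splits a REX byte into its four flag bits. */
rex_fields decode_rex(uint8_t byte)
{
    assert(is_rex(byte));
    rex_fields f = {
        .w = (byte >> 3) & 1,
        .r = (byte >> 2) & 1,
        .x = (byte >> 1) & 1,
        .b = byte & 1,
    };
    return f;
}

/* Combines a 3-bit register field with its REX extension bit (result 0..15). */
unsigned reg_number(unsigned field3, bool ext)
{
    return (ext ? 8u : 0u) | (field3 & 7u);
}
```

For example, the byte sequence 48 01 C8 encodes add rax, rcx (REX.W only), while 4D 01 C8 sets W, R and B and therefore encodes add r8, r9: the same ModRM byte, with each 3-bit register field extended by one REX bit.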

Page 80: 01 intel processor architecture core


Physical and Linear Addressing

Linear Addressing

• Initial Intel® 64 implementations support 48 bits of virtual addressing.

• Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.

Physical Addressing

• Initial Netburst™ Intel® 64 implementations supported 36 bits; all current processors support at least 40 bits

• Entries in page tables expanded for up to 52 bits of physical address.

Page 81: 01 intel processor architecture core


Intel®64 - Large Memory Considerations

Canonical addressing for 64 bit addresses

• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits

• Canonical address definition: an address that has bits 63 through 47 set to either all ones or all zeros

• Canonical addresses are a requirement

• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
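The canonical-form rule can be checked with a few bit operations. The following is a minimal sketch assuming 48 implemented virtual-address bits, as stated above; the names VA_BITS, is_canonical and make_canonical are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of implemented virtual-address bits (assumption: 48, as on the slide). */
#define VA_BITS 48

/* True if bits 63..(VA_BITS-1) are all zeros or all ones. */
bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> (VA_BITS - 1);                   /* bits 63..47 */
    uint64_t all_ones = (1ULL << (64 - VA_BITS + 1)) - 1;     /* 17 one-bits */
    return upper == 0 || upper == all_ones;
}

/* Sign-extends bit (VA_BITS-1) into the upper bits, yielding a canonical address. */
uint64_t make_canonical(uint64_t addr)
{
    uint64_t low_mask = (1ULL << VA_BITS) - 1;
    uint64_t low = addr & low_mask;
    if (low & (1ULL << (VA_BITS - 1)))
        low |= ~low_mask;
    return low;
}
```

With 48 bits, the canonical ranges are 0x0000000000000000 to 0x00007FFFFFFFFFFF and 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF; everything in between is non-canonical.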


Page 82: 01 intel processor architecture core


Introducing SIMD: Single Instruction Multiple Data

Scalar processing

• traditional mode

• one operation produces one result

[Figure: X + Y → X + Y]

SIMD processing

• with SSE / SSE2

• one operation produces multiple results

[Figure: (x3, x2, x1, x0) + (y3, y2, y1, y0) → (x3+y3, x2+y2, x1+y1, x0+y0)]
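The difference between the two modes can be sketched in C with the SSE intrinsics from xmmintrin.h. This is only an illustration: the function names are hypothetical, and the SIMD version assumes the element count is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Scalar processing: each addition produces one result. */
void add_scalar(const float *x, const float *y, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

/* SIMD processing: each packed add produces four results.
   Assumption for this sketch: n is a multiple of 4. */
void add_sse(const float *x, const float *y, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);           /* x[i..i+3] */
        __m128 vy = _mm_loadu_ps(y + i);           /* y[i..i+3] */
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
}
```

The unaligned load and store intrinsics are used so that the sketch makes no alignment assumptions about the arrays.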

Page 83: 01 intel processor architecture core


X86 Register Sets

SSE registers were first introduced in the Pentium® 3

SSE Registers (128 bits; xmm0 … xmm7)

• Eight 128-bit registers

• Hold data only:

• 4 x single FP numbers

• 2 x double FP numbers

• 128-bit packed integers

• Direct access to the registers

• Use simultaneously with FP / MMX Technology

MMX™ Technology / IA-FP Registers (80/64 bits; st0 … st7 / mm0 … mm7)

• Eight 80/64-bit registers

• Hold data only

• Stack access to FP0..FP7

• Direct access to MM0..MM7

• No MMX™ Technology / FP interoperability

IA-INT Registers (32 bits; eax … edi)

• Fourteen 32-bit registers

• Scalar data & addresses

• Direct access to regs

Page 84: 01 intel processor architecture core


Beginning in 2008: ~50 new instructions in 13 groups

All function in 32-bit and 64-bit modes

Improvements in Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance

New Instructions Added to Intel® Processors

[Chart: Instruction Set Extensions. Number of new instructions per extension, with introduction date and process technology]

• MMX™: 56 instructions, Jan-97, 350 nm

• Streaming SIMD Extensions (SSE): 70 instructions, Feb-99, 250 nm

• Streaming SIMD Extensions 2 (SSE2): 144 instructions, Dec-00, 180 nm

• Streaming SIMD Extensions 3 (SSE3): 13 instructions, Feb-04, 90 nm

• Supplemental SSE3 (SSSE3): 32 instructions, Jul-06, 65 nm

• Future Intel instruction set extensions (SSE-4): ~50 instructions, 2008+, 45 nm

Page 85: 01 intel processor architecture core


SSE and SSE-2 Data Types

SSE

• 4x floats

SSE-2

• 16x bytes

• 8x 16-bit shorts

• 4x 32-bit integers

• 2x 64-bit integers

• 1x 128-bit(!) integer

• 2x doubles

Page 86: 01 intel processor architecture core


2001 PTE Engineering Enabling Conference


SSE-Instructions Set Extensions

Introduced by Pentium® 3 in 1999; now frequently called SSE-1

Only new data type supported: 4x32Bit (Single Precision) floating point data

Some 70 instructions

• Arithmetic, compare, convert operations on SSE SP FP data

• PACKED, UNPACKED

• Data load/store

• Prefetch

• Extension of MMX

• Streaming Store (store without using cache in between)

• …

Page 87: 01 intel processor architecture core


SSE Sample: Branch Removal

R = (A < B) ? C : D //remember: everything packed

A:            0.0     0.0    -3.0     3.0
B:            0.0     1.0    -5.0     5.0
cmplt:        00000   11111   00000   11111

and with C:   c3      c2      c1      c0
result:       00000   c2      00000   c0

nand with D:  d3      d2      d1      d0
result:       d3      00000   d1      00000

or:           d3      c2      d1      c0
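The compare/and/nand/or sequence in the figure can be written with SSE intrinsics roughly as follows. This is a sketch, not code from the course: select_lt is a hypothetical name, and the "nand" step is implemented with _mm_andnot_ps, which computes (NOT mask) AND d.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Per element: r = (a < b) ? c : d, computed without any branch. */
__m128 select_lt(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask   = _mm_cmplt_ps(a, b);     /* all ones where a < b, else all zeros */
    __m128 from_c = _mm_and_ps(mask, c);    /* keeps c where the mask is set */
    __m128 from_d = _mm_andnot_ps(mask, d); /* keeps d where the mask is clear */
    return _mm_or_ps(from_c, from_d);       /* merge the two halves */
}
```

With the values from the figure (A = 0.0, 0.0, -3.0, 3.0 and B = 0.0, 1.0, -5.0, 5.0, highest element first), the result is d3, c2, d1, c0.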

Page 88: 01 intel processor architecture core


SSE-2 Instructions Set Extensions

Introduced by Intel® Pentium®4 processor in 2000

Some 140 new instructions

Added double precision floating point data (2x64Bit) and all related instructions including conversion

Again some extensions to MMX

Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations

Page 89: 01 intel processor architecture core


SIMD Single vs. SIMD Double

4 x Single Precision: SSE-1

• SIMD SP FP Operand = 4 Elements: X3 X2 X1 X0 (bits 127..0)

• Element = SP FP Number: S (bit 31), Exponent (bits 30..23), Significand (bits 22..0)

2 x Double Precision: SSE-2

• SIMD DP FP Operand = 2 Elements: X1 X0 (bits 127..0)

• Element = DP FP Number: S (bit 63), Exponent (bits 62..52), Significand (bits 51..0)

Page 90: 01 intel processor architecture core


Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion

SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared

x:   x1 | x0
ix:  0 | 0 | (int)x1 | (int)x0

__m128d x;
__m128i ix;
ix = _mm_cvtpd_epi32(x);

SIMD Int → SIMD Double: conversion from two lower ints

ix:  ? | ? | ix1 | ix0
x:   (double)ix1 | (double)ix0

x = _mm_cvtepi32_pd(ix);

Page 91: 01 intel processor architecture core


SSE3: No new Data Types but new Instructions

• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS

• Thread Synchronization: MONITOR, MWAIT

• Video encoding: LDDQU

• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP

• FP to integer conversions: FISTTP

* Also benefits Complex and Vectorization

Page 92: 01 intel processor architecture core


Streaming SIMD Extensions 3: 13 new instructions

Three have limited use for application performance improvement

• FISTTP - X87 to integer conversion (requires –longdouble switch)

• MONITOR/MWAIT - thread synchronization

• Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

The other ten have some potential for specific application domains

Page 93: 01 intel processor architecture core


SSE-3 Sample Complex Arithmetic: ADDSUBPS

ADDSUBPS OperandA OperandB

• OperandA (xmm register; 4 data elements)

• a3, a2, a1, a0

• OperandB (xmm register or memory address; 4 data elements)

• b3, b2, b1, b0

• Result (Stored in OperandA)

• a3+b3, a2-b2, a1+b1, a0-b0

__m128 _mm_addsub_ps(__m128 a, __m128 b)

OperandA:  a3      a2      a1      a0
OperandB:  b3      b2      b1      b0
AddSub:    Add     Sub     Add     Sub
Result:    a3+b3   a2-b2   a1+b1   a0-b0
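To make the alternating add/subtract pattern concrete, here is a plain-C reference model of ADDSUBPS together with the complex multiplication it is typically used for. It is a scalar emulation for illustration, not the intrinsic itself; both function names are hypothetical, and array index 0 corresponds to the lowest element (a0).

```c
#include <assert.h>

/* Scalar model of ADDSUBPS: even elements subtract, odd elements add. */
void addsubps_model(const float a[4], const float b[4], float out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Multiplies two pairs of complex numbers stored as (re, im, re, im). */
void complex_mul2(const float x[4], const float y[4], float out[4])
{
    float t1[4], t2[4];
    for (int i = 0; i < 4; i += 2) {
        t1[i]     = x[i] * y[i];         /* xr*yr */
        t1[i + 1] = x[i] * y[i + 1];     /* xr*yi */
        t2[i]     = x[i + 1] * y[i + 1]; /* xi*yi */
        t2[i + 1] = x[i + 1] * y[i];     /* xi*yr */
    }
    /* re = xr*yr - xi*yi, im = xr*yi + xi*yr */
    addsubps_model(t1, t2, out);
}
```

The subtract-in-even, add-in-odd layout matches the real and imaginary parts of a complex product, which is why the instruction is listed under complex arithmetic.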

Page 94: 01 intel processor architecture core


Sample SSSE-3 Inst.: Byte Permute

PSHUFB mm, mm/m64

PSHUFB xmm, xmm/m128

• A complete byte-granularity permutation

• The source operand is used as the control field (variable control)

• The destination operand gets permuted

• Each byte of the source field selects the origin of the corresponding destination byte

• Also includes force-byte-to-zero flag (bit 7)

Example (64-bit form; byte 7 on the left, byte 0 on the right):

dest (before):  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
src (control):  0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (after):   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
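The permutation rule can be expressed as a short scalar reference model. This is an emulation in plain C for illustration (pshufb_model is a hypothetical name); index 0 is the lowest byte, and n is 8 for the mm form or 16 for the xmm form.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB: dest is permuted in place under control of src. */
void pshufb_model(uint8_t *dest, const uint8_t *src, int n)
{
    uint8_t original[16];
    assert(n == 8 || n == 16);
    for (int i = 0; i < n; i++)
        original[i] = dest[i];
    for (int i = 0; i < n; i++) {
        if (src[i] & 0x80)
            dest[i] = 0;                          /* bit 7 set: force byte to zero */
        else
            dest[i] = original[src[i] & (n - 1)]; /* low 3 or 4 bits select the origin byte */
    }
}
```

Feeding the model the 64-bit example values from the slide (written lowest byte first) reproduces the resulting destination shown there.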

Page 95: 01 intel processor architecture core


Ways to SSE/SIMD programming

Coding using SSE/SSE2/3/4 assembler instructions

• Very tedious (manually schedule) – discouraged: Don’t do it!

• E.g.: How do you exploit the benefits of having now 16 instead of 8 SSE registers for Intel® 64 without maintaining two versions?

Intel® compiler’s C/C++ SIMD intrinsics

• No need to take care of register allocation, scheduling etc

Intel® compiler’s C++ Vector Class Library

• Use this if you are heavy into C++ classes

Vectorizer of Intel® C++ and Fortran Compilers

• Recommended for most cases – easy and efficient

Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)

Page 96: 01 intel processor architecture core


Compiler Based Vectorization: Processor Specific

Linux* switch: Generate Code and Optimize for

• -xP, -axP: Intel® processors with SSE3 capability including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3

• -xN, -axN: Pentium® 4 processors in 32 bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead)

• -xK, -axK: Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE

• -xW, -axW: Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2

• -xT, -axT: Intel® processors with MNI capability: Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI

• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE-2

Page 97: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

New Instructions

Instruction name: Description

• psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.

• pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.

• phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.

• phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.

• PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed-words. (Multiply Accumulate)

• PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128: Signed 16 bits multiply, return high bits.

• PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.

• PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8: Extract any continuous 16 (8 in the 64 bit case) bytes from the pair [dst, src] and store them to the dst register.

Page 98: 01 intel processor architecture core


Dependencies and Bypasses

“Read-after-Write” Dependency - 1 clock stall assuming register file can be written-through

add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D D E W

“E to D” Bypass - save clock penalty

add eax, ecx → eax      F D E W
sub ebx, eax → ebx        F D E W

Long Latency operations

load [ecx+edi] → eax    F D E E E W
add ebx, eax → ebx        F D D D E W

Page 99: 01 intel processor architecture core


Fighting Stalls: Branch Handling

Given the code:

for (i=100, a=0; i>0; i--) a+=B[i];

Compiler would generate

// eax initialized with zero, edi initialized with 100

loop: load B[edi] → ebx // read B[i] from memory
      add eax, ebx → eax // a+=B[i]
      add edi, -1 → edi // i-=1
      jnz edi, loop
      store eax → a // store result

Page 100: 01 intel processor architecture core


Fighting Stalls: Branch Handling (cont.)

load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F D E W
store eax → a                 F D E W
xxx                             F D E W
load B[edi] → ebx                 F D E W

Only after the branch's Execute stage do we know that the next fetch was wrong

• Need to flush the pipe

• IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)

• ‘Pipe break’ penalty = 2 clocks

• Adding a stage?: IPC = 0.57, ~14% slower!

Prolonging the pipeline achieves higher frequencies; however, the pipe break penalty increases! MUST solve the pipe break penalty problem!
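The IPC figures above follow from simple arithmetic, sketched below. The model is an assumption made for illustration: the pipeline otherwise sustains one instruction per clock, and every loop iteration pays the full flush penalty once. The function names are hypothetical.

```c
#include <assert.h>

/* IPC of a loop body of `instructions` instructions whose closing branch
   costs `flush_penalty` lost clocks per iteration. */
double loop_ipc(int instructions, int flush_penalty)
{
    assert(instructions > 0 && flush_penalty >= 0);
    return (double)instructions / (double)(instructions + flush_penalty);
}

/* Relative slowdown when IPC drops from ipc_old to ipc_new at equal frequency. */
double slowdown(double ipc_old, double ipc_new)
{
    return 1.0 - ipc_new / ipc_old;
}
```

With a 2-clock penalty, loop_ipc(4, 2) gives 4/6, about 0.66. If one extra stage sits in front of Execute, the penalty grows to 3 clocks and loop_ipc(4, 3) gives 4/7, about 0.57, which is roughly 14% lower unless the added stage buys a higher clock frequency.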

Page 101: 01 intel processor architecture core


Fighting Stalls: Branch Handling (cont.)

H/W can ‘learn’ about SW behavior

• Same branch goes same direction in most cases

• Learn branch address and target

• Branch Target Buffer (BTB)

• Predict based on branch history, surrounding branch behavior, loop behavior.

• We are at ~95% correct prediction.

• Looks in BTB while fetching instruction

• Lee & Smith or Yeh & Patt algorithms

New (and correct) pointer calculated in Fetch stage of branch

load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F/P D E W
load B[edi] → ebx               F D E W
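As a rough illustration of history-based prediction, the sketch below simulates a single 2-bit saturating counter on the loop-closing branch of the earlier example. This is one of the simplest textbook schemes and is not claimed to be the predictor used in the Core microarchitecture; the names and the initial counter state are arbitrary choices for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken. */
typedef struct { unsigned state; } counter2;

bool predict(const counter2 *c)
{
    return c->state >= 2;
}

void update(counter2 *c, bool taken)
{
    if (taken) {
        if (c->state < 3) c->state++;
    } else {
        if (c->state > 0) c->state--;
    }
}

/* Simulates the loop-closing branch for a loop of `iterations` iterations
   that is entered `runs` times; returns the number of mispredictions. */
int simulate_loop(int iterations, int runs)
{
    counter2 c = { 1 }; /* start weakly not-taken (arbitrary) */
    int miss = 0;
    for (int r = 0; r < runs; r++) {
        for (int i = 0; i < iterations; i++) {
            bool taken = (i != iterations - 1); /* taken except on loop exit */
            if (predict(&c) != taken)
                miss++;
            update(&c, taken);
        }
    }
    return miss;
}
```

For the 100-iteration loop, one run mispredicts only twice (the first branch and the loop exit), and each further run adds a single misprediction at the exit, because one not-taken outcome does not flip a saturated counter.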

Page 102: 01 intel processor architecture core


Advanced Pipeline Techniques

Limitations of the Typical Pipeline Scheme

• IPC is theoretically limited to 1

• In practice IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction) etc.

• Pipeline stages are frequently not balanced

• Cycle Time (Tc) is determined by the longest pipeline stage

Advanced Pipeline Techniques

• Super pipeline

• Super-scalar

Page 103: 01 intel processor architecture core


Advanced Pipeline Techniques (cont.)

Super pipeline: shorter stages allow higher frequency

F1 F2 D1 D2 E1 E2 W1 W2
   F1 F2 D1 D2 E1 E2 W1 W2
      F1 F2 D1 D2 E1 E2 W1 W2

Super-scalar: perform more in a single cycle

F D E W
F D E W
  F D E W
  F D E W

Page 104: 01 intel processor architecture core


Fighting stalls: Out Of Order Execution (OoO)

Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)

1. Instruction Fetch and Decode.

2. Instruction queue @ Reservation Station.

3. Instruction

• waits in the queue until all input operands are available

• leaves the queue before earlier, older instructions.

4. Instruction Execution

5. Results are queued.

6. Instruction Reorder and Writeback.

Avoid the stall that occurs on this stage in an in-order processor

Page 105: 01 intel processor architecture core


Fighting stalls: Register Renaming

Creates new opportunities for OOO execution

• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies = hazards.

Architectural vs physical registers dispatch

MULTD F4,F2,F2    reads from F2
ADDD F2,F0,F6     writes to F2

MULTD F4,F2,F2
ADDD F8,F0,F6     (assume F8 is unused)

1. mov eax, [m1]
2. add eax, 2
3. mov [m2], eax
4. mov eax, [m3]
5. add eax, 4
6. mov [m4], eax

4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!
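The effect of renaming on the six-instruction example can be sketched with a small mapping table. This is a simplified model under stated assumptions: physical registers are never recycled, eax is architectural register 0, and all names are hypothetical.

```c
#include <assert.h>

#define NUM_ARCH 8 /* architectural registers in this sketch */

typedef struct {
    int map[NUM_ARCH]; /* architectural register -> current physical register */
    int next_free;     /* next unused physical register (never recycled here) */
} renamer;

void renamer_init(renamer *r)
{
    for (int i = 0; i < NUM_ARCH; i++)
        r->map[i] = i;
    r->next_free = NUM_ARCH;
}

/* A source operand reads the physical register that currently holds the value. */
int rename_src(const renamer *r, int arch)
{
    return r->map[arch];
}

/* A destination always gets a fresh physical register,
   which removes WAW and WAR dependencies on the old one. */
int rename_dst(renamer *r, int arch)
{
    r->map[arch] = r->next_free++;
    return r->map[arch];
}
```

Applied to the example, instructions 1 to 3 end up using one pair of physical registers for eax and instructions 4 to 6 a different pair, so the two chains share no register and can execute in parallel.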

Page 106: 01 intel processor architecture core


Fighting Stalls: Re-Order Buffer (ROB)

Mechanism for renaming and retirement

Table contains in-order instructions

• Instructions are entered in order

• Registers renamed by the entry number

• Once assigned: execution order unimportant

• After execution: entries marked

• An executed entry can be “retired” once all prior instructions have retired. Retiring an instruction means:

• Update “real registers” with value of renamed regs

• Update memory

• Leave the ROB
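The retirement rule (complete in any order, retire strictly in order) can be illustrated with a minimal model. This sketch tracks only the "executed" mark per entry, uses a fixed non-circular table for brevity, and leaves out register and memory updates; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define ROB_SIZE 8

typedef struct {
    bool done[ROB_SIZE]; /* entry has finished execution */
    int head;            /* oldest instruction not yet retired */
    int tail;            /* next free entry */
} rob;

void rob_init(rob *r)
{
    r->head = 0;
    r->tail = 0;
    for (int i = 0; i < ROB_SIZE; i++)
        r->done[i] = false;
}

/* Entries are allocated in program order; returns the entry number. */
int rob_alloc(rob *r)
{
    assert(r->tail < ROB_SIZE);
    r->done[r->tail] = false;
    return r->tail++;
}

/* Marks an entry as executed; this may happen in any order. */
void rob_complete(rob *r, int entry)
{
    r->done[entry] = true;
}

/* Retires executed entries from the head, stopping at the first
   unfinished one. Returns how many entries were retired. */
int rob_retire(rob *r)
{
    int n = 0;
    while (r->head < r->tail && r->done[r->head]) {
        r->head++;
        n++;
    }
    return n;
}
```

If the youngest of three instructions finishes first, nothing retires until the oldest has completed; once the gap is closed, several entries can retire together.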

Page 107: 01 intel processor architecture core


Fighting Stalls: Reservation Station(s)

Pool(s) of all “not yet executed” instructions

Maintains operand status: “ready” / “not-ready”

Each cycle, executed instructions make more operands “ready”

Instructions whose operands are all “ready” can be “dispatched” for execution

Dispatcher chooses which of the “ready” instructions will be executed next

Page 108: 01 intel processor architecture core


Fighting Stalls: Memory Order Buffer (MOB)

Idea: allow out-of-order execution among memory operations

Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)

Structure similar in concept to ROB

Every access is allocated an entry

Address & data (for stores) are updated when known

Load is checked against all previous stores
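The check of a load against all previous stores might look like the sketch below. It is a deliberately conservative model invented for illustration: a load waits whenever an older store with an unknown address could alias it, and store data is assumed to be known whenever the address is. Real hardware may instead speculate; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool addr_known; /* store address has been computed */
    uint32_t addr;
    uint32_t data;   /* assumed valid whenever addr_known is true */
} store_entry;

typedef enum { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT } load_result;

/* Checks a load against all older stores (older[0] is the oldest),
   scanning from the youngest older store backwards. */
load_result check_load(const store_entry *older, int count,
                       uint32_t load_addr, uint32_t *value)
{
    for (int i = count - 1; i >= 0; i--) {
        if (!older[i].addr_known)
            return LOAD_MUST_WAIT;  /* might alias: conservative choice */
        if (older[i].addr == load_addr) {
            *value = older[i].data; /* youngest matching older store wins */
            return LOAD_FORWARDED;
        }
    }
    return LOAD_FROM_MEMORY;        /* no older store to this address */
}
```

Scanning from the youngest store backwards means a matching store can forward its data even if a still older store has an unresolved address, since the younger store would overwrite it anyway.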


Page 109: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability - Split Busses (core power feature)

Many buses are sized for worst case data

(x86 instruction of 15 bytes)

(ALU can write-back 128 bits)

Improved Energy Efficiency

Page 110: 01 intel processor architecture core


Intel® Core® Micro-architecture Notable Features (cont.)

Intelligent Power Capability - Split Busses (core power feature)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining C dynamic (dynamic capacitance) closer to that of thinner buses

Improved Energy Efficiency

Page 111: 01 intel processor architecture core


Agenda

Introduction

Knowledge refreshment

Notable features

Micro-architecture drill-down

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 112: 01 intel processor architecture core


Intel® Core® Micro-architecture Overview

[Block diagram. Blocks: System Bus, Bus Unit, 2nd Level Cache, 1st Level Cache (Data), Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit, Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement). Regions labelled Front End and Execution Core]

Page 113: 01 intel processor architecture core


Intel® Core® Micro-architecture Drill-down

[Block diagram. Blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Re-Order Buffer, Reservation Station, integer / FP / SIMD (3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]

Page 114: 01 intel processor architecture core


Example Code to Be Used

addps xmm0, [EAX+16]

mulps xmm0, xmm0

movps [EAX+240], xmm0

cmp EAX, 100000

jge label

Page 115: 01 intel processor architecture core


Agenda

Introduction

Knowledge refreshment

Notable features

Micro-architecture drill-down

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 116: 01 intel processor architecture core


Core® Micro-architecture Front End

Instruction preparation before execution

• Instruction Fetch Unit

• Instruction Queue

• Instruction Decode Unit

• Branch Prediction Unit

Page 117: 01 intel processor architecture core


Core® Micro-architecture Front End

Instruction Fetch Unit

Instruction Queue

Instruction Decode Unit

Branch Prediction Unit

Page 118: 01 intel processor architecture core


Instruction Fetch Unit

Prefetches instructions that are likely to be executed

Caches frequently-used instructions

Predecodes and buffers instructions

Intel® Core™ Microarchitecture – Front End

[Block diagram: Instruction Fetch Unit, BTBs/Branch Prediction, IQ/Decode, Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement), 1st Level Cache (Data), and 2nd Level Cache, grouped into Front End and Execution Core. Front End detail: icache, predecode, instruction queue, instruction decode, MS, branch prediction unit.]

Page 119: 01 intel processor architecture core


Instruction Fetch Unit (cont.)

I-Cache (Instruction Cache)

• 32 KBytes / 8-way / 64-byte line

• 16 aligned bytes fetched per cycle

ITLB (Instruction Translation Lookaside Buffer)

• 128 entries for 4 KB pages, 8 entries for 2 MB pages

Instruction Prefetcher

• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers

Instruction Pre-decoder

• Instruction Length Decode (predecode)

• Avoid Length Changing Prefixes (LCP), such as 66H and 67H

• The REX (EM64T) prefix (4xH) is not an LCP

Avoid in loop:

MOV dx, 1234h

Instruction format: Instruction Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate
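To make the `MOV dx, 1234h` case concrete: in 32-bit code the 16-bit form needs the 66H operand-size prefix, which shrinks the immediate from 4 bytes to 2 and so changes the instruction length. The C sketch below flags only that one pattern (66H followed by the B8H..BFH `MOV r, imm` opcodes). The function name `has_lcp_mov_imm16` is hypothetical, 32-bit mode is assumed, and a real length decoder handles many more opcodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the bytes start with 66H followed by MOV r16, imm16
 * (opcodes B8H..BFH), i.e. an operand-size prefix that changes the
 * immediate length. Illustrative only: other LCP cases are ignored. */
static int has_lcp_mov_imm16(const uint8_t *code, size_t len)
{
    if (len < 2)
        return 0;
    return code[0] == 0x66 && code[1] >= 0xB8 && code[1] <= 0xBF;
}

/* Encodings in 32-bit mode (assumed for this sketch):
 *   mov dx, 1234h   -> 66 BA 34 12      (prefix + 16-bit immediate)
 *   mov edx, 1234h  -> BA 34 12 00 00   (no prefix, 32-bit immediate) */
static const uint8_t mov_dx_imm16[]  = { 0x66, 0xBA, 0x34, 0x12 };
static const uint8_t mov_edx_imm32[] = { 0xBA, 0x34, 0x12, 0x00, 0x00 };
```

Where the upper half of EDX is not needed, writing the constant to the full 32-bit register avoids the prefix altogether.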

Page 120: 01 intel processor architecture core


Core® Micro-architecture Front End

Instruction Fetch Unit

Instruction Queue

Instruction Decode Unit

Branch Prediction Unit

Page 121: 01 intel processor architecture core


Instruction Queue

Buffer between instruction pre-decode unit and decoder

• Up to 6 predecoded instructions written per cycle

• Holds up to 18 instructions

• Up to 5 instructions read from the IQ per cycle

Potential Loop cache

Loop Stream Detector (LSD) support

• Re-use of decoded instruction

• Potential power saving


Page 122: 01 intel processor architecture core


Core® Micro-architecture Front End

Instruction Fetch Unit

Instruction Queue

Instruction Decode Unit

Branch Prediction Unit

Page 123: 01 intel processor architecture core


Instruction Decode

Decodes instructions into micro-ops (uops)

The uops are then ready for execution in the out-of-order (OOO) core


Page 124: 01 intel processor architecture core


Instruction Decode

Decoders

Features

• Macro-fusion

• Micro-fusion

• Stack Pointer Tracking

Page 125: 01 intel processor architecture core

Instruction Decode / Decoders

Instructions converted to micro-ops (uops)

• Single-uop instructions include load+op, stores, indirect jumps, RET...

4 decoders: 1 “large” and 3 “small”

• All decoders handle “simple” 1-uop instructions

• One large decoder handles instructions up to 4 uops

All decoders work in parallel

• Four(+) instructions / cycle

Micro-Sequencer takes over for long flows (the large decoder handles instructions of 2 to 4 uops; the microcode ROM handles more complex ones)
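The decoder arrangement above can be turned into a rough cycle estimate. The C sketch below assumes a simplified in-order model: per cycle the large decoder takes one instruction of 1 to 4 uops, and the three small decoders take the following instructions only while they are 1-uop. Instructions above 4 uops are assumed to occupy a cycle alone. The function name `decode_cycles` and these rules are illustrative assumptions, not a description of the actual hardware.

```c
#include <stddef.h>

/* Estimate decode cycles for a stream of instructions under a simplified
 * 4-1-1-1 model: per cycle, decoder 0 accepts an instruction of 1..4 uops,
 * decoders 1..3 accept only 1-uop instructions, and decoding is in order.
 * Instructions above 4 uops are assumed to take a cycle on their own
 * (micro-sequencer); real hardware behavior is more detailed. */
static unsigned decode_cycles(const unsigned *uops, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;
    while (i < n) {
        cycles++;
        if (uops[i] > 4) {      /* long flow: handled alone */
            i++;
            continue;
        }
        i++;                    /* decoder 0 takes the first instruction */
        for (int d = 1; d < 4 && i < n && uops[i] == 1; d++)
            i++;                /* small decoders take following 1-uop instructions */
    }
    return cycles;
}
```

Under this model a multi-uop instruction that is not first in its group ends the group, which is why ordering such instructions to land on the large decoder can matter.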

Page 126: 01 intel processor architecture core

Code Sequence in Front End

The IQ holds the five example instructions:

addps xmm0, [EAX+16]

mulps xmm0, xmm0

movps [EAX+240], xmm0

cmp EAX, 100000

jne label

• These instructions took more than one fetch, as they are 22 bytes

• The IQ buffers them together

• All instructions are decodable by all decoders

• CMP and the adjacent JCC are “fused” into a single uop

• Up to 5 instructions decoded per cycle

Decoders: Large (dec0), small (dec1), small (dec2), small (dec3)

Resulting uops:

load_add xmm0, xmm0, [EAX+16]

mulps xmm0, xmm0, xmm0

sta_std [EAX+240], xmm0

cmpjne EAX, 100000, label

Page 127: 01 intel processor architecture core


Instruction Decode

Decoders

Features

• Macro-fusion

• Micro-fusion

• Stack Pointer Tracking

Page 128: 01 intel processor architecture core

Instruction Decode / Macro - Fusion

Roughly 15% of all instructions are conditional branches.

Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.

Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.

Not supported in EM64T long mode

[Diagram: the fused uop cmpjae eax, [mem], label passes from the Scheduler to Execution (Branch Eval), which sends flags and target to Write back]

Page 129: 01 intel processor architecture core


Instruction Decode / Macro-Fusion Absent

• Read four instructions from Instruction Queue

• Each instruction gets decoded into separate uops

Enabling Example

for (int i=0; i<100000; i++) {

}

cmp eax, 100000

jge label

Instruction Queue:

addps xmm0, [EAX+16]

mulps xmm0, xmm0

movps [EAX+240], xmm0

cmp eax, 100000

jge label

Cycle 1: dec0 addps xmm0, [EAX+16]; dec1 mulps xmm0, xmm0; dec2 movps [EAX+240], xmm0; dec3 cmp eax, 100000

Cycle 2: dec0 jge label

Page 130: 01 intel processor architecture core


Instruction Decode / Macro-Fusion Present

• Read five instructions from Instruction Queue

• Send fusable pair to single decoder

• Single uop represents two instructions

Enabling Example

for (unsigned int i=0; i<100000; i++) {

}

cmp eax, 100000

jae label

Instruction Queue:

addps xmm0, [EAX+16]

mulps xmm0, xmm0

movps [EAX+240], xmm0

cmp eax, 100000

jae label

Cycle 1: dec0 addps xmm0, [EAX+16]; dec1 mulps xmm0, xmm0; dec2 movps [EAX+240], xmm0; dec3 cmpjae eax, 100000, label
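The two enabling examples differ only in the signedness of the loop counter. The C sketch below writes a loop with an unsigned counter, the form the slides pair with `cmp` + `jae`. The function name `sum_unsigned_counter` is hypothetical, and which instructions actually close the loop depends on the compiler and its options, so this is a starting point to check in the generated assembly rather than a guarantee of fusion.

```c
/* Sums n floats with an unsigned loop counter. With an unsigned counter a
 * compiler may close the loop with CMP plus an unsigned condition such as
 * JAE/JB, the kind of pair the slides show as fusable; a signed counter
 * leads to JGE/JL instead. The emitted instructions depend on the compiler. */
static float sum_unsigned_counter(const float *data, unsigned int n)
{
    float total = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        total += data[i];
    return total;
}
```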

Page 131: 01 intel processor architecture core


Instruction Decode / Macro – Fusion (cont.)

Benefits

• Reduced latency

• Increased rename bandwidth

• Increased retire bandwidth

• Increased virtual storage

• Power savings

Enabling Greater Performance & Efficiency

Page 132: 01 intel processor architecture core


Instruction Decode

Decoders

Features

• Macro-fusion

• Micro-fusion

• Stack Pointer Tracking

Page 133: 01 intel processor architecture core

Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

Page 134: 01 intel processor architecture core

Instruction Decode / Micro-Fusion (cont.)

u-ops of a Store “movps [EAX+240], xmm0”

• Two separate uops: sta eax+240 (store address) and std xmm0, [eax+240] (store data)

• Fused into a single uop: st xmm0, [eax+240]

Page 135: 01 intel processor architecture core


Instruction Decode

Decoders

Features

• Macro-fusion

• Micro-fusion

• Stack Pointer Tracking

Page 136: 01 intel processor architecture core

Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)

ESP is calculated by dedicated logic

• No explicit micro-ops to update ESP

• Saves micro-ops

• Saves power

[Diagram: PUSH EAX, PUSH EDX, POP EBX pass through Decoder 0, Decoder 1, ... Decoder N while dedicated logic tracks the ESP delta (ESPd; values shown: 8, 4, 0) and keeps recovery information]
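The idea of tracking the stack pointer outside the execution units can be sketched as a running offset. The C model below accumulates the ESP change implied by a run of PUSH/POP instructions. It assumes 32-bit mode (4 bytes per PUSH or POP) and a sign convention of its own, so it does not reproduce the exact values shown in the slide's diagram. The names `stack_op` and `esp_delta` are hypothetical.

```c
#include <stddef.h>

typedef enum { OP_PUSH, OP_POP } stack_op;

/* Accumulates the ESP offset implied by a run of PUSH/POP instructions,
 * the way a front-end tracker could, so that no per-instruction uop has to
 * update ESP. Assumes 32-bit mode: PUSH moves ESP by -4, POP by +4. */
static int esp_delta(const stack_op *ops, size_t n)
{
    int delta = 0;
    for (size_t i = 0; i < n; i++)
        delta += (ops[i] == OP_PUSH) ? -4 : 4;
    return delta;
}
```

For the slide's sequence PUSH EAX, PUSH EDX, POP EBX this model ends with a net offset of one 4-byte slot below the starting ESP.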

Page 137: 01 intel processor architecture core


Core® Micro-architecture Front End

Instruction Fetch Unit

Instruction Queue

Instruction Decode Unit

Branch Prediction Unit

Page 138: 01 intel processor architecture core


Branch Prediction Unit

Allows instructions to execute long before the branch outcome is decided

• Superset of Prescott / Pentium-M features

• One taken branch every other clock

• Branch predictions for 32 bytes at a time, twice the width of the fetch engine


Page 139: 01 intel processor architecture core


Branch Prediction Unit (cont.)

16-entry Return Stack Buffer (RSB)

Front end queuing of BPU lookups

Types of predictions

• Direct Calls and Jumps

• Indirect Calls and Jumps

• Conditional branches

Page 140: 01 intel processor architecture core

Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:

• Indirect Branch Predictor

• Loop Detector

Branch mispredictions reduced by >20%

Page 141: 01 intel processor architecture core

Agenda

Introduction

Knowledge preparation

Notable features

Micro-architecture drill-down

• Front End

• Out-Of-Order Execution Core

• Memory Sub-system

Coding considerations

Page 142: 01 intel processor architecture core


Core® Micro-architecture Execution Core


Accepts decoded u-ops, assigns resources, executes and retires u-ops

• Renamer

• Reservation station (RS)

• Issue ports

• Execution Unit

[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station, and the execution ports: integer / FP / SIMD (3x), load, store address, store data]

Page 143: 01 intel processor architecture core


Execution Core Building Blocks

Building blocks: Renamer, RS, ROB

Execution Unit (ports):

• Floating Point: 0, 1, 5

• Integer: 0, 1, 5

• SIMD Integer: 0, 1, 5

• SIMD/Integer MUL

Memory Sub-system (ports):

• Load: 2

• Store: 3, 4

Intel® Core™ Microarchitecture – Execution Core

Page 144: 01 intel processor architecture core


Rename and Resources

4 uops renamed / retired per clock

• one taken branch, any # of untaken

• one FXCH per cycle

Uops written to RS and ROB

• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS

• The RS waits for sources to arrive, allowing OOO execution

• Registers not “in flight” are read from the ROB during RS write


Page 145: 01 intel processor architecture core


Issue Ports and Execution Units

6 dispatch ports from RS

• 3 execution ports (shared for integer / FP / SIMD)

• load

• store (address)

• store (data)

128-bit SSE implementation

• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)

• Port 1 has packed add (3 cycles all precisions)

FP data has one additional cycle bypass latency

• Do not mix SSE FP and SSE integer ops on same register

Avoid:

addps xmm0, xmm1

pand xmm0, xmm3

addps xmm2, xmm0

Better:

addps xmm0, xmm1

addps xmm2, xmm0

pand xmm0, xmm3

Page 146: 01 intel processor architecture core


The Out Of Order

RS contents: load_add xmm0, xmm0, [EAX+16]; mulps xmm0, xmm0, xmm0; sta_std [EAX+240], xmm0; cmpjne EAX, 100000, label

• Each uop takes only a single RS entry

• load_add dispatches twice (load, then add)

• mulps dispatches once, after load + add has written back

• sta_std dispatches twice

  • sta (address) can fire as early as possible

  • std must wait for mulps to write back

• cmpjne dispatches only once (functionality is truly fused)

  • no dependency, can fire as early as it wants

Page 147: 01 intel processor architecture core


Dispatching to OOO EXE

RS holds four loop iterations, each with these uops:

load_add xmm0, xmm0, [EAX+16]

mulps xmm0, xmm0, xmm0

sta_std [EAX+240], xmm0 (store addresses in the four iterations: [EAX+240], [EAX+244], [EAX+248], [EAX+24C])

cmpjne EAX, 100000, label

Dispatch ports:

• Port 0: GP (incl FP mul)

• Port 1: GP (incl FP add)

• Port 2: Load

• Port 3: STA

• Port 4: STD

• Port 5: GP (incl jmp)

Page 148: 01 intel processor architecture core


Advanced Memory Access

L1D: 3-clock latency, 1-clock throughput; L2: 14-clock latency, 2-clock throughput

Miss Latencies

• L1 miss hits L2 ~ 10 cycles

• L2 miss, access to memory ~300 cycles (server/FBD)

• L2 miss, access to memory ~165 cycles (Desktop/DDR2)

• C-step Broadwater is reported to have ~50 ns latency

Cache Bandwidth

• Bandwidth to cache ~ 8.5 bytes/cycle

Memory Bandwidth

• Desktop ~6 GB/sec/socket (Linux)

• Server ~3.5 GB/sec/socket

Intel® Core™ Microarchitecture – Memory Sub-system
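The latencies above can be combined into a back-of-the-envelope average load latency. The C sketch below is a standard weighted-average model, not something taken from the slides. It assumes each latency is the total load-to-use time for a hit at that level, and the function name `avg_load_latency` and the hit rates used with it are hypothetical inputs.

```c
/* Average load latency in cycles for a two-level cache, given hit rates in
 * [0, 1]. Latencies are assumed to be total load-to-use cycles for a hit
 * at that level (L1, L2, or memory). */
static double avg_load_latency(double l1_hit, double l2_hit,
                               double l1_lat, double l2_lat, double mem_lat)
{
    double l1_miss = 1.0 - l1_hit;
    return l1_hit * l1_lat
         + l1_miss * l2_hit * l2_lat
         + l1_miss * (1.0 - l2_hit) * mem_lat;
}
```

With the desktop figures (3, 14, and 165 cycles) and assumed hit rates of 95% in L1 and 90% in L2, the model gives roughly 4.3 cycles per load, most of the excess coming from the small fraction of loads that reach memory.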

Page 149: 01 intel processor architecture core


Optimizing for Intel® Core™ Microarchitecture

Use CMP (chip multiprocessing): employ both cores

• Go to multithreading!

Prefer SSE as much as possible. If you have not done so yet, vectorize the code now!

• The Intel Compiler has a very good vectorization engine

Align data and lay it out sequentially

• To align, use __declspec(align(16)) float a[1000];
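`__declspec(align(16))` is the Microsoft/Intel-compiler spelling. As a hedged, portable alternative, the C11 sketch below declares the same array with `alignas(16)` and checks the address at run time. The helper name `is_aligned_16` is hypothetical.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 counterpart of __declspec(align(16)) float a[1000]; */
static alignas(16) float a[1000];

/* Returns 1 if p sits on a 16-byte boundary, which aligned SSE loads and
 * stores (e.g. movaps) require. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}
```

Every fourth float of `a` starts a new 16-byte block, so a loop that processes four floats per iteration from `a[0]` stays aligned throughout.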

Page 150: 01 intel processor architecture core


Optimizing for Intel® Core™ Microarchitecture (advanced)

Use the Intel VTune™ Performance Analyzer to reveal performance problems

• CPI (cycles per instruction)

• Specific CPU events for Core-arch:

RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL, etc. (see VTune help)
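CPI is simply cycles divided by instructions retired. The C sketch below computes it from two raw counts. The event names in the comment (`CPU_CLK_UNHALTED.CORE`, `INST_RETIRED.ANY`) are the ones usually paired for this on Core-based processors, but treat them as an assumption and confirm them in the VTune help. The function name `cpi` is hypothetical.

```c
/* CPI = unhalted core cycles / instructions retired. The caller supplies
 * the two counts, e.g. from the CPU_CLK_UNHALTED.CORE and INST_RETIRED.ANY
 * events (event names assumed here; check the VTune help). */
static double cpi(unsigned long long cycles, unsigned long long instructions)
{
    if (instructions == 0)
        return 0.0;   /* no instructions retired: CPI undefined, report 0 */
    return (double)cycles / (double)instructions;
}
```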

Page 151: 01 intel processor architecture core


Front End Issue Debugging

Look for Front End optimization only when code is FE bound

• The reservation station (RS) is the front end and allocation target

• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue

  • If there are no issues in the FE, the RS should be full above 30% of the time

Front End typical issues:

• Code is too big to fit in the L1:

  • When L2_IFETCH.SELF.MESI happens every 10-15 instructions

  • Code that could have had a CPI of 1 will be around 2

  • 14-cycle penalty for an L1 demand miss

• Average instruction size above 6 bytes

  • Happens typically with SSE code, and more with EM64T

  • Can have an impact only in case of otherwise excellent CPI

• Code with length changing prefix (LCP) issues

  • Penalty of 6 cycles or more

  • Look at the ILD_STALL VTune event

Front-End should not be the bottleneck.

Focus on Front End issues only if it is the issue.
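The triage rule on this slide (poor CPI together with an RS that is rarely full) can be written as a small check. The C sketch below assumes RESOURCE_STALLS.RS_FULL is collected as a cycle count and takes the "poor CPI" cutoff as a parameter, since the slides do not give one. The function name `looks_front_end_bound` is hypothetical.

```c
/* Rule of thumb from the slide: poor CPI together with the RS being full
 * for less than about 30% of cycles points at the front end. The CPI
 * threshold is an input of this sketch, not a value from the slides. */
static int looks_front_end_bound(unsigned long long rs_full_cycles,
                                 unsigned long long total_cycles,
                                 double cpi, double poor_cpi_threshold)
{
    if (total_cycles == 0)
        return 0;
    double rs_full_ratio = (double)rs_full_cycles / (double)total_cycles;
    return cpi > poor_cpi_threshold && rs_full_ratio < 0.30;
}
```

A positive result is only a hint to look at the front-end events listed above (L2_IFETCH.SELF.MESI, ILD_STALL), not a diagnosis by itself.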

Page 152: 01 intel processor architecture core


Execution micro architecture

The busiest port may determine the potential execution speed

Single clock latency operations are best

• Operations with different latencies can create writeback conflicts, creating a bubble in the port

Look at the dependency chains to see the potential parallelism

• Remember that the RS has only 32 entries, and only those instructions are candidates for scheduling to the execution ports

• The RESOURCE_STALLS.RS_FULL percentage is high if the code is latency bound

• The ROB has 96 entries

• The RESOURCE_STALLS.ROB_FULL percentage is high only if

  • Code has long latency instructions (L2 misses)

  • Other code can be executed while waiting

Execution stage: The key for good performance.

Focus on port utilization and dependency chains
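A common way to shorten a dependency chain is to split one accumulator into several independent ones. The C sketch below contrasts the two forms for a float sum. The function names are hypothetical and the choice of four accumulators is arbitrary. Because floating-point addition is not associative, the second version can differ from the first in the last bits for general inputs.

```c
#include <stddef.h>

/* One accumulator: every add waits for the previous add (one long chain). */
static float sum_one_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent chains that can overlap in the
 * out-of-order core. This changes the floating-point summation order, so
 * results can differ in the last bits for general inputs. */
static float sum_four_chains(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

With a 3-cycle packed or scalar add, the single chain can retire at most one add every 3 cycles, while independent chains let the scheduler keep the add port busier.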

Page 153: 01 intel processor architecture core


Execution micro architecture

The Divider is a big potential stall source

• DIV: number of divide operations executed

• IDLE_DURING_DIV: number of cycles with no port issue while the divider is busy

• Try to find some useful work to do in parallel with divide operations
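Besides overlapping other work with divides, a related technique (swapped in here, not taken from the slides) is to reduce how many divides are issued at all. The C sketch below replaces a per-element divide by a loop-invariant divisor with one divide and a multiply per element. The function names are hypothetical, and the results match true division exactly only when `1/d` is exactly representable (for example `d = 4`).

```c
#include <stddef.h>

/* Divides every element by d: one divide per element keeps the divider busy. */
static void scale_by_divide(float *x, size_t n, float d)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] / d;
}

/* One divide up front, then multiplies. Results can differ slightly from
 * true division unless 1/d is exactly representable (e.g. d = 4). */
static void scale_by_reciprocal(float *x, size_t n, float d)
{
    float r = 1.0f / d;
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * r;
}
```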

Extra cycle latency for bypass between execution domains

• For example: FP (ADDPS) and logical ops (PAND) on XMMn

• DELAYED_BYPASS.FP

• DELAYED_BYPASS.LOAD

• DELAYED_BYPASS.SIMD

[Diagram: EXE ports 0, 1, 5 serve the Floating Point, Integer, and SIMD Integer units (with SIMD/Integer MUL); port 2 is load, port 3 is store (address), port 4 is store (data); these feed the Data Cache Unit with dtlb, memory ordering, and store forwarding]

Page 154: 01 intel processor architecture core


Enhancements and Optimization Opportunities

IP Prefetcher

• Prefetches strided loads associated with the same IP

• Uses a history table

• Use VTune events to identify misses where prefetches were expected

Memory Disambiguation

• Predicts when it is OK to fire a load before preceding stores with unknown addresses

• Misprediction triggers a pipeline flush and load restart

• Disambiguation is temporarily disabled if it fails frequently

• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address

• If the load and store are not to the same address, a possible reason for disambiguation not working is an address collision with other load(s)

Page 155: 01 intel processor architecture core


4k Aliasing

• The OOO engine can fire a Load before a preceding Store if it does not collide with the Store’s address

• Address collision serializes execution

• Address checking uses only the last 12 bits (4K)

• False blocking occurs if the Load’s and Store’s addresses have a 4KB offset

• e.g. accessing large, power of two, sized arrays in a loop

• Resolve 4K aliasing conflicts by changing memory layout

• VTune event LOAD_BLOCK.OVERLAP_STORE
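Since the check uses only the low 12 address bits, a pair of accesses can be tested for this false-blocking pattern directly. The C sketch below is a simplified test that ignores access sizes and partial overlaps. The function name `is_4k_alias` is hypothetical.

```c
#include <stdint.h>

/* Returns 1 if a load and an earlier store touch different addresses whose
 * low 12 bits match, the case the slide calls false blocking. Access sizes
 * and partial overlaps are ignored in this sketch. */
static int is_4k_alias(uintptr_t store_addr, uintptr_t load_addr)
{
    return store_addr != load_addr &&
           ((store_addr ^ load_addr) & 0xFFFu) == 0;
}
```

One way to change the layout, under these assumptions, is to pad one of the arrays so that corresponding elements no longer sit a multiple of 4096 bytes apart.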

Load block cases

• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched

• Store address unknown - LOAD_BLOCK.STA

• Loads blocked by a preceding store with unknown address

• Store data unknown - LOAD_BLOCK.STD

• Loads blocked by a preceding store with unknown data

• Loads blocked until retirement - LOAD_BLOCK.UNTIL_RETIRE

• This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
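Whether an access is a split load can be checked from its address and size. The C sketch below assumes a 64-byte line, the size quoted earlier for the instruction cache and assumed here to apply to the data cache as well. The function name `is_split_access` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumed line size for this sketch */

/* Returns 1 if an access of `size` bytes starting at `addr` crosses a
 * cache line boundary (a split load). Assumes size >= 1. */
static int is_split_access(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

A 16-byte SSE access that starts on a 16-byte boundary can never split under this model, which is one more reason to align data as recommended above.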

Other Opportunities for Performance Gain in the memory sub-system