01 intel processor architecture core
TRANSCRIPT
Intel® Core™ Microarchitecture
Intel® Software College
Copyright © 2006, Intel Corporation. All rights reserved.
Intel® Processor Micro-architecture - Core® microarchitecture
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objectives
After completion of this module you will be able to describe
• Components of an IA processor
• Working flow of the instruction pipeline
• Notable features of the architecture
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Industrial Recognition
PC Format, May 2006
“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”
Intel Regains Performance Crown, Anandtech
“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”
Intel Reveals Conroe Architecture, Extremetech
“… And not only was the Intel system running at 2.66GHz - a slower clock rate than the top Pentium 4 - it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
Conroe Benchmarks - Intel Showing Big Strength, Hot Hardware.com
“… Intel is poised to change the face of the desktop computing landscape…”
Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com
“… the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session…”
Intel's Next Generation Microarchitecture Unveiled, Real World Tech
“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
Performance Summary
Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
• Conroe & Woodcrest drive clear Desktop/Server performance leadership
• Merom extends Intel Mobile performance leadership
Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
• Intel’s 3rd generation dual-core (while the competition is stuck on its 1st generation)
• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient
The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations
Best Processor on the Planet: Energy-Efficient Performance¹
20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts¹!
¹ Based on SPECint*_rate_base2000
Agenda
Introduction
Knowledge preparation
• Architecture vs. Microarchitecture
• CISC vs. RISC
• Performance Measurements
• Pipeline Design
• Power and Energy
• Chip Multi-Processing
Notable features
Micro-architecture tour
Coding considerations
Architecture and Micro-architecture
What is Computer Architecture?
• Architecture is the set of features which are externally visible:
• Instruction set
• Registers
• Addressing modes
• Bus protocols
Intel Architectures (IA)
• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
• X87 (Floating Point extension)
• MMX (Multi-Media extension)
• SSE, SSE2, SSE3 (SIMD Streaming Extension)
• Intel® 64/EM64T (64-bit Integer extension of IA32)
• IA64 (Intel new 64-bit architecture)
• Itanium/Itanium 2 processor family
Architecture and Micro-architecture (cont.)
What is Micro-architecture?
• Same as µ-Architecture or u-Architecture
• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
• Programs run faster → Improved Performance
• Reduced Power consumption → Extended Battery life
• H/W fits into Smaller Form Factor
Intel® Architecture History
Architecture: Instruction set definition and compatibility
• Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)
Microarchitecture: Hardware implementation maintaining instruction set compatibility with high-level architecture
• Examples: P5, P6, Intel NetBurst®, Banias
Processors: Productized implementation of Microarchitecture
• Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M
* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing
Mobile Microarchitecture
Intel® NetBurst®
+ New Innovations
Intel® Core™ Microarchitecture Processors
Intel® Core™ 2 Duo/Quad/Extreme processors
RISC Approach to CPU design
Optimize H/W for common basic operations
• Fixed instruction length
• Shorter Execution Pipeline
• Ease of Instruction Level Parallelism
• Large number of registers
• Fewer memory accesses
• ‘Load/Store’ architecture
• Shorter Execution Pipeline
• Ease of advancing Loads
• Branch Hints
• Reduce pipeline flush events
• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
• No ‘complex’ H/W instructions
• Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun Sparc
Achieve Maximum performance by right partitioning between H/W and S/W
(RISC = Reduced Instruction Set Computers)
CISC Approach to CPU design
Rich architecture
• Variable length instructions.
• Complex addressing modes.
On-chip HW / SW partitioning required
• H/W keeps executing ‘simple’ stuff
• Complex instructions are ‘emulated’ using u-code routines from ROM
• More instructions treated as ‘simple’ as more H/W is available
COMPATIBILITY has some major advantages:
• Large (and forever increasing) software base
• Code development tools
• Expertise
• H/W - S/W spiral
Example: Intel IA32, Motorola 680X0
Maximize information passed to the HW
(CISC = Complex Instruction Set Computers)
Performance Measurement
Performance is the reciprocal of the “Time of execution”:
Performance ≈ 1 / (Time of Execution) = 1 / (L * CPI * Tc)
Where:
L = Code Length (# of machine instructions)
CPI = Clock cycles Per Instruction
Tc = Clock period (nSecs)
Substitute:
IPC = Instructions Per Cycle = 1/CPI
F = Frequency = 1/Tc
Performance ≈ (IPC * F) / L
Ways to raise performance:
• Improve ILP (raises IPC)
• Improve Timing (raises F)
• Arch Enhancements (reduce L)
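The relation between L, CPI, Tc, IPC and F can be checked with a few lines of C. This is only a sketch of the slide's formula; the function names and the choice of units (nanoseconds and GHz) are assumptions made for the example, not part of the course material.

```c
#include <assert.h>

/* Time of execution = L * CPI * Tc, with Tc in nanoseconds. */
double execution_time_ns(double code_length, double cpi, double clock_period_ns)
{
    assert(code_length > 0.0 && cpi > 0.0 && clock_period_ns > 0.0);
    return code_length * cpi * clock_period_ns;
}

/* Performance = IPC * F / L. With F in GHz (cycles per nanosecond)
 * the result is "program runs per nanosecond", i.e. 1 / execution time. */
double performance_per_ns(double code_length, double ipc, double freq_ghz)
{
    assert(code_length > 0.0 && ipc > 0.0 && freq_ghz > 0.0);
    return ipc * freq_ghz / code_length;
}
```

For instance, 1000 instructions at CPI = 2 and Tc = 0.5 ns take 1000 ns; the same machine described as IPC = 0.5 at F = 2 GHz gives the reciprocal, 0.001 runs per nanosecond.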
Performance Measurement (cont.)
Performance considerations:
• Which Code/Application to run?
• Which OS?
• Which other components in the platform?
• Under which thermal conditions?
• Multithreading? Multiprocessing?
Benchmarks examples
• Industry Standard
• SPEC (SPECint, SPECfp)
• TPC
• Commercial
• SysMark
• MobileMark
• PCMark
• Sandra
• ScienceMark
• Applications
• Video (Windows Media encoder, DivX)
• Audio (Lame MP3)
• Compression (RAR)
• Content creation (3DSM, Photoshop, Premiere)
• Latest Games (Doom III, FarCry, but changes fast)
• Specific industries use specific benchmarks
• Linux compilation, POVRay, LinPack, lmbench
Design Considerations for Different Market Segments
Constraints:
• Thermally, area constrained → Desktop
• Unconstrained → Extreme
• Very area constrained → Value
• Thermally, Energy and Area constrained → Mobile
• Thermally, Energy → Servers
Micro-architecture is the Art of Tradeoffs between:
• Schedule
• Requirements / Standards
• Performance
• Features
• Power / Energy
• Area / Cost
Design Metrics
IPC = Instructions per Cycle
• The more the better
Latency – same as Response Time
• The time interval between
• when any request for data is made and
• when the data transfer completes
• The less the better
Throughput
• The amount of work completed by the system per unit of time.
• The more the better
• ops/sec
CPU Pipeline
Break the work into smaller pieces
• Four basic stages of instruction life
• Fetch - bring instruction to core
• Decode - read operands from register
• Execute - perform the operation
• Writeback - save result to register
• Execution timing of simple instructions
(legend: “op src1, src2 → dst”)
add eax, ebx → eax   F D E W
sub ecx, edx → ecx   F D E W
Increased throughput
• increased number of completed instructions per cycle
Pipeline Design - Explore Parallelism
A new instruction does not always depend on the previous one
• Can start new instruction before previous one is finished
• ...if different stages use different H/W resources
Run instructions in parallel (pipeline)
add eax, ebx → eax   F D E W
sub ecx, edx → ecx     F D E W
or  edi, esi → edi       F D E W
Need to balance pipe stages
• Each stage should take same time for best throughput and utilization
Fetch → Decode → Exec → WB
Clock cycle is determined by the longest path!
[Figure: three overlapped Fetch → Decode → Exec → WB sequences, each starting one stage after the previous one]
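The benefit of overlapping stages, and the cost of an unbalanced stage, can be put into numbers. The sketch below assumes an ideal pipeline with no stalls and one instruction entering per cycle; the function names are made up for this illustration.

```c
#include <stddef.h>

/* Each instruction runs all stages before the next one starts. */
unsigned long cycles_unpipelined(unsigned long n, unsigned long stages)
{
    return n * stages;
}

/* Ideal pipeline: the first instruction needs `stages` cycles,
 * every following instruction completes one cycle later. */
unsigned long cycles_pipelined(unsigned long n, unsigned long stages)
{
    return (n == 0) ? 0 : stages + (n - 1);
}

/* The clock period must cover the slowest stage (the longest path). */
double clock_period_ns(const double *stage_ns, size_t stage_count)
{
    double longest = 0.0;
    for (size_t i = 0; i < stage_count; i++) {
        if (stage_ns[i] > longest) {
            longest = stage_ns[i];
        }
    }
    return longest;
}
```

With the three instructions and four stages (F, D, E, W) from the slide, the unpipelined count is 12 cycles and the pipelined count is 6. If one stage takes 2.0 ns while the others take 1.0 to 1.5 ns, every cycle still has to last 2.0 ns, which is why balanced stages matter.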
Pipeline Design – Fighting Stalls
Data flow dependency (instructions output/input)
• Solved by bypasses, renaming etc
Control flow dependencies
• Solved by branch prediction
Others (Cache misses, long latency instructions)
• Solved by other dynamic scheduling techniques
Race of CISC vs. RISC
In modern CPUs Advanced µ-Architecture Techniques minimize the advantages of RISC over CISC
• Branch Prediction
• Reduces the effect of extra pipeline stages
• Register Renaming
• Effectively Increase the Number of Registers
• Out Of Order
• Reduce Number of stalls caused by shortage of registers
• Speculative Execution
• Further Reduce Number of stalls
• Power saving features
• Reduce the overhead when not needed.
µop – Intel’s Take on the CISC/RISC Race
(CISC) instructions are translated into one or more (RISC) uops (micro-operations)
• Fixed format
• Wide and simple
• Temp registers
Usually one uop per instruction
A complex instruction can expand to thousands of uops
Stores are divided into two uops (STA and STD)
Fusion changes these counts (macro-fusion and micro-fusion)
Power and Energy
Maximum power (TDP):
• → Cooling requirements
• → Cooling solution
• → Computer form factor and acoustic noise
Average power
• → Battery life
• → Electricity bill
General calculation:
• P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
• Fewer transistors and wires
• Smaller transistors and wires
• Power features → less activity
• Low leakage transistors
Reducing average power
• Energy efficiency
• Power states
• Lower leakage
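The general calculation above is easy to turn into a small function, which makes the quadratic effect of voltage visible. This is a sketch of the slide's formula only; the function names, the SI units and the sample values in the note are assumptions for illustration, not measured data.

```c
#include <assert.h>

/* Dynamic part: frequency * voltage^2 * activity factor * capacitance. */
double dynamic_power_watts(double freq_hz, double voltage_v,
                           double activity, double capacitance_f)
{
    assert(freq_hz >= 0.0 && activity >= 0.0 && capacitance_f >= 0.0);
    return freq_hz * voltage_v * voltage_v * activity * capacitance_f;
}

/* P = dynamic power + leakage, as on the slide. */
double total_power_watts(double freq_hz, double voltage_v, double activity,
                         double capacitance_f, double leakage_w)
{
    return dynamic_power_watts(freq_hz, voltage_v, activity, capacitance_f)
           + leakage_w;
}
```

With made-up inputs of 2 GHz, 1.0 V, an activity factor of 0.5, 10 nF of switched capacitance and 2 W of leakage, the result is about 12 W. Halving the voltage alone cuts the dynamic part to a quarter, which is why voltage scaling is such an effective power feature.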
Dual/Multi Core and SMT
Put more than one core per package
Architectural change:
• Software must be multi-threaded or multi-process
• …but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
• All of them being used
[Figure: several package layouts, each combining Core, LLC and I/O blocks in a different arrangement]
SMT: Run two (or more) threads on the same core, simultaneously
Intel Approach
While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread-level parallelism.
[Figure: timeline of hardware threads per package; legend: State, Execution Units, Cache, Bus]
• Q4 2000: Intel® Pentium®, 1 thread
• Q2 2003: Intel® Pentium® with HT, 2 threads
• Q2 2005: Intel® Pentium® D Processor, 2 threads
• Q3 2006: Intel® Core™ 2 Duo, 2 threads
• Q4 2006: Intel® QX6700*, 4 threads
• ?: 80 threads
An “Acronym Cheat Sheet” of Parallel Computing
CMP: Chip Multi Processor (two or more cores per package)
• Dual Core: two cores in same package
• Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Simultaneous Multi-Threading (virtual multi-core: Hyper-Threading)
Agenda
Introduction
Knowledge preparation
Notable features
• Wide Dynamic Execution
• Smart Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Intelligent Power Capability
Micro-architecture tour
Coding considerations
Intel® Core® Micro-architecture Notable Features
Intel® Wide Dynamic Execution
• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
• Roughly 15% of all instructions are conditional branches
• Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
• Micro-fusion
• Merges the load and operation micro-ops into one micro-op
• 64-Bit Support
• Merom, Conroe, and Woodcrest support EM64T
[Figure: core block diagram. Instruction Fetch and PreDecode → Instruction Queue → Decode (with uCode ROM) → Rename/Alloc → Retirement Unit (ReOrder Buffer) → Schedulers → execution ports (ALU, Branch, MMX/SSE, FPmove; ALU, FAdd, MMX/SSE, FPmove; ALU, FMul, MMX/SSE, FPmove; Load; Store) → L1 D-Cache and D-TLB → 2M/4M shared L2 Cache → FSB, up to 10.4 Gb/s. Width labels on the diagram: 4, 4, 5]
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Smart Memory Access
• Improved prefetching
• Memory disambiguation
• Advance load before a possible data dependency (pointer conflict)
• Earlier loads hide memory latencies
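The pointer conflict mentioned above shows up in code like the following, where a store through one pointer is followed by a load through another. This is an illustrative sketch with a made-up function name; it shows the source pattern only, since whether the hardware actually advances the load is decided at run time and is not visible in the program's result.

```c
/* A store followed by a load through a possibly different pointer.
 * Until both addresses are known, the hardware cannot be sure whether
 * the load depends on the store. Memory disambiguation lets the load
 * start early when no conflict is expected. */
int store_then_load(int *dst, const int *src, int value)
{
    *dst = value;      /* store: target address may resolve late   */
    return *src + 1;   /* load: candidate for being advanced early */
}
```

The result is the same whether or not the two pointers alias: with distinct objects the load returns the old value of `*src`, and with `dst == src` it returns the value just stored. The gain described on the slide is purely in hiding memory latency.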
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
• Shared between the two cores
• Advanced Transfer Cache architecture
• Reduced bus traffic
• Both cores have full access to the entire cache
• Dynamic Cache sizing
Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache
[Figure: CPU1 and CPU2 connected to Memory over the Front Side Bus (FSB); a Cache Line is shipped from one CPU to the other]
Shipping an L2 Cache Line costs about half an access to memory
Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache (cont.)
[Figure: CPU1 and CPU2 sharing one L2 cache that holds the Cache Line, connected to Memory over the Front Side Bus (FSB)]
L2 is shared: no need to ship the cache line
Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Digital Media Boost
• Single Cycle SIMD Operation
• 8 Single Precision Flops/cycle
• 4 Double Precision Flops/cycle
• Wide Operations
• 128-bit packed Add
• 128-bit packed Multiply
• 128-bit packed Load
• 128-bit packed Store
• Support for Intel® EM64T instructions
[Figure: SIMD operation (SSE/SSE2/SSE3/SSSE3) on 128-bit operands, bits 127 to 0. SOURCE holds X4 X3 X2 X1 and Y4 Y3 Y2 Y1; the SSE/2/3 OP writes X4opY4, X3opY3, X2opY2, X1opY1 to DEST. Previous: results produced over CLOCK CYCLE 1 and CLOCK CYCLE 2. Core™ µarch: all four results in CLOCK CYCLE 1]
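From C, a 128-bit packed add like the one in the figure is usually reached through compiler intrinsics. The sketch below assumes a GCC- or Clang-style compiler that defines `__SSE__` on x86 targets and falls back to a scalar loop elsewhere; the function name `packed_add4` is invented for this example.

```c
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* dest[i] = x[i] + y[i] for four single-precision lanes. */
void packed_add4(const float *x, const float *y, float *dest)
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);              /* load X4..X1 (unaligned ok) */
    __m128 vy = _mm_loadu_ps(y);              /* load Y4..Y1                */
    _mm_storeu_ps(dest, _mm_add_ps(vx, vy));  /* one 128-bit packed add     */
#else
    for (int i = 0; i < 4; i++) {             /* portable scalar fallback   */
        dest[i] = x[i] + y[i];
    }
#endif
}
```

On the SSE path the four additions are a single ADDPS instruction. How many cycles that instruction needs depends on the microarchitecture, which is the point of the comparison in the figure.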
Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Additional Media Instructions - Supplemental Streaming SIMD Extensions 3 (SSSE3)
• 16 new packed integer instructions
• Targeting video encode/decode
• Significantly improved strings
• REP MOVS and REP STOS
• ~8 bytes / cycle throughput
• mileage may vary
Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Supplemental SSE-3 (SSSE-3)
Packed SIGN: PSIGNB/W/D
Packed Shuffle Bytes: PSHUFB
Packed multiply High with Round and Scale: PMULHRSW
Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW
Packed Align Right: PALIGNR
Packed Absolute Values: PABSB, PABSW, PABSD
Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability
• Advanced power gating & Dynamic power coordination
• Multi-point demand-based switching
• Voltage-Frequency switching separation
• Supports transitions to deeper sleep modes
• Event blocking
• Clock partitioning and recovery
• Dynamic Bus Parking
• During periods of high performance execution, many parts of the chip core can be shut off
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Core® Micro-architecture Drill-down
[Figure: core block diagram. Blocks shown: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Re-Order Buffer, Reservation Station, integer / FP / SIMD (3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]
Core® Micro-architecture Front End
Prepares instructions before they are executed
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
[Figure: front-end blocks: icache, predecode, instruction queue, instruction decode, MS, branch prediction unit]
Instruction Queue
Intel® Core™ Microarchitecture – Front End
Buffer between instruction pre-decode unit and decoder
• up to six predecoded instructions written per cycle
• 18 Instructions contained in IQ
• up to 5 Instructions read from IQ
Potential Loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Macro - Fusion
Roughly ~15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
Intel® Core™ Microarchitecture – Front End
[Figure: the fused micro-op cmpjae eax, [mem], label goes through the Scheduler to Execution and Branch Eval, with flags and target passed to Write back]
Macro-Fusion Absent
Intel® Core™ Microarchitecture – Front End
Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops
Enabling Example
for (int i=0; i<100000; i++) {
…
}
cmp eax, 100000
jge label
[Diagram: the Instruction Queue holds cmp eax, 100000; jge label; addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0. Decoders dec0 to dec3 each take one instruction in cycle 1; the remaining instruction goes to dec0 in cycle 2.]
Macro-Fusion Present
Intel® Core™ Microarchitecture – Front End
Read five instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions
Enabling Example
for (unsigned int i=0; i<100000; i++) {
…
}
cmp eax, 100000
jae label
[Diagram: the same five instructions are all decoded in cycle 1 by dec0 to dec3; the cmp/jae pair goes to one decoder as the single uop "cmpjae eax, 100000, label".]
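The two enabling examples differ only in the signedness of the loop counter. The sketch below restates them as complete C functions. It assumes a compiler that emits cmp/jge for the signed loop test and cmp/jae for the unsigned one, as the slides show; real code generation depends on the compiler and its flags. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Signed counter: per the slides, the loop test becomes cmp + jge,
   which is decoded as two separate uops. */
static float sum_signed(const float *a, int n)
{
    assert(a != NULL && n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unsigned counter: per the slides, the loop test becomes cmp + jae,
   a pair that macro-fusion can merge into one uop. */
static float sum_unsigned(const float *a, unsigned int n)
{
    assert(a != NULL);
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions compute the same result; only the branch instruction chosen for the loop test is expected to differ.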
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Fusion (cont.)
Intel® Core™ Microarchitecture – Front End
u-ops of a Store “movps [EAX+240], xmm0”
• Without micro-fusion: sta eax+240 and std xmm0, [eax+240]
• With micro-fusion: st xmm0, [eax+240]
Branch Prediction Improvements
Intel® Core™ Microarchitecture – Front End
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
[Diagram: register alias table, ALLOC, Re-Order Buffer and Reservation Station, with dispatch to integer / FP / SIMD (3x), load, store address and store data]
Execution Core Building Blocks
Intel® Core™ Microarchitecture – Execution Core
Blocks: Renamer, RS, ROB, Execution Unit, Memory Sub-system
Execution Unit: Ports (number)
• Floating Point: 0, 1, 5
• Integer: 0, 1, 5
• SIMD Integer: 0, 1, 5
• SIMD/Integer MUL
Memory Sub-system: Ports (number)
• Load: 2
• Store: 3, 4
Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports
• (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
Intel® Core™ Microarchitecture – Execution Core
Retirement Unit
ReOrder Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• updates the architectural state in order
• manages ordering of exceptions
Intel® Core™ Microarchitecture – Execution Core
[Diagram: register alias table, ALLOC, Re-Order Buffer, Reservation Station]
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Core® Micro-architecture Memory Sub-System
Memory Ordering Buffer
• Store Address Buffer
  • Stores the address of each store not yet performed
  • Each load compares its address to any store older than itself
  • If it finds a hole…
• Store Data Buffer
  • Stores the data of each store not yet performed
  • If a load hits in the SAB, the data is forwarded from here
• Load Buffer
  • Stores the address of non-retired loads
  • For snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different memory locations
• Out-of-order memory operations
Core® Micro-architecture Memory Sub-System (cont.)
32 KB D-Cache (8-way, 64-byte line size)
Shared second level (L2) 2 MB 8-way or 4 MB 16-way instruction and data cache
Cache to cache transfer
• improves producer / consumer style MP
Wider interface to L2
• reduced interference
• processor line fill is 2 cycles
Higher bandwidth from the L2 cache to the core
• ~14 clock latency and 2 clock throughput
Load & Store Access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory
[Diagram: Core1 and Core2 sharing a 2 MB L2 Cache connected to the Bus]
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Enhanced Data Pre-fetch Logic
Speculates the next needed data and loads it into cache by HW and/or SW
[Diagram, parking analogy: Door (L1 Cache), Valet Parking Area (L2 Cache), Main Parking Lot (External Memory)]
Intel® Core™ Microarchitecture – Memory Sub-system
Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)
• L1D cache prefetching
  • Data Cache Unit Prefetcher
    • Known as the streaming prefetcher
    • Recognizes ascending access patterns in recently loaded data
    • Prefetches the next line into the processor's cache
  • Instruction Based Stride Prefetcher
    • Prefetches based upon a load having a regular stride
    • Can prefetch forward or backward 2 Kbytes
    • 1/2 default page size
• L2 cache prefetching: Data Prefetch Logic (DPL)
  • Prefetches data to the 2nd level cache before the DCU requests the data
  • Maintains 2 tables for tracking loads
    • Upstream – 16 entries
    • Downstream – 4 entries
  • Every load is either found in the DPL or generates a new entry
  • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load
Intel® Core™ Microarchitecture – Memory Sub-system
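The two L1D prefetchers described above key on different access patterns. The following sketch shows one loop of each kind in C. It only illustrates the shape of the accesses and makes no claim about how a particular processor will react; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Ascending, consecutive addresses: the pattern the slides attribute
   to the streaming (Data Cache Unit) prefetcher. */
static long sum_sequential(const int *data, size_t n)
{
    assert(data != NULL);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* A single load instruction with a fixed stride (in elements): the
   pattern the slides attribute to the instruction-based stride
   prefetcher, which they say reaches 2 Kbytes forward or backward. */
static long sum_strided(const int *data, size_t n, size_t stride)
{
    assert(data != NULL && stride > 0);
    long s = 0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += data[i];
    return s;
}
```

Both functions visit every element once, so they return the same sum; they differ only in the order of the addresses they touch.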
Advanced Memory Access / Memory Disambiguation
Intel® Core™ Microarchitecture – Memory Sub-system
Memory Disambiguation predictor
• Loads that are predicted NOT to forward from preceding store are allowed to schedule as early as possible
• increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
• Extension to existing coherency mechanism
• Invisible to software and system
Advanced Memory Access / Memory Disambiguation Absent
Intel® Core™ Microarchitecture – Memory Sub-system
Load4 must WAIT until previous stores complete
[Diagram: operations Store1 Y, Load2 Y, Store3 W, Load4 X; memory holds Data W, Data X, Data Y, Data Z]
Advanced Memory Access / Memory Disambiguation Present
Intel® Core™ Microarchitecture – Memory Sub-system
Loads can decouple from stores
Load4 can get its data WITHOUT waiting for stores
[Diagram: operations Store1 Y, Load2 Y, Store3 W, Load4 X; memory holds Data W, Data X, Data Y, Data Z]
Advanced Memory Access / Store Forwarding
Intel® Core™ Microarchitecture – Memory Sub-system
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
[Diagram: Store1 Y places Data Y in the internal buffers on its way to memory; Load2 Y reads Data Y]
Advanced Memory Access / Store Forwarding: Aligned Store Cases
[Diagram: each aligned store drawn above the aligned loads that fit inside it]
• store 128 bit: 1 × load 128 bit, 2 × load 64 bit, 4 × load 32 bit, 8 × load 16, 16 × ld 8
• store 64 bit: 1 × load 64 bit, 2 × load 32 bit, 4 × load 16, 8 × ld 8
• store 32 bit: 1 × load 32 bit, 2 × load 16, 4 × ld 8
• store 16: 1 × load 16, 2 × ld 8
Advanced Memory Access / Store Forwarding: Unaligned Cases
Note that unaligned store forward does not occur when the load crosses a cache line boundary
[Diagram: each unaligned store drawn above the loads that fit inside it; the legend distinguishes "Store forwarded to load" from "No forwarding"]
• store 64 bit: 1 × load 64 bit, 2 × load 32 bit (first marked ‡), 4 × load 16 (first marked ‡), 8 × ld 8
• store 32 bit: 1 × load 32 bit (marked ‡), 2 × load 16 (first marked ‡), 4 × ld 8
• store 16: 1 × load 16 (marked ‡), 2 × ld 8
‡: No forwarding if the load crosses a cache line boundary
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
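The forwarding diagrams all show a load that fits inside one earlier store. The sketch below contrasts that shape with a load that spans two narrower stores, which is not among the cases drawn. It is an illustration only: the function names are hypothetical, and an optimizing compiler may keep these values in registers and never issue the memory operations at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 32-bit store followed by a 32-bit load of the same bytes:
   the load is fully contained in a single store. */
static uint32_t reload_same_size(uint32_t v)
{
    unsigned char buf[4];
    uint32_t out;
    memcpy(buf, &v, sizeof v);      /* store 32 bit */
    memcpy(&out, buf, sizeof out);  /* load 32 bit  */
    return out;
}

/* Two 16-bit stores followed by one 32-bit load: no single store
   covers the load, so it matches none of the cases in the diagrams. */
static uint32_t reload_wider(uint16_t first, uint16_t second)
{
    unsigned char buf[4];
    uint32_t out;
    assert(sizeof first == 2 && sizeof second == 2);
    memcpy(buf, &first, 2);         /* store 16 */
    memcpy(buf + 2, &second, 2);    /* store 16 */
    memcpy(&out, buf, sizeof out);  /* load 32 bit */
    return out;
}
```

Where the second shape shows up in a hot path, writing the value with one store of the width that is later read back keeps the access inside the cases the slides list.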
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Optimizing for Instruction Fetch and PreDecode
Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
• Operand Size Override (66H)
• Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm – increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
• The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
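The cycle figures quoted above can be written down as a small lookup. The sketch uses hypothetical names and encodes only the three numbers from the slide. As a concrete case, `mov ax, 0x1234` encodes as `66 B8 34 12`: the 66H prefix shrinks the immediate from four bytes to two, so the instruction length changes, whereas `mov eax, 0x12345678` (`B8 78 56 34 12`) needs no prefix.

```c
#include <assert.h>

/* The three situations the slide distinguishes. */
enum lcp_case {
    NO_LCP,           /* no length-changing prefix                 */
    LCP,              /* 66H/67H on an instruction with imm/offset */
    LCP_SPANNING_16B  /* same, and the instruction crosses a
                         16-byte boundary                          */
};

/* Length-decode cycles for one instruction, per the slide. */
static int length_decode_cycles(enum lcp_case c)
{
    switch (c) {
    case NO_LCP:           return 1;
    case LCP:              return 6;
    case LCP_SPANNING_16B: return 11;
    }
    assert(!"unknown lcp_case");
    return -1;
}
```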
Optimizing for Instruction Queue
Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
• Maximum of 18 instructions in up to four 16-byte packets
• No RET instructions (hence, little practical use for CALLs)
• Up to four taken branches allowed
• Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade-off LSD with conventional loop unrolling
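The LSD requirements listed above can be expressed as a checklist. The sketch below does this in C; the struct and function names are hypothetical, and the thresholds are exactly the ones on the slide, not measured values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a loop body. */
struct loop_shape {
    int  instructions;    /* instructions in the loop body       */
    int  fetch_packets;   /* 16-byte packets the body occupies   */
    int  taken_branches;  /* taken branches per iteration        */
    bool contains_ret;    /* true if the body has a RET          */
    long iterations;      /* expected trip count                 */
};

/* True if the loop meets every structural requirement on the slide. */
static bool lsd_eligible(const struct loop_shape *l)
{
    assert(l != NULL);
    return l->instructions <= 18 &&
           l->fetch_packets <= 4 &&
           l->taken_branches <= 4 &&
           !l->contains_ret;
}

/* True if the loop is eligible and runs the 70+ iterations at which
   the slide says the LSD is most effective. */
static bool lsd_worthwhile(const struct loop_shape *l)
{
    return lsd_eligible(l) && l->iterations >= 70;
}
```

A check like this makes the trade-off with unrolling visible: unrolling a loop past 18 instructions or four fetch packets takes it out of the eligible set.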
Optimizing for Decode
Decoder issues up to 4 uOps for renaming/allocation per clock
• This creates a trade-off between fewer complex (multi-uOp) instructions and more simple (single-uOp) instructions
• For example, a single four uOp instruction is all that can be renamed/allocated in a single clock
• In some cases, multiple simple instructions may be a better choice than a single complex instruction
• Single uOp instructions allow more decoder flexibility
• For example, 4-1-1-1 can be decoded in one clock
• However, 2-2-2-1 takes three clocks to decode
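The 4-1-1-1 and 2-2-2-1 examples follow from a simple rule: one decoder handles instructions of up to four uOps and the other three handle single-uOp instructions only. The sketch below models just that rule, under the assumption that instructions are decoded strictly in order. It ignores the rename/allocate limit and everything else in the pipeline, and `decode_clocks` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Counts decode clocks for a sequence of instructions, where uops[i]
   is the number of uOps instruction i decodes into (1..4). */
static int decode_clocks(const int *uops, size_t n)
{
    int clocks = 0;
    size_t i = 0;
    assert(uops != NULL || n == 0);
    while (i < n) {
        /* Decoder 0 takes the next instruction, up to four uOps. */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;
        /* Decoders 1..3 each take one following single-uOp
           instruction; a multi-uOp instruction ends the clock. */
        for (int d = 1; d <= 3 && i < n && uops[i] == 1; d++)
            i++;
        clocks++;
    }
    return clocks;
}
```

With this model the sequence 4-1-1-1 decodes in one clock and 2-2-2-1 in three, matching the slide.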
Optimizing for Execution
Up to six uOps can be dispatched per clock
• “Store Data” and “Store Address” dispatch ports are combined on the block diagram
Up to four results can be written back per clock
Single clock latency operations are best
• Differing latency operations can create writeback conflicts
• Separate multiple-clock uOps with several single uOp instructions
• Typical instructions here: ADC/SBB, RMW, CMOVcc
• In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)
• For example, MOVAPS over MOVAPD, XORPS over XORPD
Optimizing for Execution (cont.)
Bypass register “access” preferred to register reads
Partial register accesses often lead to stalls
• Register size access that ‘conflicts’ with recent previous register write
• Partial XMM updates subject to dependency delays
• Partial flag stall can occur, too (much higher cost)
• Use TEST instruction between shift and conditional to prevent it
• Common zeroing instructions (e.g., XOR reg,reg) don’t stall
Avoid bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
Vectorization: careful packing/unpacking sequence
• Use MXCSR’s FZ and DAZ controls as appropriate
Optimizing for Memory
Software prefetch instructions
• Can reach beyond a page boundary (including page walk)
• Prefetches only when it completes without an exception
General techniques to help these prefetchers
• Organize data in consecutive lines
• In general, increasing addresses are more easily prefetched
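As a sketch of a software prefetch over consecutive, ascending data, the loop below issues a prefetch a fixed distance ahead of the element it is summing. It assumes GCC or Clang, whose `__builtin_prefetch` builtin is used here; the distance `PREFETCH_AHEAD` is an arbitrary illustrative value, not a tuned one, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative prefetch distance in elements (an assumption). */
enum { PREFETCH_AHEAD = 16 };

/* Sums the array in ascending order, hinting the element that will
   be needed PREFETCH_AHEAD iterations later. */
static long sum_with_prefetch(const int *data, size_t n)
{
    assert(data != NULL);
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3); /* read, keep in cache */
        s += data[i];
    }
    return s;
}
```

For a plain ascending walk like this one the hardware prefetchers may already cover the pattern, so the explicit hint is mainly of interest for access patterns they do not recognize.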
Summary
What has been covered
• Notable features of Core® Micro-architecture
• Wide Dynamic Execution
• Advanced Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Power Efficient Support
• Core® Micro-architecture components
• Front End
• OOO execution core
• Memory sub-system
Platform
Intel provides most of the silicon on any computer
Classical platform partition
• CPU – Computation
• MCH – high speed IO
• ICH – low speed IO
Graphics speed and memory latencies will require a different partition
This presentation focuses on the core microarchitecture
[Diagram: platform block diagram with CPU (Core, Core, LLC), FSB, MCH, DMI, ICH and the attached interfaces: MEM DDR, Graphics, PEG, PCIe, Display, Analog, TV out, ME, HD video, PCI (IO), SATA, USB, KBRD, Wireless, Legacy & Debug I/O, others]
Intel® 64 = Extending IA-32 to 64 Bit
Added to Intel Xeon™ and Pentium® 4 Processor in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture
With 64-Bit Extension Technology:
• Additional Registers: 8 SSE & 8 General Purpose
• Double Precision (64-bit) Integer Support
• Extended Memory Addressability: 64-Bit Pointers, Registers
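Intel® 64 - New Modes of Operation
Legacy Mode (IA32 Mode)
• OS required: legacy 32-bit or 16-bit OS
• Compile required: No
• Defaults: address size 32 (16), operand size 32 (16)
• New features: 64-bit IP No, RIP-relative No, new registers No
• GPR width: 32
Long Mode, Compatibility Mode
• OS required: new 64-bit OS
• Compile required: No
• Defaults: address size 32 (16), operand size 32 (16)
• New features: 64-bit IP Yes, RIP-relative No, new registers No
• GPR width: 32
Long Mode, 64-bit Mode
• OS required: new 64-bit OS
• Compile required: Yes
• Defaults: address size 64, operand size 32
• New features: 64-bit IP Yes, RIP-relative Yes, new registers Yes
• GPR width: 64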
Registers : Extensions and Additions
[Diagram: register file with Intel® 64]
• General purpose (bits 63..0): RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP extend EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP (bits 31..0) with bits 63..32; R8 to R15 are added
• Instruction pointer: RIP extends EIP
• SSE (bits 127..0): XMM0 to XMM15
• X87/MMX (bits 79..0)
Registers : Availability in different modes
64-bit Mode of Operation
Default data size is 32 bits
• Override to 64 bits using new REX prefix
All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
REX prefixes
• A family of 16 prefixes, encoded 0x40-0x4F
• Allows the use of general purpose registers as 64-bit registers
• Allows the use of new registers (like r8-r15)
Instructions that set a 32-bit register automatically zero-extend the upper 32 bits
REX Prefix
A new instruction-prefix byte used in 64-bit mode
• Specify the new GPRs and SSE registers
• Specify a 64-bit operand size.
• Specify extended control registers (used by system software)
An instruction can have only one REX prefix, which, if used, must immediately precede the opcode or the two-byte opcode escape prefix.
The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
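A REX prefix is a single byte in the range 0x40 to 0x4F, so its low four bits carry the information. The sketch below splits such a byte into its four flag bits. The W/R/X/B bit layout comes from the architecture manuals rather than from these slides, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The four flag bits carried in the low nibble of a REX prefix. */
struct rex_fields {
    bool w;  /* 1 = 64-bit operand size                  */
    bool r;  /* extends the ModRM reg field              */
    bool x;  /* extends the SIB index field              */
    bool b;  /* extends ModRM r/m, SIB base or opcode reg */
};

/* True if the byte lies in the REX range 0x40..0x4F. */
static bool is_rex(uint8_t byte)
{
    return (byte & 0xF0) == 0x40;
}

/* Splits a REX prefix byte into its flag bits. */
static struct rex_fields decode_rex(uint8_t byte)
{
    assert(is_rex(byte));
    struct rex_fields f = {
        .w = (byte >> 3) & 1,
        .r = (byte >> 2) & 1,
        .x = (byte >> 1) & 1,
        .b = byte & 1
    };
    return f;
}
```

For example, 0x48 sets only W (a 64-bit operand size with the original eight registers), while 0x41 sets only B (one of r8-r15 in the r/m or base position).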
Physical and Linear Addressing
Linear Addressing
• Initial Intel® 64 implementations support 48 bits of virtual addressing.
• Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.
Physical Addressing
• Initial Netburst™ Intel® 64 implementations support 36 bits; today all current processors support at least 40 bits
• Entries in page tables expanded for up to 52 bits of physical address.
Intel® 64 - Large Memory Considerations
Canonical addressing for 64 bit addresses
• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits
• Canonical address definition: an address that has bits 63 through 47 set to either all ones or all zeros
• Canonical addresses are a requirement
• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
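The canonical-form rule for 48-bit virtual addressing reduces to a check on the top 17 bits. The sketch below implements that check and a helper that sign-extends from bit 47; both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if bits 63..47 of addr are all zeros or all ones. */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;           /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;
}

/* Builds a canonical address from the low 48 bits by copying
   bit 47 into bits 63..48. */
static uint64_t sign_extend_48(uint64_t addr)
{
    uint64_t low = addr & 0x0000FFFFFFFFFFFFULL;
    uint64_t result = (low & 0x0000800000000000ULL)
                          ? (low | 0xFFFF000000000000ULL)
                          : low;
    assert(is_canonical_48(result));
    return result;
}
```

The highest canonical address in the lower half is 0x00007FFFFFFFFFFF and the lowest in the upper half is 0xFFFF800000000000; everything between them is non-canonical.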
Introducing SIMD: Single Instruction Multiple Data
Scalar processing
• traditional mode
• one operation produces one result
• [Diagram: operands X and Y, result X + Y]
SIMD processing
• with SSE / SSE2
• one operation produces multiple results
• [Diagram: operands [x3, x2, x1, x0] and [y3, y2, y1, y0], result [x3+y3, x2+y2, x1+y1, x0+y0]]
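The scalar and SIMD pictures can be mirrored in portable C. This sketch models a four-lane packed add with an ordinary loop; it does not use SSE intrinsics, so whether the compiler maps it onto packed instructions is left open. The function names are hypothetical.

```c
#include <assert.h>

/* One "packed" step: four independent additions,
   out[k] = x[k] + y[k] for lanes 0..3. */
static void add_packed4(const float *x, const float *y, float *out)
{
    for (int lane = 0; lane < 4; lane++)
        out[lane] = x[lane] + y[lane];
}

/* Adds two arrays four lanes at a time. The length must be a
   multiple of four in this simplified model. */
static void add_arrays(const float *x, const float *y, float *out, int n)
{
    assert(n >= 0 && n % 4 == 0);
    for (int i = 0; i < n; i += 4)
        add_packed4(x + i, y + i, out + i);
}
```

In the scalar picture each addition is its own operation; here one call to `add_packed4` stands in for the single operation that yields four results.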
X86 Register Sets
SSE Registers (128 bit, xmm0 … xmm7), introduced first in Pentium® 3
• Eight 128-bit registers
• Hold data only:
  • 4 x single FP numbers
  • 2 x double FP numbers
  • 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology
MMX™ Technology / IA-FP Registers (80/64 bit, st0 … st7 / mm0 … mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability
IA-INT Registers (32 bit, eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs
New Instructions Added to Intel® Processors
Beginning in 2008: ~50 new instructions in 13 groups
All function in 32-bit and 64-bit modes
Improvements in Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance
[Chart: Instruction Set Extensions, number of new instructions by introduction date and process]
• Jan-97: MMX™, 56 instructions, 350 nm
• Feb-99: Streaming SIMD Extensions (SSE), 70 instructions, 250 nm
• Dec-00: Streaming SIMD Extensions 2 (SSE2), 144 instructions, 180 nm
• Feb-04: Streaming SIMD Extensions 3 (SSE3), 13 instructions, 90 nm
• Jul-06: Supplemental SSE3 (SSSE3), 32 instructions, 65 nm
• 2008+: Future Intel instruction set extensions (SSE-4), ~50 instructions, 45 nm
SSE and SSE-2 Data Types
SSE
• 4x floats
SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer
2001 PTE Engineering Enabling Conference
SSE Instruction Set Extensions
Introduced by Pentium® 3 in 1999; now frequently called SSE-1
Only new data type supported: 4 x 32-bit (single precision) floating point data
Some 70 instructions
• Arithmetic, compare, convert operations on SSE SP FP data
• PACKED, UNPACKED
• Data load/store
• Prefetch
• Extension of MMX
• Streaming Store (store without using cache in between)
• …
SSE Sample: Branch Removal
R = (A < B) ? C : D   // remember: everything packed
A:      0.0     0.0     -3.0    3.0
B:      0.0     1.0     -5.0    5.0
cmplt:  00000   11111   00000   11111
and with C (c3 c2 c1 c0):
        00000   c2      00000   c0
nand with D (d3 d2 d1 d0):
        d3      00000   d1      00000
or:
        d3      c2      d1      c0
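The cmplt / and / nand / or sequence on this slide maps onto the SSE intrinsics in xmmintrin.h. The following is a minimal sketch, not taken from the course material: the function names select_lt_ps and branch_removal_demo and the C and D values are invented for illustration, while A and B are the values shown on the slide.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Packed select without a branch: in every lane, r = (a < b) ? c : d. */
__m128 select_lt_ps(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask   = _mm_cmplt_ps(a, b);     /* cmplt: all ones where a < b, else all zeros */
    __m128 from_c = _mm_and_ps(mask, c);    /* and:   keeps c where the mask is set        */
    __m128 from_d = _mm_andnot_ps(mask, d); /* nand:  keeps d where the mask is clear      */
    return _mm_or_ps(from_c, from_d);       /* or:    merges the two partial results       */
}

/* Runs the slide's A and B through the select; out[0] is the lowest lane.
 * The C and D values are arbitrary illustration data. */
void branch_removal_demo(float out[4])
{
    __m128 a = _mm_set_ps(0.0f, 0.0f, -3.0f, 3.0f);     /* a3 a2 a1 a0 */
    __m128 b = _mm_set_ps(0.0f, 1.0f, -5.0f, 5.0f);     /* b3 b2 b1 b0 */
    __m128 c = _mm_set_ps(13.0f, 12.0f, 11.0f, 10.0f);  /* c3 c2 c1 c0 */
    __m128 d = _mm_set_ps(23.0f, 22.0f, 21.0f, 20.0f);  /* d3 d2 d1 d0 */
    _mm_storeu_ps(out, select_lt_ps(a, b, c, d));       /* d3 c2 d1 c0 */
}
```

Calling branch_removal_demo fills out[] with c0, d1, c2, d3 from lowest to highest lane, which matches the last row of the slide.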
SSE-2 Instruction Set Extensions
Introduced by Intel® Pentium® 4 processor in 2000
Some 140 new instructions
Added double precision floating point data (2 x 64-bit) and all related instructions including conversion
Again some extensions to MMX
Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations
SIMD Single vs. SIMD Double
4 x Single Precision: SSE-1
• SIMD SP FP Operand = 4 Elements: X3 X2 X1 X0 (bits 127..0)
• Element = SP FP Number: S (bit 31), Exponent (bits 30..23), Significand (bits 22..0)
2 x Double Precision: SSE-2
• SIMD DP FP Operand = 2 Elements: X1 X0 (bits 127..0)
• Element = DP FP Number: S (bit 63), Exponent (bits 62..52), Significand (bits 51..0)
Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion
• SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared
x:   x1 | x0
ix:  00000 | 00000 | (int)x1 | (int)x0
__m128d x;
__m128i ix;
ix = _mm_cvtpd_epi32(x);
• SIMD Int → SIMD Double: conversion from two lower ints
ix:  ???? | ???? | ix1 | ix0
x:   (double)x1 | (double)x0
x = _mm_cvtepi32_pd(ix);
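A small round trip through both conversions can be sketched with the SSE2 intrinsics from emmintrin.h. The wrapper name double_int_roundtrip and its array parameters are assumptions made for this illustration. Note that _mm_cvtpd_epi32 rounds according to the current MXCSR rounding mode, whereas a C cast truncates; _mm_cvttpd_epi32 is the truncating variant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* in[0] is x0 (low element), in[1] is x1.
 * ints receives all four 32-bit lanes of the integer register, lowest first.
 * back receives the two doubles converted back from the two lower ints. */
void double_int_roundtrip(const double in[2], int ints[4], double back[2])
{
    __m128d x  = _mm_loadu_pd(in);           /* x1 | x0                        */
    __m128i ix = _mm_cvtpd_epi32(x);         /* 0 | 0 | int(x1) | int(x0)      */
    _mm_storeu_si128((__m128i *)ints, ix);
    _mm_storeu_pd(back, _mm_cvtepi32_pd(ix)); /* uses only the two lower ints  */
}
```

With inputs that are already whole numbers, for example {2.0, -7.0}, the two upper lanes come back as zero and the round trip reproduces the inputs exactly.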
SSE3: No new Data Types but new Instructions
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread Synchronization: MONITOR, MWAIT
• Video encoding: LDDQU
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• FP to integer conversions: FISTTP
* Also benefits Complex and Vectorization
Streaming SIMD Extensions 3
13 new instructions
Three have limited use for application performance improvement
• FISTTP - X87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
• Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
The other ten have some potential for specific application domains
SSE-3 Sample Complex Arithmetic: ADDSUBPS
ADDSUBPS OperandA OperandB
• OperandA (xmm register; 4 data elements)
• a3, a2, a1, a0
• OperandB (xmm reg. or memory addr; 4 data elements)
• b3, b2, b1, b0
• Result (Stored in OperandA)
• a3+b3, a2-b2, a1+b1, a0-b0
__m128 _mm_addsub_ps(__m128 a, __m128 b)
OperandA:   a3      a2      a1      a0
OperandB:   b3      b2      b1      b0
Operation:  Add     Sub     Add     Sub
Result:     a3+b3   a2-b2   a1+b1   a0-b0
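To see why this add/subtract pattern helps complex arithmetic, the operation can be modelled in plain C. This is a scalar sketch of the semantics listed above, not the intrinsic itself; the names addsub_ps_model and complex_mul2_model are invented here, and the complex numbers are assumed to be stored as (re, im) pairs with the real part in the even element.

```c
/* Scalar model of ADDSUBPS: element 0 is the lowest one.
 * Even elements are subtracted, odd elements are added. */
void addsub_ps_model(const float a[4], const float b[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* Multiplies two pairs of complex numbers stored as re0, im0, re1, im1.
 * (xr + i*xi) * (yr + i*yi) = (xr*yr - xi*yi) + i*(xi*yr + xr*yi) */
void complex_mul2_model(const float x[4], const float y[4], float out[4])
{
    float t1[4], t2[4];
    for (int i = 0; i < 4; i += 2) {
        t1[i]     = x[i]     * y[i];      /* xr * yr */
        t1[i + 1] = x[i + 1] * y[i];      /* xi * yr */
        t2[i]     = x[i + 1] * y[i + 1];  /* xi * yi */
        t2[i + 1] = x[i]     * y[i + 1];  /* xr * yi */
    }
    addsub_ps_model(t1, t2, out);         /* subtract in re, add in im */
}
```

The single add/subtract step at the end replaces a separate subtract for the real parts and add for the imaginary parts.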
Sample SSSE-3 Inst.: Byte Permute
PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes force-byte-to-zero flag (bit 7)
dest (before):  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
src (control):  0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (after):   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
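The permutation rule can be written down as a short scalar model. This is an illustrative sketch of the behaviour described in the bullets, not how the instruction is implemented; the function name pshufb_model and its width parameter (8 for the mm form, 16 for the xmm form) are assumptions of this example.

```c
#include <stdint.h>

/* Scalar model of PSHUFB. Index 0 is the lowest byte.
 * dst is permuted in place; src supplies one control byte per destination byte. */
void pshufb_model(uint8_t *dst, const uint8_t *src, int width)
{
    uint8_t old[16];
    for (int i = 0; i < width; ++i)
        old[i] = dst[i];                         /* snapshot of the original destination */
    for (int i = 0; i < width; ++i) {
        if (src[i] & 0x80)
            dst[i] = 0;                          /* bit 7 set: force byte to zero */
        else
            dst[i] = old[src[i] & (width - 1)];  /* low bits select the origin byte */
    }
}
```

Fed with the 64-bit example from the slide (written lowest byte first), the model reproduces the slide's result row.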
Ways to SSE/SIMD programming
Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling) - discouraged: don’t do it!
• E.g.: How do you exploit the benefits of now having 16 instead of 8 SSE registers for Intel® 64 without maintaining two versions?
Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling etc.
Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes
Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases - easy and efficient
Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)
Compiler Based Vectorization: Processor Specific
Linux* switch: Generate Code and Optimize for
• -xP, -axP: Intel® processors with SSE3 capability including Pentium® 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3
• -xN, -axN: Pentium® 4 processors in 32 bit mode, including code generation for MMX, SSE and SSE2; deprecated switch: use -xW instead
• -xK, -axK: Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE
• -xW, -axW: Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2
• -xT, -axT: Intel® processors with MNI capability, i.e. Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI
• -xB, -axB: Pentium® M processors, including code generation for MMX, SSE and SSE-2
Intel® Core® Micro-architecture Notable Features (cont.)
New Instructions
Instruction name: Description
• PALIGNR mm, mm/m64, imm8 / PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 (8 in the 64 bit case) bytes from the pair [dst, src] and store them to the dst register.
• PSHUFB mm, mm/m64 / PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.
• PMULHRSW mm, mm/m64 / PMULHRSW xmm, xmm/m128: Signed 16-bit multiply, return high bits.
• PMADDUBSW mm, mm/m64 / PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
• phsubw/d/sw mm, mm/m64 / phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.
• phaddw/d/sw mm, mm/m64 / phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.
• pabsb/w/d mm, mm/m64 / pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.
• psignb/w/d mm, mm/m64 / psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.
Dependencies and Bypasses
“Read-after-Write” Dependency - 1 clock stall assuming register file can be written-through
add eax, ecx → eax        F D E W
sub ebx, eax → ebx          F D D E W
“E to D” Bypass - saves the clock penalty
add eax, ecx → eax        F D E W
sub ebx, eax → ebx          F D E W
Long Latency operations
load [ecx+edi] → eax      F D E E E W
add ebx, eax → ebx          F D D D E W
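The three timing diagrams follow one rule, which can be sketched as a tiny timing model. It assumes the simple four-stage F D E W pipeline of this slide, with the producer fetched in cycle 1 and the dependent instruction fetched in cycle 2; the function names are invented for this illustration.

```c
/* Cycle in which the dependent instruction executes.
 * producer_exec_clocks is the number of E cycles the producer needs. */
int consumer_execute_cycle(int producer_exec_clocks, int has_bypass)
{
    int producer_last_e = 2 + producer_exec_clocks; /* producer: F=1, D=2, E=3..  */
    int producer_w      = producer_last_e + 1;
    int unstalled_e     = 4;                        /* consumer: F=2, D=3, E=4    */
    /* With the bypass the result is usable right after the producer's last E.
     * Without it the consumer's final D overlaps the producer's W (write-through). */
    int earliest_e = has_bypass ? producer_last_e + 1 : producer_w + 1;
    return earliest_e > unstalled_e ? earliest_e : unstalled_e;
}

/* Number of extra D cycles the dependent instruction spends waiting. */
int raw_stall_clocks(int producer_exec_clocks, int has_bypass)
{
    return consumer_execute_cycle(producer_exec_clocks, has_bypass) - 4;
}
```

For the slide's cases this gives 1 stall clock for add/sub without a bypass, 0 with the bypass, and 2 for the three-cycle load even with the bypass.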
Fighting Stalls: Branch Handling
Given the code:
for (i=100, a=0; i>0; i--) a+=B[i];
Compiler would generate
// eax initialized with zero, edi initialized with 100
loop: load B[edi] → ebx     // read B[i] from memory
      add eax, ebx → eax    // a+=B[i]
      add edi,-1 → edi      // i-=1
      jnz edi, loop
      store eax → a         // store result
Fighting Stalls: Branch Handling (cont.)
load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F D E W
store eax → a                 F D E W
xxx                             F D E W
load B[edi] → ebx                 F D E W
Only after the branch Execute stage do we know that the next fetch was wrong
• Need to flush the pipe
• IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)
• ‘Pipe break’ penalty = 2 clocks
• Adding a stage?: IPC = 0.57, ~14% slower!
Prolonging the pipeline achieves higher frequencies, but the pipe break penalty increases. MUST solve the pipe break penalty problem!
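The IPC figures on this slide follow from dividing the instruction count by the instruction count plus the pipe break penalty. A one-function sketch of that arithmetic (the function name is made up for this illustration, and it assumes exactly one pipe break per loop iteration):

```c
/* IPC of a loop body when every iteration ends in one pipe break. */
double ipc_with_pipe_break(int instructions, int penalty_clocks)
{
    return (double)instructions / (double)(instructions + penalty_clocks);
}
```

With 4 instructions, a 2-clock penalty gives 4/6 and a 3-clock penalty (one more stage before Execute) gives 4/7. The ratio of the two is 6/7, which is the roughly 14% slowdown quoted on the slide.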
Fighting Stalls: Branch Handling (cont.)
H/W can ‘learn’ about SW behavior
• Same branch goes same direction in most cases
• Learn branch address and target
• Branch Target Buffer (BTB)
• Predict based on branch history, surrounding branch behavior, loop behavior.
• We are at ~95% correct prediction.
• Looks in BTB while fetching instruction
• Lee&Smith or Yeh&Patt algorithms
New (and correct) pointer calculated in Fetch stage of branch
load B[edi] → ebx     F D E W
add eax,ebx → eax       F D E W
add edi,-1 → edi          F D E W
jnz edi, loop               F/P D E W
load B[edi] → ebx             F D E W
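How a predictor can "learn" that the loop branch is usually taken can be illustrated with a two-bit saturating counter, the classic textbook scheme. This sketch is not the predictor used in the Core microarchitecture; the type and function names, and the initial counter state, are assumptions made for this example.

```c
/* Two-bit saturating counter: 0,1 predict not taken; 2,3 predict taken. */
typedef struct { int counter; } two_bit_predictor;

int predict_taken(const two_bit_predictor *p)
{
    return p->counter >= 2;
}

void train_predictor(two_bit_predictor *p, int taken)
{
    if (taken) {
        if (p->counter < 3) p->counter++;
    } else {
        if (p->counter > 0) p->counter--;
    }
}

/* Counts mispredictions of the loop-closing jnz in
 * for (i = iterations; i > 0; i--): taken while i-1 != 0, then falls through. */
int count_loop_mispredictions(int iterations)
{
    two_bit_predictor p = { 1 };  /* assumed start state: weakly not taken */
    int misses = 0;
    for (int i = iterations; i > 0; --i) {
        int taken = (i - 1) != 0;
        if (predict_taken(&p) != taken)
            ++misses;
        train_predictor(&p, taken);
    }
    return misses;
}
```

For the 100-iteration loop of the earlier slide this model mispredicts only the first and the last execution of the branch.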
Advanced Pipeline Techniques
Limitations of the Typical Pipeline Scheme
• IPC is theoretically limited to 1
• Actual IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction) etc.
• Pipeline stages are frequently not balanced
• Cycle Time (Tc) is determined by the longest pipeline stage
Advanced Pipeline Techniques
• Super pipeline
• Super-scalar
Advanced Pipeline Techniques (cont.)
Super pipeline: shorter stages allow higher frequency
F1 F2 D1 D2 E1 E2 W1 W2
   F1 F2 D1 D2 E1 E2 W1 W2
      F1 F2 D1 D2 E1 E2 W1 W2
Super-scalar: perform more in a single cycle
F D E W
F D E W
  F D E W
  F D E W
Fighting stalls: Out Of Order Execution (OoO)
Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)
1. Instruction Fetch and Decode.
2. Instruction queue @ Reservation Station.
3. Instruction
• waits in the queue until all input operands are available
• may leave the queue before earlier (older) instructions.
4. Instruction Execution
5. Results are queued.
6. Instruction Reorder and Writeback.
Avoids the stall that occurs at this stage in an in-order processor
Fighting stalls: Register Renaming
Creates new opportunities for OOO execution
• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies = hazards.
Architectural vs physical registers dispatch
Before renaming:
MULTD F4,F2,F2 reads from F2
ADDD F2,F0,F6 writes to F2
After renaming:
MULTD F4,F2,F2
ADDD F8,F0,F6 (assume F8 is unused)
1. mov eax, [m1]
2. add eax, 2
3. mov [m2], eax
4. mov eax, [m3]
5. add eax, 4
6. mov [m4], eax
4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!
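The effect of renaming on the six-instruction example can be sketched with a small mapping table. This is a simplified illustration: the type and function names and the register counts are assumptions, and physical registers are never recycled here, which real hardware does at retirement.

```c
enum { ARCH_REGS = 8, PHYS_REGS = 32 };

typedef struct {
    int map[ARCH_REGS]; /* architectural register -> current physical register */
    int next_free;      /* next unused physical register (no recycling here)   */
} rename_table;

void rename_init(rename_table *t)
{
    for (int i = 0; i < ARCH_REGS; ++i)
        t->map[i] = i;
    t->next_free = ARCH_REGS;
}

/* A source operand reads whatever physical register currently backs it. */
int rename_source(const rename_table *t, int arch_reg)
{
    return t->map[arch_reg];
}

/* Every write gets a fresh physical register; returns -1 when none is left. */
int rename_dest(rename_table *t, int arch_reg)
{
    if (t->next_free >= PHYS_REGS)
        return -1;
    t->map[arch_reg] = t->next_free++;
    return t->map[arch_reg];
}
```

Walking instructions 1 to 4 through the table shows that the eax written by instruction 4 lands in a different physical register than the eax used by instructions 1 to 3, so the two groups no longer conflict.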
Fighting Stalls: Re-Order Buffer (ROB)
Mechanism for renaming and retirement
Table contains in-order instructions
• Instructions are entered in order
• Registers renamed by the entry number
• Once assigned: execution order unimportant
• After execution: entries marked
• An executed entry can be “retired” once all prior instructions have retired. Retiring an instruction means:
• Update “real registers” with value of renamed regs
• Update memory
• Leave the ROB
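The in-order entry, out-of-order completion and in-order retirement rules can be sketched as a small circular buffer. This is a toy model under stated assumptions: the names and the buffer size are invented, and register or memory updates at retirement are left out.

```c
enum { ROB_SIZE = 8 };

typedef struct {
    int done[ROB_SIZE]; /* 1 once the entry has executed        */
    int head;           /* oldest entry still in the buffer     */
    int tail;           /* next free slot                       */
    int count;          /* number of occupied entries           */
} reorder_buffer;

void rob_init(reorder_buffer *r)
{
    for (int i = 0; i < ROB_SIZE; ++i)
        r->done[i] = 0;
    r->head = r->tail = r->count = 0;
}

/* Instructions enter in program order; returns the entry number or -1 if full. */
int rob_alloc(reorder_buffer *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int entry = r->tail;
    r->done[entry] = 0;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return entry;
}

/* Execution may finish in any order. */
void rob_mark_done(reorder_buffer *r, int entry)
{
    r->done[entry] = 1;
}

/* Retire from the head only, so retirement stays in program order.
 * Returns how many entries left the buffer. */
int rob_retire(reorder_buffer *r)
{
    int retired = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->done[r->head] = 0;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```

If the youngest of three entries finishes first, nothing retires until the oldest one is done; once the middle entry finishes as well, it and the youngest retire together.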
Fighting Stalls: Reservation Station(s)
Pool(s) of all “not yet executed” instructions
Maintains operand status “ready / not-ready”
Each cycle, executed instructions make more operands “ready”
Instructions whose operands are all “ready” can be “dispatched” for execution
Dispatcher chooses which of the “ready” instructions will be executed next
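The ready / not-ready bookkeeping can be sketched as follows. This is a simplified model: the names, the pool size, the two-operand limit and the "lowest free slot first" dispatch policy are all assumptions made for this illustration.

```c
enum { RS_SIZE = 8, NO_TAG = -1 };

typedef struct {
    int valid;       /* slot holds a not-yet-executed instruction               */
    int src_tag[2];  /* producer tag each operand waits for; NO_TAG means ready */
    int dest_tag;    /* tag this instruction will produce                       */
} rs_entry;

typedef struct { rs_entry slot[RS_SIZE]; } reservation_station;

void rs_init(reservation_station *rs)
{
    for (int i = 0; i < RS_SIZE; ++i)
        rs->slot[i].valid = 0;
}

/* Returns the slot used, or -1 if the pool is full. */
int rs_insert(reservation_station *rs, int src0_tag, int src1_tag, int dest_tag)
{
    for (int i = 0; i < RS_SIZE; ++i) {
        if (!rs->slot[i].valid) {
            rs->slot[i].valid = 1;
            rs->slot[i].src_tag[0] = src0_tag;
            rs->slot[i].src_tag[1] = src1_tag;
            rs->slot[i].dest_tag = dest_tag;
            return i;
        }
    }
    return -1;
}

/* An executed instruction broadcasts its tag; waiting operands become ready. */
void rs_broadcast(reservation_station *rs, int tag)
{
    for (int i = 0; i < RS_SIZE; ++i)
        for (int k = 0; k < 2; ++k)
            if (rs->slot[i].valid && rs->slot[i].src_tag[k] == tag)
                rs->slot[i].src_tag[k] = NO_TAG;
}

/* Dispatches the first slot whose operands are all ready.
 * Returns its dest_tag, or NO_TAG if nothing is ready. */
int rs_dispatch(reservation_station *rs)
{
    for (int i = 0; i < RS_SIZE; ++i) {
        rs_entry *e = &rs->slot[i];
        if (e->valid && e->src_tag[0] == NO_TAG && e->src_tag[1] == NO_TAG) {
            e->valid = 0;
            return e->dest_tag;
        }
    }
    return NO_TAG;
}
```

An instruction that waits for tag 10 is skipped by the dispatcher while younger, ready instructions go ahead; it becomes dispatchable after tag 10 is broadcast.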
Fighting Stalls: Memory Order Buffer (MOB)
Idea - allow out-of-order execution among memory operations
Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
Structure similar in concept to ROB
Every access is allocated an entry
Address & data (for stores) are updated when known
Load is checked against all previous stores
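The check of a load against all previous stores can be sketched as below. This is a deliberately conservative model, assumed for illustration only: a load waits whenever an older store that could matter has an unknown address, all accesses are assumed to have the same size, and the names are invented. It does not model the speculative memory disambiguation of the Core microarchitecture.

```c
typedef struct {
    int      addr_known; /* 0 until the store address has been computed */
    unsigned addr;
    int      data;
} store_entry;

enum load_result { LOAD_FROM_MEMORY, LOAD_FORWARDED, LOAD_MUST_WAIT };

/* stores[0..n_older-1] are the stores older than the load, oldest first. */
enum load_result mob_check_load(const store_entry *stores, int n_older,
                                unsigned load_addr, int *value)
{
    /* Scan from the youngest older store to the oldest one. */
    for (int i = n_older - 1; i >= 0; --i) {
        if (!stores[i].addr_known)
            return LOAD_MUST_WAIT;        /* might alias: be conservative */
        if (stores[i].addr == load_addr) {
            *value = stores[i].data;      /* newest matching store wins   */
            return LOAD_FORWARDED;
        }
    }
    return LOAD_FROM_MEMORY;              /* no older store touches it    */
}
```

A load that matches the youngest older store can take its data directly; a load behind a store whose address is still unknown has to wait in this model.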
Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)
Many buses are sized for worst case data
(x86 instruction of 15 bytes)
(ALU can write-back 128 bits)
Improved Energy Efficiency
Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)
By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining C dynamic (dynamic capacitance) closer to that of thinner buses
Improved Energy Efficiency
Agenda
Introduction
Knowledge refreshment
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Intel® Core® Micro-architecture Overview
[Block diagram]
• System Bus, Bus Unit, 2nd Level Cache, 1st Level Cache (Data)
• Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit
• Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)
Intel® Core® Micro-architecture Drill-down
[Block diagram]
• icache, branch prediction unit, predecode, instruction queue, instruction decode, MS
• register alias table, ALLOC, Re-Order Buffer, Reservation Station
• integer / FP / SIMD (3x), load, store address, store data
• memory order buffer, data cache unit, page miss handler
Example Code to Be Used
…
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp EAX, 100000
jge label
…
Core® Micro-architecture Front End
Instruction preparation before execution
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
Instruction Fetch Unit
Prefetches instructions that are likely to be executed
Caches frequently-used instructions
Predecodes and Buffers instructions
Intel® Core™ Microarchitecture – Front End
Instruction Fetch Unit (cont.)
I-Cache (Instruction Cache)
• 32 KBytes / 8-way / 64-byte line
• 16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer)
• 128 4k pages, 8 2M pages
Instruction Prefetcher
• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
Instruction Pre-decoder
• Instruction Length Decode (predecode)
• Avoid Length Changing Prefixes (LCP); for example, avoid in a loop:
MOV dx, 1234h
• The REX (EM64T) prefix (4xH) is not an LCP
Instruction format: Instruction Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate
Instruction Queue
Buffer between instruction pre-decode unit and decoder
• up to six predecoded instructions written per cycle
• 18 Instructions contained in IQ
• up to 5 Instructions read from IQ
Potential Loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Instruction Decode
Decode the instructions into micro-ops
Ready for the execution in OOO core
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking
Instruction Decode / Decoders
Instructions converted to micro-ops (uops)
• 1-uop instructions include load+op, stores, indirect jump, RET...
4 decoders: 1 “large” and 3 “small”
• All decoders handle “simple” 1-uop instructions
• One large decoder handles instructions up to 4 uops
All decoders work in parallel
• Four(+) instructions / cycle
Micro-Sequencer takes over for long flows (the large decoder handles instructions containing 2~4 uops; the uCode ROM handles more complex ones)
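The consequence of having one large and three small decoders can be illustrated with a simplified counting model. The assumptions are loud ones: up to four instructions per cycle, only the first slot accepts an instruction of 2 to 4 uops, an instruction of more than 4 uops (micro-sequencer flow) is treated as taking a decode cycle on its own, and fetch limits and fusion are ignored. The function name is invented for this sketch.

```c
/* uops[i] is the number of micro-ops of instruction i, in program order.
 * Returns the number of decode cycles under the simplified model above. */
int decode_cycles(const int *uops, int n)
{
    int cycles = 0;
    int i = 0;
    while (i < n) {
        ++cycles;
        if (uops[i] > 4) {  /* long flow: stand-in for the micro-sequencer */
            ++i;
            continue;
        }
        ++i;                /* slot 0: large decoder, up to 4 uops */
        int slot = 1;
        while (i < n && slot < 4 && uops[i] == 1) {
            ++i;            /* slots 1..3: small decoders, 1 uop each */
            ++slot;
        }
    }
    return cycles;
}
```

In this model a 2-uop instruction decodes together with its neighbours only when it happens to fall into the first slot, so the same instructions in a different order can take a different number of decode cycles.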
Code Sequence in Front End
Instruction Queue (IQ):
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp EAX, 100000
jne label
• These instructions took more than one fetch, as they are 22 bytes
• IQ buffers them together
• All instructions are decodable by all decoders
• CMP and adjacent JCC are “fused” into a single uop
• Up to 5 instructions decoded per cycle
Decoders: Large (dec0), small (dec1), small (dec2), small (dec3)
Decoded uops:
load_add xmm0, xmm0, [EAX+16]
mulps xmm0, xmm0, xmm0
sta_std [EAX+240], xmm0
cmpjne EAX, 100000, label
Instruction Decode / Macro-Fusion
Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
[Figure: the fused uop cmpjae eax, [mem], label passes from the Scheduler to Execution, where Branch Eval sends flags and target to write back]
Instruction Decode / Macro-Fusion Absent
Read four instructions from the Instruction Queue
Each instruction gets decoded into separate uops
Enabling Example
for (int i=0; i<100000; i++) {
…
}
cmp eax, 100000
jge label
[Figure: the Instruction Queue holds addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0; cmp eax, 100000; jge label. Decoders dec0 to dec3 need two cycles (Cycle 1 and Cycle 2) to decode them.]
Instruction Decode / Macro-Fusion Present
Read five instructions from the Instruction Queue
Send the fusable pair to a single decoder
A single uop represents two instructions
Enabling Example
for (unsigned int i=0; i<100000; i++) {
…
}
cmp eax, 100000
jae label
[Figure: the Instruction Queue holds addps xmm0, [EAX+16]; mulps xmm0, xmm0; movps [EAX+240], xmm0; cmp eax, 100000; jae label. Decoders dec0 to dec3 decode all five in Cycle 1, with the compare and branch emitted as the single uop cmpjae eax, 100000, label.]
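The two enabling examples differ only in the type of the loop counter: the signed loop is shown with cmp/jge and the unsigned loop with cmp/jae, the pair that appears fused as cmpjae. A minimal C sketch of the same contrast follows. The function names are hypothetical, and the exact instructions a compiler emits for each loop depend on the compiler and its options, so the comments describe typical rather than guaranteed code generation.

```c
/* Signed counter: compilers typically end the loop with a signed
 * conditional jump (for example jl or jge) after the compare. */
long sum_signed(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unsigned counter: compilers typically end the loop with an unsigned
 * conditional jump (for example jb or jae), the form shown in the
 * macro-fusion example. */
long sum_unsigned(const int *a, unsigned int n)
{
    long s = 0;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions compute the same result; checking the generated assembly is the only way to confirm which compare and jump pair a given build actually contains.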
Instruction Decode / Macro-Fusion (cont.)
Benefits
• Reduces latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings
Enabling Greater Performance & Efficiency
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same macro-instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Instruction Decode / Micro-Fusion (cont.)
uops of a store “movps [EAX+240], xmm0”:
• sta eax+240 (store address)
• std xmm0, [eax+240] (store data)
Fused, the pair is carried as a single uop: st xmm0, [eax+240]
Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)
ESP is calculated by dedicated logic
• No explicit micro-ops updating ESP
• Micro-op savings
• Power savings
[Figure: Decoders 0 to N process PUSH EAX, PUSH EDX, POP EBX while the tracker maintains an ESP delta (ESPd) and recovery information]
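The idea of tracking ESP with dedicated logic can be sketched in C as follows. This is a conceptual model only: the type `sp_tracker`, the function names, the 4-byte operand size (32-bit mode), and the explicit `track_sync` step are assumptions for illustration, and the recovery information shown in the figure is not modeled.

```c
#include <stdint.h>

/* Hypothetical decoder-side state: the ESP value known to the execution
 * core, plus a delta (ESPd) accumulated by the tracker. */
typedef struct {
    uint32_t esp_base;  /* ESP value known to the out-of-order core */
    int32_t  delta;     /* ESPd: offset accumulated from PUSH/POP   */
} sp_tracker;

/* PUSH: ESP decreases by 4; returns the address the store uop uses.
 * No separate uop is needed to update ESP. */
uint32_t track_push(sp_tracker *t)
{
    t->delta -= 4;
    return t->esp_base + (uint32_t)t->delta;
}

/* POP: returns the address the load uop uses, then ESP increases by 4. */
uint32_t track_pop(sp_tracker *t)
{
    uint32_t addr = t->esp_base + (uint32_t)t->delta;
    t->delta += 4;
    return addr;
}

/* Synchronization: fold the delta into ESP when an instruction needs the
 * real ESP value (modeled here as one explicit update). */
uint32_t track_sync(sp_tracker *t)
{
    t->esp_base += (uint32_t)t->delta;
    t->delta = 0;
    return t->esp_base;
}
```

For the sequence PUSH EAX, PUSH EDX, POP EBX, the model hands out three stack addresses while ESP itself is only updated once, at the synchronization point.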
Core® Micro-architecture Front End
Instruction Fetch Unit
Instruction Queue
Instruction Decode Unit
Branch Prediction Unit
Branch Prediction Unit
Allows instructions to execute long before the branch outcome is decided
• Superset of Prescott / Pentium-M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine
[Figure: block diagram. Front End: Instruction Fetch Unit (icache, predecode), instruction queue, IQ/Decode (instruction decode, MS), BTBs/Branch Prediction (branch prediction unit). Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement). Caches: 1st Level Cache (Data), 2nd Level Cache.]
Branch Prediction Unit (cont.)
16-entry Return Stack Buffer (RSB)
Front end queuing of BPU lookups
Types of predictions
• Direct Calls and Jumps
• Indirect Calls and Jumps
• Conditional branches
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
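To see why a loop detector helps, the sketch below compares two textbook-style predictors on a loop-closing branch that is taken (trip - 1) times and then falls through. Neither function describes the actual hardware: the 2-bit saturating counter and the trip-count-learning detector are generic models chosen for illustration, and both function names are hypothetical.

```c
/* Mispredictions of a 2-bit saturating counter on a loop branch that is
 * taken (trip - 1) times and then not taken, repeated `runs` times. */
unsigned mispredicts_2bit(unsigned trip, unsigned runs)
{
    unsigned state = 2;            /* 0..3, predict taken when state >= 2 */
    unsigned miss = 0;

    for (unsigned r = 0; r < runs; r++) {
        for (unsigned i = 0; i < trip; i++) {
            int taken = (i + 1 < trip);
            int pred = (state >= 2);
            if (pred != taken)
                miss++;
            if (taken) {
                if (state < 3) state++;
            } else {
                if (state > 0) state--;
            }
        }
    }
    return miss;
}

/* Mispredictions of a simple loop detector that remembers how many taken
 * branches preceded the last loop exit and predicts the exit at that count. */
unsigned mispredicts_loop_detector(unsigned trip, unsigned runs)
{
    unsigned learned = 0;          /* taken count seen in the previous run */
    int valid = 0;                 /* set once one full run was observed   */
    unsigned miss = 0;

    for (unsigned r = 0; r < runs; r++) {
        unsigned count = 0;        /* taken branches so far in this run */
        for (unsigned i = 0; i < trip; i++) {
            int taken = (i + 1 < trip);
            int pred = valid ? (count != learned) : 1;
            if (pred != taken)
                miss++;
            if (taken) {
                count++;
            } else {
                learned = count;
                valid = 1;
            }
        }
    }
    return miss;
}
```

With a fixed trip count, the counter model mispredicts every loop exit, while the detector model mispredicts only the first one.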
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations
Core® Micro-architecture Execution Core
Accepts decoded uops, assigns resources, executes and retires uops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
[Figure: block diagram with the Execution Core highlighted. The register alias table and ALLOC feed the Re-Order Buffer and Reservation Station, which dispatch to integer/FP/SIMD (3x), load, store address, and store data.]
Execution Core Building Blocks
• Renamer
• RS
• ROB
• Execution Unit (ports 0, 1, 5)
  • Floating Point
  • Integer
  • SIMD Integer
  • SIMD/Integer MUL
• Memory Sub-system
  • Load: port 2
  • Store: ports 3, 4
Intel® Core™ Microarchitecture – Execution Core
Rename and Resources
4 uops renamed / retired per clock
• one taken branch, any # of untaken
• one FXCH per cycle
Uops written to RS and ROB
• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
• RS waits for sources to arrive, allowing OOO execution
• Registers not “in flight” are read from the ROB during RS write
[Figure: register alias table and ALLOC feeding the Re-Order Buffer and Reservation Station]
Issue Ports and Execution Units
6 dispatch ports from RS
• 3 execution ports (shared for integer / FP / SIMD)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
FP data has one additional cycle bypass latency
• Do not mix SSE FP and SSE integer ops on same register
[Figure: dispatch ports: integer/FP/SIMD (3x), load, store address, store data]
Avoid:
addps xmm0, xmm1
pand xmm0, xmm3
addps xmm2, xmm0
Better:
addps xmm0, xmm1
addps xmm2, xmm0
pand xmm0, xmm3
The Out Of Order
Each uop takes only a single RS entry
• load + add dispatches twice (load, then add)
• mulps dispatches once, after load + add has written back
• sta + std dispatches twice
  • sta (address) can fire as early as possible
  • std must wait for mulps to write back
• cmpjne dispatches only once (functionality is truly fused)
  • no dependency, can fire as early as it wants
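The bookkeeping on this slide (one RS entry per fused uop, but one or two dispatches each) can be tallied with a small C table. The struct and function names are hypothetical; the dispatch counts are taken from the slide text, and the table order is only one plausible program order.

```c
#include <stddef.h>

/* One fused uop as held in the RS. `dispatches` is how many times it is
 * sent to an execution port, following the slide's description. */
typedef struct {
    const char *name;
    int dispatches;
} fused_uop;

/* The four fused uops of the example loop body. */
static const fused_uop loop_body[] = {
    { "load_add xmm0, xmm0, [EAX+16]", 2 },  /* micro-fused: load, then add  */
    { "mulps xmm0, xmm0, xmm0",        1 },
    { "sta_std [EAX+240], xmm0",       2 },  /* micro-fused: sta and std     */
    { "cmpjne EAX, 100000, label",     1 },  /* macro-fused: single dispatch */
};

static const size_t loop_body_len = sizeof loop_body / sizeof loop_body[0];

/* Total number of dispatches needed by a sequence of fused uops. */
int total_dispatches(const fused_uop *u, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += u[i].dispatches;
    return total;
}
```

For the example, five instructions occupy four RS entries and need six dispatches in total.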
RS contents:
cmpjne EAX, 100000, label
sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
Dispatching to OOO EXE
RS contents (four loop iterations):
• cmpjne EAX, 100000, label; sta_std [EAX+240], xmm0; mulps xmm0, xmm0, xmm0; load_add xmm0, xmm0, [EAX+16]
• cmpjne EAX, 100000, label; sta_std [EAX+244], xmm0; mulps xmm0, xmm0, xmm0; load_add xmm0, xmm0, [EAX+16]
• cmpjne EAX, 100000, label; sta_std [EAX+248], xmm0; mulps xmm0, xmm0, xmm0; load_add xmm0, xmm0, [EAX+16]
• cmpjne EAX, 100000, label; sta_std [EAX+24C], xmm0; mulps xmm0, xmm0, xmm0; load_add xmm0, xmm0, [EAX+16]
Dispatch ports:
• Port 0: GP (incl. FP mul)
• Port 1: GP (incl. FP add)
• Port 2: Load
• Port 3: STA
• Port 4: STD
• Port 5: GP (incl. jmp)
Advanced Memory Access
L1D: 3-clock latency and 1-clock throughput; L2: 14-clock latency and 2-clock throughput
Miss Latencies
• L1 miss hits L2 ~ 10 cycles
• L2 miss, access to memory ~300 cycles (server/FBD)
• L2 miss, access to memory ~165 cycles (desktop/DDR2)
  • C-step Broadwater is reported to have ~50 ns latency
Cache Bandwidth
• Bandwidth to cache ~ 8.5 bytes/cycle
Memory Bandwidth
• Desktop ~ 6 GB/sec/socket (Linux)
• Server ~3.5 GB/sec/socket
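The latencies above can be combined into an average memory access time. The C sketch below uses the standard two-level formula; the function name is hypothetical, the hit rates in the usage note are made-up inputs rather than measurements, and treating the quoted L2 and memory figures as total load latencies is an assumption.

```c
/* Average memory access time, in cycles, for an L1 + L2 + memory hierarchy.
 * Each latency argument is the total latency of a load served by that level. */
double amat_cycles(double l1_hit_rate, double l2_hit_rate,
                   double l1_lat, double l2_lat, double mem_lat)
{
    double l1_miss = 1.0 - l1_hit_rate;
    double l2_miss = 1.0 - l2_hit_rate;

    return l1_hit_rate * l1_lat
         + l1_miss * (l2_hit_rate * l2_lat + l2_miss * mem_lat);
}
```

For instance, with assumed hit rates of 95% in L1 and 90% in L2, `amat_cycles(0.95, 0.90, 3.0, 14.0, 165.0)` works out to about 4.3 cycles; even a small fraction of loads going to memory contributes noticeably to the average.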
Intel® Core™ Microarchitecture – Memory Sub-system
Optimizing for Intel® Core™ Microarchitecture
Use CMP (chip multiprocessing): employ both cores
• Go to multithreading!
Prefer SSE wherever possible. If you have not vectorized the code yet, do it now!
• Intel Compiler has a very good vectorization engine
Align data and use a sequential data layout
• To align use __declspec(align (16)) float a[1000];
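The `__declspec(align(16))` form is specific to Microsoft-compatible compilers. A hedged sketch of a more portable declaration follows: the macro name `ALIGN16` and the helper `is_aligned16` are hypothetical, and the fallback branch assumes a compiler that accepts the C11 `_Alignas` keyword.

```c
#include <stdint.h>

/* Pick an alignment specifier the compiler understands. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 _Alignas(16)   /* C11 */
#endif

/* 16-byte aligned array, suitable for aligned 128-bit SSE loads and stores. */
ALIGN16 float a[1000];

/* Returns 1 when the pointer is 16-byte aligned, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

A check such as `is_aligned16(a)` is a cheap way to verify alignment in a debug build before relying on aligned SSE accesses.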
Optimizing for Intel® Core™ Microarchitecture (advanced)
Use Intel VTune™ Performance Analyzer to reveal performance problems
• CPI
• Specific CPU events for Core-arch:
RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL, etc. See VTune help
Front End Issue Debugging
Look for Front End optimization only when code is FE bound
• Reservation station (RS) is the front end and allocation target
• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue
  • If there are no issues in the FE, the RS should be full above 30% of the time
Front End typical issues:
• Code is too big to fit in the L1:
  • When L2_IFETCH.SELF.MESI happens every 10-15 instructions
  • Code that could have had a CPI of 1 will be around 2
  • 14-cycle penalty for an L1 demand miss
• Average instruction size above 6 bytes
  • Happens typically with SSE code, and more with EM64T
  • Can have an impact only in case of otherwise excellent CPI
• Code with length-changing prefix (LCP) issues
  • Penalty of 6 cycles or more
  • Look at the ILD_STALL VTune event
Front-End should not be the bottleneck.
Focus on Front End issues only if it is the issue.
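The first rule of thumb (poor CPI together with a rarely full RS points at the front end) can be written as a small check over raw counter values. This C sketch is an assumption-laden illustration: the struct and function names are hypothetical, it assumes RESOURCE_STALLS.RS_FULL is collected as a cycle count, the 30% figure comes from the slide, and the CPI threshold is left to the caller because the slide does not define "poor".

```c
/* Raw counts from one profiling run (field names are hypothetical). */
typedef struct {
    unsigned long long cycles;          /* unhalted core cycles     */
    unsigned long long instructions;    /* instructions retired     */
    unsigned long long rs_full_cycles;  /* RESOURCE_STALLS.RS_FULL  */
} perf_sample;

/* Cycles per instruction; caller must ensure instructions != 0. */
double cpi(const perf_sample *s)
{
    return (double)s->cycles / (double)s->instructions;
}

/* Fraction of cycles with the RS full; caller must ensure cycles != 0. */
double rs_full_fraction(const perf_sample *s)
{
    return (double)s->rs_full_cycles / (double)s->cycles;
}

/* Returns 1 when the sample matches the slide's front-end pattern:
 * CPI above the caller's threshold and the RS full less than 30% of cycles. */
int looks_front_end_bound(const perf_sample *s, double poor_cpi)
{
    if (s->cycles == 0 || s->instructions == 0)
        return 0;
    return cpi(s) > poor_cpi && rs_full_fraction(s) < 0.30;
}
```

A positive result is only a hint to look at L2_IFETCH.SELF.MESI, instruction size, and ILD_STALL next, not a diagnosis by itself.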
Execution micro architecture
The busiest port may determine the potential execution speed
Single clock latency operations are best
• Operations with different latencies can create writeback conflicts, creating a bubble in the port
Look at the dependency chains to see the potential parallelism
• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports
• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
• The ROB has 96 entries
• High RESOURCE_STALLS.ROB_FULL percentage only if
  • Code has long latency instructions (L2 misses)
  • Other code can be executed while waiting
Execution stage: The key for good performance.
Focus on port utilization and dependency chains
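A common way to shorten a dependency chain is to split a reduction across several independent accumulators. The C sketch below is a generic illustration with hypothetical function names; the choice of four accumulators is arbitrary, and because it reorders floating-point additions its result can differ in the last bits from the single-chain version.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the previous one, so the loop
 * runs at the latency of one add per element. */
float sum_chain(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: four shorter dependency chains that the
 * scheduler can overlap. Floating-point adds are reassociated here. */
float sum_4acc(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)          /* remaining 0..3 elements */
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);
}
```

Whether the second form is faster depends on the add latency, the number of ports that can execute adds, and whether the compiler has already vectorized the loop.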
Execution micro architecture
The Divider is a big potential stall source
• DIV for the number of divide operations executed
• IDLE_DURING_DIV for the number of cycles with no port issue while the divider is busy
• Try to find some useful work to do in parallel with divide operations
Extra cycle latency for bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
• DELAYED_BYPASS.FP
• DELAYED_BYPASS.LOAD
• DELAYED_BYPASS.SIMD
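One way to keep the divider from becoming the stall source is to divide once and multiply many times. The C sketch below shows that transformation, which goes a step beyond the slide's advice of overlapping divides with other work. The function names are hypothetical, and multiplying by a reciprocal can round differently from a true division, which is why compilers normally do not make this change without relaxed floating-point options.

```c
#include <stddef.h>

/* One divide per element: each iteration occupies the divider. */
void scale_div(float *x, size_t n, float d)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] / d;
}

/* One divide in total: the reciprocal is computed once and the loop uses
 * multiplies. Results may differ slightly from scale_div in general. */
void scale_recip(float *x, size_t n, float d)
{
    float r = 1.0f / d;
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] * r;
}
```

When the divisor is a power of two, as in dividing by 4.0f, both versions give identical results because the reciprocal is exact.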
[Figure: EXE block. Ports 0, 1, 5: SIMD Integer, integer/SIMD MUL, Integer, Floating Point. Port 2: load. Port 3: store (address). Port 4: store (data). Data Cache Unit: dtlb, memory ordering, store forwarding.]
Enhancements and Optimization Opportunities
IP Prefetcher
• Prefetches stride loads associated with the same IP
• Uses a history table
• Use VTune events to identify misses where prefetches were expected
Memory Disambiguation
• Predicts when it is OK to fire a load before preceding stores with unknown address
• Misprediction triggers a pipeline flush and load restart
• Disambiguation is temporarily disabled if it fails frequently
• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
• When the load and store are not to the same address, a possible reason for disambiguation not working is an address collision with other load(s)
4K Aliasing
• OOO engine can fire a load before a preceding store if it does not collide with the store’s address
  • Address collision serializes execution
• Address checking uses only the last 12 bits (4K)
  • False blocking if the load’s and store’s addresses have a 4 KB offset
• e.g. accessing large, power-of-two sized arrays in a loop
• Resolve 4K aliasing conflicts by changing the memory layout
• VTune event LOAD_BLOCK.OVERLAP_STORE
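The 12-bit comparison and the layout fix can be sketched in C. Everything named here is hypothetical (`low12_match`, `ALIAS_N`, the two structs, and the 64-byte pad size), the check ignores access sizes, and the layout assumes 4-byte `float`; it is a simplified picture of the condition described above, not the hardware's exact rule.

```c
#include <stdint.h>

/* Returns 1 when two addresses share the same low 12 bits, i.e. they
 * differ by a multiple of 4 KB. */
int low12_match(const void *load_addr, const void *store_addr)
{
    return (((uintptr_t)load_addr ^ (uintptr_t)store_addr) & 0xFFFu) == 0;
}

#define ALIAS_N 1024   /* 1024 floats = 4096 bytes, assuming 4-byte float */

/* Back-to-back power-of-two arrays: a[i] and b[i] are exactly 4 KB apart. */
struct unpadded {
    float a[ALIAS_N];
    float b[ALIAS_N];
};

/* Same data with a pad between the arrays, so a[i] and b[i] no longer
 * share their low 12 address bits. */
struct padded {
    float a[ALIAS_N];
    char  pad[64];
    float b[ALIAS_N];
};
```

A loop that stores to `a[i]` and then loads `b[i]` would match on the low 12 bits with the first layout and not with the second.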
Load block cases
• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
• Store address unknown: LOAD_BLOCK.STA
  • Loads blocked by a preceding store with unknown address
• Store data unknown: LOAD_BLOCK.STD
  • Loads blocked by a preceding store with unknown data
• Loads blocked until retirement: LOAD_BLOCK.UNTIL_RETIRE
  • This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
Other Opportunities for Performance Gain in the memory sub-system