title (verdana bold 18pt) · bobcat low power core bu l2 decode. 20 | bobcat | hot chips 2010...
TRANSCRIPT
![Page 1: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/1.jpg)
| Bobcat | Hot Chips 2010 1
“Bobcat ”AMD’s New Low Power x86 Core Architecture
Brad Burgess, AMD FellowChief Architect / Bobcat Core
August 24, 2010
![Page 2: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/2.jpg)
| Bobcat | Hot Chips 2010 2
Tw o x8 6 Cores Tuned for Target Markets
“Bulldozer”
Performance & Scalabilit y
“Bobcat ”
Flexible, Low Power & Small
Mainst ream Client and Server Markets
Low Pow erMarkets
Sm allDie Area
Cloud Clients Opt im ized
![Page 3: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/3.jpg)
| Bobcat | Hot Chips 2010 3
Bobcat Design Goals
A sm all, efficient , low power x86 core
Excellent perform ance
Synthesizable with sm all num ber of custom arrays
Easily Portable across process technologies
![Page 4: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/4.jpg)
| Bobcat | Hot Chips 2010 4
Feature Set
64-bit AMD64 x86 I SA
SI MD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A
Virtualizat ion
Support for m isaligned 128-bit data types
I nst ruct ion Based Sampling ( for dynam ic opt im izat ion)
C6 (with integrated power gat ing)
![Page 5: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/5.jpg)
| Bobcat | Hot Chips 2010 5
Micro - architecture Overview
Dual x86 inst ruct ion decode
Out-of-Order inst ruct ion execut ion
Dual COP ret irement
Complex m icroOPs
State of the art branch predict ion
Aggressive OOO load/ store engine w/ hazard predict ion
Advanced Virtualizat ion w/ nested page tables, ASI Ds and world switch accelerat ion
Low power C6 state w/ core level power gat ing and state save accelerat ion
![Page 6: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/6.jpg)
| Bobcat | Hot Chips 2010 6
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
BobcatMicro-Architecture
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch Queue
![Page 7: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/7.jpg)
| Bobcat | Hot Chips 2010 7
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
BobcatMicro-Architecture
I cache 32Kbyte
2-way set associat ive
64-byte line
Parity Protected
512/ 8 ent ry I TLB (4k/ 2m)
Fetch up to 32-bytes/ cycle
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch Queue
![Page 8: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/8.jpg)
| Bobcat | Hot Chips 2010 8
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
BobcatMicro-Architecture
Branch Predictor : Predicts up to two
branches per cycle
Remembers branch inst ruct ion locat ions
Return Stack Address Predictor
I ndirect Dynam ic Address Predictor
State of the Art condit ion Predictor
Only necessary st ructures are clocked
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch Queue
![Page 9: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/9.jpg)
| Bobcat | Hot Chips 2010 9
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
BobcatMicro-Architecture
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueDual x8 6 Decoder: Scans up to 22 bytes
Decodes up to two x86 inst ruct ions per cycle
The decoder can direct ly map 89% of x86 inst ruct ions to a single m icroOp, an addit ional 10% to a pair of m icroOps, and more complicated x86 inst ruct ions (< 1% ) are m icrocoded. (Dynam ic I nst ruct ion Counts)
![Page 10: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/10.jpg)
| Bobcat | Hot Chips 2010 1 0
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
BobcatMicro-Architecture
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueI nteger Execut ion: A dual port integer
scheduler feeds two ALUs
A dual port address scheduler feeds a load address unit , and a store address unit .
Physical Register File uses maps and pointers to reduce power by m inim izing data copying/ movement .
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
Mul
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
![Page 11: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/11.jpg)
| Bobcat | Hot Chips 2010 1 1
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
BobcatMicro-Architecture
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueFloat ing Point Unit : A cent ralized FP scheduler
feeds two 64-bit FP execut ion stacks
MMX and Logical units are replicated in both stacks
The FP Mul Unit can perform two SP mult iplies per cycle
The FP Add Unit can perform two SP addit ions per cycle
A physical register file is used to reduce power
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
![Page 12: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/12.jpg)
| Bobcat | Hot Chips 2010 1 2
BobcatMicro-Architecture
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueData Cache: 32-Kbyte
8-way set associat ive
64-byte line
Parity Protected
Copyback
40/ 8 ent ry L1DTLB (4k/ 2m)
512/ 64 ent ry L2DTLB (4k/ 2m)
Advanced 8-st ream prefetcher
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
![Page 13: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/13.jpg)
| Bobcat | Hot Chips 2010 1 3
BobcatMicro-Architecture
BU512KB
L2CACHE To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueOut - of - Order Load
Store Unit : Loads bypassing loads
Loads bypassing stores
Stores bypassing loads
Bypass t racking and dependency correct ion
Hazard predictor
Fast store forwarding
Fast cr it ical word fill forwarding
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
![Page 14: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/14.jpg)
| Bobcat | Hot Chips 2010 1 4
BobcatMicro-Architecture
To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueL2 Cache: 512Kbyte
16-way set associat ive
64 byte lines
ECC Protected
Half speed clocking for power reduct ion
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
BU512KB
L2CACHE
![Page 15: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/15.jpg)
| Bobcat | Hot Chips 2010 1 5
BobcatMicro-Architecture
To/from Northbridge
ITLB Branch Predictor
Branch Locator ConditionPredictor
Return Stack Dynamic Target
32KBICACHE
Fetch QueueBus Unit : 8-outstanding data
accesses
2-outstanding fetch accesses
Evict ion Buffers
Fill Buffers
Write combining buffers
Coherency management
uCode
ROBInt Rename
Dual x86 Decoder
Instr Queue
ALU SAGUALU
Mul
Int PRF
LAGU
SchedulerScheduler
MMX Alu MMX Alu
FPAdd FPMul
FP PRF
FP Logical FP Logical
St ConvIntMul
FP Rename
FP Decode
FP Sched
LdStUnit
32KBDCACHE
DTLB
Table Walker
Prefetch
BU512KB
L2CACHE
![Page 16: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/16.jpg)
| Bobcat | Hot Chips 2010 1 6
EXEEXE
Bobcat Pipeline0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2
Branch Mispredict Latency13-cycles
Dec0 Dec1 Dec2 Pack FDec Dispatch Schedule RegRead ALU Writeback
uCodeROM
MDecFetch0 Fetch1 Fetch2 Fetch3 Fetch4 Fetch5
Transit FpDec RegRen Schedule RegRead EXE AGU DC1 DC2Writeback
Load Use LatencyL1 hit : 3-cycles
Transit L2Tag L2Data
Load Use LatencyL2 hit : 17-cycles
![Page 17: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/17.jpg)
| Bobcat | Hot Chips 2010 1 7
Core Floor Plan
Inst TLB/Tag
Instruction Cache
Branch Predict
UcodeROM
Test/Debug
X86 Decode
Floating Point Unit
Data L2 TLB
Bus Unit
L2 Sub Array
L2 TAG
ROB
Integer Unit
Load Store Unit
Data Tag/TLB
Data Cache
![Page 18: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/18.jpg)
| Bobcat | Hot Chips 2010 1 8
Pow er Reduct ion Use of physical Register files
Extensive use of non-shift ing queues with pointers
Fine grain clock gat ing
I ntegrated Core Power Gat ing
Only needed arrays are clocked – i.e. Dtag hit before Dcache read
– Predict ing the type of branch then clocking the appropriate predictor(s)
Elim inat ion of inst ruct ion m arker bits in the I cache
Finding the knee of the curve (scrut inize perform ance gains against power costs)
Polishing speed paths to raise the Vt m ix and reduce leakage
![Page 19: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/19.jpg)
| Bobcat | Hot Chips 2010 1 9
Bobcat Core OverviewAdvanced Micro - architecture Dual x86 Decode Advanced Branch Predictor Full OOO inst ruct ion execut ion Full OOO load/ store engine High Performance Float ing Point AMD64 64-bit I SA SSE1,2,3, SSSE3 I SA Secure Vir tualizat ion 32kb L1s, 512kb L2
Low Pow er Design Power Opt im ized Execut ion Micro-architecture that m inim izes data movement
and unnecessary reads Clock gat ing, Power gat ing System Low Power States
Sm all Core Area efficient balance of high performance and low
power
APipe
MPipe
FP Scheduler
ICACHE
DCACHE
IPipe
StorePipe
AddressScheduler
IPipe
LoadPipe
IntegerScheduler
Fetch
Bobcat Low
Pow erCore
BU
L2
Decode
![Page 20: Title (Verdana Bold 18pt) · Bobcat Low Power Core BU L2 Decode. 20 | Bobcat | Hot Chips 2010 Summary Estimated 90% of the performance of today’s mainstream notebook CPU in half](https://reader034.vdocuments.site/reader034/viewer/2022042411/5f2994d95df682201a59f321/html5/thumbnails/20.jpg)
| Bobcat | Hot Chips 2010 2 0
Sum m ary
Est im ated 90% of the perform ance of today’s m ainst ream notebook CPU in half the area*
Sub-one wat t capable
Highly portable across designs and m anufactur ing technologies
*Based on internal AMD modeling using benchmark simulations