“Bobcat”: AMD’s New Low Power x86 Core Architecture

Brad Burgess, AMD Fellow, Chief Architect / Bobcat Core

Hot Chips 2010, August 24, 2010



Page 2

Two x86 Cores Tuned for Target Markets

“Bulldozer”: Performance & Scalability

Mainstream Client and Server Markets

“Bobcat”: Flexible, Low Power & Small

Low Power Markets

Small Die Area

Cloud Clients Optimized

Page 3

Bobcat Design Goals

A small, efficient, low power x86 core

Excellent performance

Synthesizable with small number of custom arrays

Easily portable across process technologies

Page 4

Feature Set

64-bit AMD64 x86 ISA

SIMD extensions: SSE1, SSE2, SSE3, SSSE3, SSE4A

Virtualization

Support for misaligned 128-bit data types

Instruction Based Sampling (for dynamic optimization)

C6 (with integrated power gating)

Page 5

Micro-architecture Overview

Dual x86 instruction decode

Out-of-Order instruction execution

Dual COP retirement

Complex microOPs

State of the art branch prediction

Aggressive OOO load/store engine w/ hazard prediction

Advanced Virtualization w/ nested page tables, ASIDs and world switch acceleration

Low power C6 state w/ core level power gating and state save acceleration

Page 6

Bobcat Micro-Architecture

[Block diagram. Labeled blocks: 32KB ICACHE, ITLB, Fetch Queue, Branch Predictor (Branch Locator, Condition Predictor, Return Stack, Dynamic Target), Instr Queue, Dual x86 Decoder, uCode, Int Rename, ROB, Scheduler, Scheduler, Int PRF, ALU, ALU, Mul, LAGU, SAGU, FP Decode, FP Rename, FP Sched, FP PRF, MMX Alu (x2), FP Logical (x2), FPAdd, FPMul, St Conv, IntMul, LdStUnit, 32KB DCACHE, DTLB, Table Walker, Prefetch, BU, 512KB L2 CACHE, To/from Northbridge.]

Page 7

Bobcat Micro-Architecture

Icache:

32-Kbyte

2-way set associative

64-byte line

Parity Protected

512/8 entry ITLB (4k/2m)

Fetch up to 32 bytes/cycle
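The instruction cache parameters above fix its geometry. A minimal sketch of the arithmetic, assuming the usual relation sets = capacity / (ways x line size); the class and property names are illustrative and do not come from the slides:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Illustrative helper (hypothetical name): derives set count and address bits."""
    size_bytes: int
    ways: int
    line_bytes: int

    @property
    def sets(self) -> int:
        # sets = capacity / (associativity * line size)
        return self.size_bytes // (self.ways * self.line_bytes)

    @property
    def offset_bits(self) -> int:
        # bits needed to select a byte within one line
        return (self.line_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        # bits needed to select a set
        return (self.sets - 1).bit_length()


# Figures from the slide: 32 Kbyte, 2-way set associative, 64-byte line.
ICACHE = CacheGeometry(size_bytes=32 * 1024, ways=2, line_bytes=64)

# A fetch of up to 32 bytes/cycle means one 64-byte line covers two full-width fetches.
FETCHES_PER_LINE = ICACHE.line_bytes // 32

print(ICACHE.sets, ICACHE.offset_bits, ICACHE.index_bits, FETCHES_PER_LINE)
```

With these numbers the Icache has 256 sets, 6 offset bits and 8 index bits. How the hardware actually indexes or tags the array is not stated on the slide.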

Page 8

Bobcat Micro-Architecture

Branch Predictor:

Predicts up to two branches per cycle

Remembers branch instruction locations

Return Stack Address Predictor

Indirect Dynamic Address Predictor

State of the Art Condition Predictor

Only necessary structures are clocked

Page 9

Bobcat Micro-Architecture

Dual x86 Decoder:

Scans up to 22 bytes

Decodes up to two x86 instructions per cycle

The decoder can directly map 89% of x86 instructions to a single microOp, an additional 10% to a pair of microOps, and more complicated x86 instructions (<1%) are microcoded. (Dynamic Instruction Counts)
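The decode mix above implies an average microOp count per x86 instruction. A rough sketch of that weighted average follows. The slide gives no length for microcoded sequences, so `microcode_len` is a purely hypothetical parameter, and the microcoded share is taken as 1%, which is an upper bound on the "<1%" figure:

```python
def average_microops(single: float = 0.89,
                     double: float = 0.10,
                     microcoded: float = 0.01,
                     microcode_len: float = 4.0) -> float:
    """Expected microOps per x86 instruction for a dynamic instruction mix.

    single, double: fractions from the slide (89% and 10%).
    microcoded:     ASSUMED 1%, an upper bound on the slide's "<1%".
    microcode_len:  ASSUMED average microOps per microcoded instruction;
                    the slides do not give this number.
    """
    total = single + double + microcoded
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return single * 1 + double * 2 + microcoded * microcode_len


if __name__ == "__main__":
    # Sensitivity to the assumed microcode length.
    for length in (4.0, 10.0, 20.0):
        print(length, round(average_microops(microcode_len=length), 3))
```

Even with a long assumed microcode sequence the average stays close to one microOp per instruction, because 89% of dynamic instructions map to a single microOp.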

Page 10

Bobcat Micro-Architecture

Integer Execution:

A dual port integer scheduler feeds two ALUs

A dual port address scheduler feeds a load address unit, and a store address unit.

Physical Register File uses maps and pointers to reduce power by minimizing data copying/movement.
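The point about maps and pointers can be illustrated with a toy renamer. This is a generic textbook-style sketch, not Bobcat's actual design: the class, its fields and the free-list policy are all assumptions made for illustration. It shows that a result is written once into the physical register file, and that retirement only updates a map entry.

```python
class RenameMap:
    """Toy register renamer (hypothetical): architectural names point at physical registers."""

    def __init__(self, num_arch: int, num_phys: int):
        if num_phys <= num_arch:
            raise ValueError("need more physical than architectural registers")
        self.prf = [0] * num_phys                    # the only place data values live
        self.spec_map = list(range(num_arch))        # speculative arch -> phys pointers
        self.retired_map = list(range(num_arch))     # committed arch -> phys pointers
        self.free = list(range(num_arch, num_phys))  # unallocated physical registers
        self.data_writes = 0                         # counts actual data movement

    def rename_dest(self, arch: int) -> int:
        """Allocate a fresh physical register for a new value of `arch`."""
        phys = self.free.pop(0)
        self.spec_map[arch] = phys
        return phys

    def writeback(self, phys: int, value: int) -> None:
        """Execution result is written exactly once, into the PRF."""
        self.prf[phys] = value
        self.data_writes += 1

    def retire(self, arch: int, phys: int) -> None:
        """Commit by moving a pointer; the data itself is not copied."""
        old = self.retired_map[arch]
        self.retired_map[arch] = phys
        self.free.append(old)

    def read_committed(self, arch: int) -> int:
        return self.prf[self.retired_map[arch]]


regs = RenameMap(num_arch=4, num_phys=8)
p = regs.rename_dest(2)      # new destination for architectural register 2
regs.writeback(p, 42)        # single data write
regs.retire(2, p)            # pointer update only
print(regs.read_committed(2), regs.data_writes)
```

In a design that instead holds results in the reorder buffer and copies them to a separate architectural file at retirement, the same instruction would move its data twice.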

Page 11

Bobcat Micro-Architecture

Floating Point Unit:

A centralized FP scheduler feeds two 64-bit FP execution stacks

MMX and Logical units are replicated in both stacks

The FP Mul Unit can perform two SP multiplies per cycle

The FP Add Unit can perform two SP additions per cycle

A physical register file is used to reduce power
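The per-cycle rates above give a simple upper bound on single-precision throughput. A small sketch follows. The clock frequency is a free parameter because the slides do not state one, and the two-cycle occupancy for a 128-bit packed operation is an inference from the stated rates, not something the slides say:

```python
SP_MULS_PER_CYCLE = 2  # from the slide: FP Mul Unit, two SP multiplies per cycle
SP_ADDS_PER_CYCLE = 2  # from the slide: FP Add Unit, two SP additions per cycle


def peak_sp_flops_per_cycle() -> int:
    """Peak single-precision operations per cycle if both units are kept busy."""
    return SP_MULS_PER_CYCLE + SP_ADDS_PER_CYCLE


def peak_sp_gflops(clock_ghz: float) -> float:
    """Peak SP GFLOPS at an ASSUMED clock; the slides give no frequency."""
    return peak_sp_flops_per_cycle() * clock_ghz


def cycles_for_packed_sp(lanes: int = 4, per_cycle: int = 2) -> int:
    """ASSUMPTION: a packed op with `lanes` SP elements occupies a unit for
    ceil(lanes / per_cycle) cycles. Inferred from the per-cycle rates only."""
    return -(-lanes // per_cycle)


print(peak_sp_flops_per_cycle(), cycles_for_packed_sp())
```

Under these assumptions the core peaks at four SP operations per cycle, and a four-lane packed SP multiply or add would take two cycles of unit time.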

Page 12

Bobcat Micro-Architecture

Data Cache:

32-Kbyte

8-way set associative

64-byte line

Parity Protected

Copyback

40/8 entry L1DTLB (4k/2m)

512/64 entry L2DTLB (4k/2m)

Advanced 8-stream prefetcher
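The TLB entry counts above translate into the amount of memory each level can map without a table walk ("reach"). A minimal sketch, assuming the "40/8 (4k/2m)" notation means 40 entries for 4 KB pages and 8 entries for 2 MB pages; the function name is illustrative:

```python
KIB = 1024
MIB = 1024 * KIB
PAGE_4K = 4 * KIB
PAGE_2M = 2 * MIB


def tlb_reach(entries_4k: int, entries_2m: int) -> tuple[int, int]:
    """Return (bytes mapped with 4 KB pages, bytes mapped with 2 MB pages).

    ASSUMPTION: the two entry counts are separate pools, one per page size.
    """
    return entries_4k * PAGE_4K, entries_2m * PAGE_2M


L1_DTLB = tlb_reach(40, 8)     # entry counts from the slide
L2_DTLB = tlb_reach(512, 64)   # entry counts from the slide

for name, (small, large) in (("L1DTLB", L1_DTLB), ("L2DTLB", L2_DTLB)):
    print(f"{name}: {small // KIB} KiB with 4k pages, {large // MIB} MiB with 2m pages")
```

With 4 KB pages the L1DTLB covers 160 KiB and the L2DTLB covers 2 MiB, so the second level maps four times the 512-Kbyte L2 cache.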

Page 13

Bobcat Micro-Architecture

Out-of-Order Load Store Unit:

Loads bypassing loads

Loads bypassing stores

Stores bypassing loads

Bypass tracking and dependency correction

Hazard predictor

Fast store forwarding

Fast critical word fill forwarding

Page 14

Bobcat Micro-Architecture

L2 Cache:

512-Kbyte

16-way set associative

64-byte lines

ECC Protected

Half speed clocking for power reduction

Page 15

Bobcat Micro-Architecture

Bus Unit:

8 outstanding data accesses

2 outstanding fetch accesses

Eviction Buffers

Fill Buffers

Write combining buffers

Coherency management
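The limit of 8 outstanding data accesses bounds how much miss traffic the core can sustain. A back-of-the-envelope sketch using Little's law (throughput = requests in flight / latency) follows. It assumes each outstanding access is a 64-byte line fill, and the latency value is a made-up input, since the slides give no memory latency:

```python
LINE_BYTES = 64       # line size from the cache slides
MAX_DATA_MISSES = 8   # outstanding data accesses from the slide


def max_miss_bandwidth_gbps(latency_ns: float,
                            outstanding: int = MAX_DATA_MISSES,
                            line_bytes: int = LINE_BYTES) -> float:
    """Upper bound on sustained fill bandwidth in GB/s (bytes per nanosecond).

    ASSUMPTIONS: every outstanding access is a full line fill, and
    latency_ns is a hypothetical round-trip latency chosen by the caller.
    """
    if latency_ns <= 0:
        raise ValueError("latency must be positive")
    return outstanding * line_bytes / latency_ns


if __name__ == "__main__":
    for latency in (50.0, 100.0, 200.0):   # hypothetical latencies in ns
        print(latency, max_miss_bandwidth_gbps(latency))
```

The bound scales inversely with latency, which is one reason a prefetcher and fill buffers matter for a core with a small number of outstanding accesses.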

Page 16

Bobcat Pipeline

[Pipeline diagram, cycles 0 to 12. Stage labels: Fetch0, Fetch1, Fetch2, Fetch3, Fetch4, Fetch5; Dec0, Dec1, Dec2, Pack, FDec, Dispatch, Schedule, RegRead, ALU, Writeback; uCode ROM, MDec; Transit, FpDec, RegRen, Schedule, RegRead, EXE, EXE, Writeback; AGU, DC1, DC2; Transit, L2Tag, L2Data.]

Branch mispredict latency: 13 cycles

Load-use latency, L1 hit: 3 cycles

Load-use latency, L2 hit: 17 cycles
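The three latencies above can be plugged into first-order performance estimates. The sketch below is a simple textbook model, not AMD's methodology. The hit rates, memory latency, branch fraction and mispredict rate are all hypothetical inputs, and treating the 13-cycle mispredict latency as a flat per-mispredict penalty is itself an assumption:

```python
L1_HIT_CYCLES = 3        # load-use latency, L1 hit (from the slide)
L2_HIT_CYCLES = 17       # load-use latency, L2 hit (from the slide)
MISPREDICT_CYCLES = 13   # branch mispredict latency (from the slide)


def avg_load_use_latency(l1_hit_rate: float,
                         l2_local_hit_rate: float,
                         memory_cycles: float) -> float:
    """Average load-use latency in cycles.

    l1_hit_rate:       ASSUMED fraction of loads that hit in L1.
    l2_local_hit_rate: ASSUMED fraction of L1 misses that hit in L2.
    memory_cycles:     ASSUMED latency of a load that misses both caches.
    """
    l1_miss = 1.0 - l1_hit_rate
    return (l1_hit_rate * L1_HIT_CYCLES
            + l1_miss * l2_local_hit_rate * L2_HIT_CYCLES
            + l1_miss * (1.0 - l2_local_hit_rate) * memory_cycles)


def mispredict_cpi(branch_fraction: float, mispredict_rate: float) -> float:
    """Cycles per instruction lost to mispredicts, treating each as a flat penalty."""
    return branch_fraction * mispredict_rate * MISPREDICT_CYCLES


if __name__ == "__main__":
    # Hypothetical workload numbers, for illustration only.
    print(avg_load_use_latency(0.95, 0.80, 150.0))
    print(mispredict_cpi(0.20, 0.05))
```

A model like this ignores overlap from out-of-order execution, so it overstates the cost of misses that the core can hide.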

Page 17

Core Floor Plan

[Floor plan figure. Labeled blocks: Inst TLB/Tag, Instruction Cache, Branch Predict, Ucode ROM, Test/Debug, X86 Decode, Floating Point Unit, Data L2 TLB, Bus Unit, L2 Sub Array, L2 TAG, ROB, Integer Unit, Load Store Unit, Data Tag/TLB, Data Cache.]

Page 18

Power Reduction

Use of physical register files

Extensive use of non-shifting queues with pointers

Fine grain clock gating

Integrated Core Power Gating

Only needed arrays are clocked

- i.e. Dtag hit before Dcache read

- Predicting the type of branch then clocking the appropriate predictor(s)

Elimination of instruction marker bits in the Icache

Finding the knee of the curve (scrutinize performance gains against power costs)

Polishing speed paths to raise the Vt mix and reduce leakage
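The "non-shifting queues with pointers" item can be pictured as a circular buffer: entries are written once and stay in their slot, and only the head and tail pointers advance. The following software sketch is an analogy only. The class name and the write counter are invented for illustration, and the slides do not describe how the hardware queues are built.

```python
class PointerQueue:
    """Fixed-size FIFO whose entries never move; only pointers change."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.slots = [None] * capacity
        self.head = 0          # index of the oldest entry
        self.count = 0         # number of valid entries
        self.slot_writes = 0   # proxy for data movement (and switching activity)

    def push(self, item) -> None:
        if self.count == len(self.slots):
            raise OverflowError("queue full")
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = item      # the single write this entry ever gets
        self.slot_writes += 1
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        item = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)  # advance pointer, no shifting
        self.count -= 1
        return item


q = PointerQueue(4)
for name in ("a", "b", "c"):
    q.push(name)
first = q.pop()
q.push("d")
q.push("e")                      # wraps around into the freed slot
rest = [q.pop() for _ in range(4)]
print(first, rest, q.slot_writes)
```

A shifting queue would rewrite every remaining entry on each pop. Here five pushes cost exactly five slot writes regardless of how many pops occur.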

Page 19

Bobcat Core Overview

Advanced Micro-architecture

Dual x86 Decode

Advanced Branch Predictor

Full OOO instruction execution

Full OOO load/store engine

High Performance Floating Point

AMD64 64-bit ISA

SSE1, 2, 3, SSSE3 ISA

Secure Virtualization

32KB L1s, 512KB L2

Low Power Design

Power Optimized Execution

Micro-architecture that minimizes data movement and unnecessary reads

Clock gating, Power gating

System Low Power States

Small Core

Area efficient balance of high performance and low power

[Diagram: Bobcat Low Power Core. Labeled blocks: Fetch, Decode, ICACHE, DCACHE, Integer Scheduler, Address Scheduler, FP Scheduler, IPipe (x2), LoadPipe, StorePipe, APipe, MPipe, BU, L2.]

Page 20

Summary

Estimated 90% of the performance of today’s mainstream notebook CPU in half the area*

Sub-one watt capable

Highly portable across designs and manufacturing technologies

*Based on internal AMD modeling using benchmark simulations