AMD Athlon™ Northbridge with 4x AGP and Next Generation Memory Subsystem
Chetana Keltcher, Jim Kelly, Ramani Krishnan,
John Peck, Steve Polzin, Sridhar Subramanian,
Fred Weber
Advanced Micro Devices, Sunnyvale, CA
* AMD Athlon was formerly code-named AMD-K7
2
Outline of the Talk
• Introduction
• Architecture
• Clocking and Gearbox
• Performance
• Silicon Statistics
• Conclusion
3
Introduction
• NorthBridge: “Electronic traffic cop” that directs data flow between the main memory and the rest of the system
• Bridge the gap between processor speed and memory speed
– Higher bandwidth busses
  • Example: AGP 2.0, EV6 and AMD Athlon system bus
– Better memory technology
  • Example: Double data rate SDRAM, RDRAM
4
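System Block Diagram
[Figure: system block diagram. One or two CPUs connect to the NorthBridge over the AMD Athlon system bus (200 MHz, 64 bits, scalable to 400 MHz). The NorthBridge connects to DRAM (100 MHz for PC-100 SDRAM, 200 MHz for DDR SDRAM, 533-800 MHz for RDRAM), to the graphics device over the AGP bus (66 MHz, 32 bits; 1x, 2x, 4x), and to the PCI bus (33 MHz, 32 bits) with its PCI devices and the SouthBridge. The SouthBridge serves the ISA bus, IDE, USB, printer port and serial port.]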
5
Features of the AMD Athlon Northbridge
• Can support one or two AMD Athlon or EV6 processors
• 200MHz data rate (scalable to 400MHz), 64-bit processor interface
• 33MHz, 32-bit PCI 2.2 compliant interface
• 66MHz, 32-bit AGP 2.0 compliant interface supports 1x, 2x and 4x data transfer modes
• Versions for SDRAM, DDR SDRAM and RDRAM memory
• Single bit error correction and multiple bit error detection (ECC)
• Distributed Graphics Aperture Remapping Table (GART)
• Power management features including powerdown self-refresh of SDRAM and PCI master grant suspend
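The slides do not say which code the ECC logic uses. As a rough illustration of single-bit correction combined with multi-bit detection, the sketch below assumes an extended Hamming (SECDED) code over a 64-bit word, which yields 72 bits in total. The function names and the bit layout are hypothetical and are not taken from the chip.

```python
def _layout(data_bits):
    """Return (check_bits, highest_position) for a Hamming code over data_bits."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r, data_bits + r


def encode(data, data_bits=64):
    """Encode data as a list of bits; index 0 holds the overall parity bit."""
    r, n = _layout(data_bits)
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):               # fill the Hamming check bits
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) & 1      # make the total weight even
    return code


def decode(code, data_bits=64):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    _, n = _layout(data_bits)
    code = list(code)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    odd_weight = sum(code) & 1
    if odd_weight == 0:
        if syndrome != 0:
            return None, "uncorrectable"   # even number of flipped bits
        status = "ok"
    else:
        if syndrome > n:
            return None, "uncorrectable"   # inconsistent with a single flip
        code[syndrome] ^= 1                # syndrome 0 means the parity bit itself
        status = "corrected"
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return data, status
```

With this layout, flipping any one of the 72 bits is repaired on decode, while flipping two bits is reported as uncorrectable rather than silently miscorrected.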
6
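Architecture
[Figure: NorthBridge architecture block diagram. Each processor (with its L2 cache) connects to its own bus interface unit, BIU0 or the optional BIU1. Other labeled blocks: PLLs, MRO, PCI (to the PCI slots), AGP (to the AGP master), APC, MCT (to memory), CFG and GTW, linked by control and data paths. Note on the figure: different versions for PC100 SDRAM, DDR SDRAM and RDRAM.]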
7
Bus Interface Unit (BIU)
• Processor interface; one per processor
• The AMD Athlon front-side bus:
– High-performance point-to-point bus
– Non-multiplexed command, address, data and reply/snoop interfaces
– Double-data-rate transfers on address and data busses
– Split-transaction bus: up to 24 outstanding operations per processor
– Source-synchronized clocking to compensate for PC board propagation delays
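A split-transaction bus lets a request and its reply travel separately, so several operations can be in flight at once. The sketch below models only the bookkeeping for the limit of 24 outstanding operations per processor. The tag pool and the class and method names are illustrative assumptions, not the actual bus protocol.

```python
class SplitTransactionTracker:
    """Toy model: each outstanding operation holds one tag until its reply arrives."""

    def __init__(self, max_outstanding=24):
        self.free_tags = list(range(max_outstanding))
        self.pending = {}

    def issue(self, command, address):
        """Start an operation; return its tag, or None if the limit is reached."""
        if not self.free_tags:
            return None
        tag = self.free_tags.pop(0)
        self.pending[tag] = (command, address)
        return tag

    def complete(self, tag):
        """Retire an operation (replies may arrive in any order) and free its tag."""
        request = self.pending.pop(tag)
        self.free_tags.append(tag)
        return request
```

In this model a 25th request has to wait until any earlier one completes, regardless of the order in which replies come back.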
8
BIU Block Diagram
[Figure: BIU block diagram, sitting between the AMD Athlon system bus and the MRO, PCI and APC blocks. Labeled blocks: Rcv, Xmt, Command Queue, Transaction Agent, Probe Q, Probe Response, Read Q, Memory Write Q and PCI/APC Write Q (each with a Reply label), Mem Read FIFO, PCI/APC Read FIFO, Memory Write FIFO, PCI/APC Write FIFO and Probe Data FIFO. Connections are labeled DRAM, PCI/APC, I/O, Done, Other BIU, and Other BIU or PCI or APC.]
9
Accelerated Graphics Port (AGP)
• 1x, 2x and 4x data transfer rates in pipelined and sideband addressing (SBA) modes
• Fast Write implementation for CPU to AGP master write transfers
• Distributed GART mechanism using 3 fully associative translation lookaside buffers (TLBs)
• Dynamically compensated AGP 2.0 compliant I/O buffers
• Independent R/W data buffers
– Reordering of high priority over low priority transactions
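To illustrate how a GART lookup backed by a fully associative TLB might behave, here is a minimal sketch. The 4 KB page size, the LRU replacement policy, the entry count and all names are assumptions made for the example; the slides do not give these details.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KB pages


class GartTlb:
    """Fully associative TLB in front of a GART table (LRU replacement assumed)."""

    def __init__(self, gart_table, aperture_base, entries=8):
        self.gart_table = gart_table      # aperture page index -> physical page number
        self.aperture_base = aperture_base
        self.entries = entries
        self.tlb = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, address):
        """Map an address inside the graphics aperture to a physical address."""
        offset = address - self.aperture_base
        page = offset >> PAGE_SHIFT
        within = offset & ((1 << PAGE_SHIFT) - 1)
        if page in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(page)            # mark as most recently used
        else:
            self.misses += 1                      # TLB miss: read the GART table
            if len(self.tlb) >= self.entries:
                self.tlb.popitem(last=False)      # evict the least recently used entry
            self.tlb[page] = self.gart_table[page]
        return (self.tlb[page] << PAGE_SHIFT) | within
```

A second access to the same aperture page is served from the TLB without touching the table, which is the point of keeping translations close to each requester.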
10
AGP Block Diagram
[Figure: AGP block diagram, sitting between the AGP bus and memory. Labels: A/D Recv, SBA Recv, A/D Xmit, Addr, SBA Seq, GART Xlat, Seq, SDI, Arb, WbT, TLB Miss, AGP Address, AGP I/O, AGP Data FIFOs (Write Fifo 8x144, Read Fifo 32x128), AGP Read Data, AGP Write Data, AGP Queues (AXQ, WrQ, RdQ), Mem Req, Mem Read Data and Mem Write Data.]
11
PCI Controller
• The PCI arbiter supports five external PCI masters plus the SouthBridge
• PCI to memory traffic is coherent with respect to the processor caches
• Consecutive processor cycles to sequential PCI addresses are chained together into one burst PCI cycle
• Same controller block is instantiated to handle the AGP PCI Protocol (APC block)
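The chaining of consecutive processor cycles into a single burst can be pictured with a small sketch. The `(address, length)` representation and the function name are hypothetical, and any limit on burst length is ignored here.

```python
def chain_cycles(cycles):
    """Merge back-to-back cycles whose addresses are sequential into bursts.

    cycles: iterable of (address, byte_count) in issue order.
    Returns a list of (start_address, total_bytes) bursts.
    """
    bursts = []
    for address, nbytes in cycles:
        if bursts and bursts[-1][0] + bursts[-1][1] == address:
            bursts[-1][1] += nbytes          # continues the previous burst
        else:
            bursts.append([address, nbytes])  # starts a new burst
    return [tuple(burst) for burst in bursts]
```

Only cycles that are adjacent in time and in address are merged; a later return to an earlier address starts a new burst.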
12
Memory Request Organizer (MRO)
• Request crossbar responsible for scheduling memory read and write requests from CPU, PCI and AGP
• Serves as the coherence point
• Requests are reordered to minimize page conflicts and maximize page hits
• Anti-starvation mechanism by aging of entries
• Arbitration bypassed during idle conditions to improve latency
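One way to picture page-hit reordering with aging is the toy scheduler below. The selection rule (aged entries first, then page hits, then the oldest entry), the `max_age` threshold and all names are assumptions for illustration; the slides do not describe the actual arbitration policy.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str   # "CPU", "PCI" or "AGP"
    bank: int
    row: int
    age: int = 0  # number of times this request has been passed over


def pick_next(queue, open_rows, max_age=4):
    """Remove and return the next request to service, or None if the queue is empty."""
    if not queue:
        return None
    index = next((i for i, r in enumerate(queue) if r.age >= max_age), None)
    if index is None:  # nobody is starving: prefer a request that hits an open page
        index = next((i for i, r in enumerate(queue)
                      if open_rows.get(r.bank) == r.row), None)
    if index is None:  # no page hit available: fall back to the oldest request
        index = 0
    chosen = queue.pop(index)
    for waiting in queue:
        waiting.age += 1
    open_rows[chosen.bank] = chosen.row
    return chosen


def drain(queue, open_rows, max_age=4):
    """Service every queued request and return the sources in service order."""
    order = []
    while queue:
        order.append(pick_next(queue, open_rows, max_age).source)
    return order
```

With a generous `max_age`, requests to the open row are grouped together; lowering it forces the request that has waited longest to go first even if it causes a page conflict.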
13
SDRAM Memory Controller
• Controls up to 3 PC100 SDRAM or 4 DDR SDRAM DIMMs; 16, 64, 128 and 256Mb densities supported
• Open page policy with 4 banks open
• Peak bandwidth = 800MB/sec (PC100 SDRAM), 1.6GB/sec (DDR)
[Figure: SDRAM memory controller block diagram. Labels: Memory Request Arbiter (inputs Mem Write, Mem Read, AGP R/W, GART), Memory controller (RAS, CAS, WE, CS), 72 bits, ECC logic, BIU read data, Read data and Write data.]
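The peak bandwidth figures follow from bus width times transfer rate. The short check below assumes a 64-bit data path (the 72-bit label in the figure is taken to be 64 data bits plus 8 ECC bits) and 100 and 200 million transfers per second for PC100 and DDR respectively. The function name is illustrative.

```python
def peak_bandwidth_mb_per_s(data_bits, mega_transfers_per_s):
    """Peak bandwidth in MB/s for a bus of data_bits at the given transfer rate."""
    return (data_bits // 8) * mega_transfers_per_s


pc100 = peak_bandwidth_mb_per_s(64, 100)   # one transfer per 100 MHz clock
ddr = peak_bandwidth_mb_per_s(64, 200)     # two transfers per 100 MHz clock
print(pc100, ddr)
```

These are peak rates; sustained bandwidth is lower because of page misses, refresh and bus turnaround.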
14
Rambus Memory Controller
• One 16-bit RDRAM channel with up to 32 devices distributed across 3 RIMMs; 64, 128 and 256Mb densities supported
• The Rambus memory controller (RMC) uses a closed page policy, but can keep banks open with special chained commands
• Memory address mapped to reduce adjacent bank conflicts
• Peak bandwidth = 1.6GB/sec using 800MHz RDRAMs
[Figure: Rambus memory controller block diagram. Labels: MRO, RMC, RAC, RMD, RMH, 144 bits, 18 bits and RDRAM channel (533-800 MHz).]
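The slides do not give the address mapping itself. As a purely hypothetical illustration of the idea, the sketch below assumes 16 banks per device and that neighbouring banks cannot be open at the same time, and rotates the bank bits so that consecutive pages never land in neighbouring banks.

```python
NUM_BANKS = 16  # assumed bank count; must be a power of two for this rotation
BANK_BITS = NUM_BANKS.bit_length() - 1


def bank_for_page(page_index):
    """Map a sequential page index to a bank by rotating the bank bits left by one."""
    low = page_index % NUM_BANKS
    rotated = (low << 1) | (low >> (BANK_BITS - 1))
    return rotated & (NUM_BANKS - 1)


print([bank_for_page(i) for i in range(NUM_BANKS)])
```

Sequential pages then walk through the even banks followed by the odd banks, so a streaming access pattern does not touch two neighbouring banks back to back, and every bank is still used exactly once per 16 pages.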
15
Clocking and Gear Boxes
• Many clock domains have to be supported
– 66 MHz peripheral (PCI and AGP) logic clock
– 66 / 133 / 266 MHz 1x / 2x / 4x AGP clock
– 66 / 100 / 133 MHz core logic clock, AMD Athlon bus clock and SDRAM clock
• Gear box logic is used to synchronize data transfers between two clock domains
• The gearbox logic is correct by design for hold time across clock domains
• The Rambus RAC synchronizes with the RMC using a different gearbox design
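When a 66 MHz domain feeds a 100 MHz domain, only two new values arrive for every three fast clock cycles, so one fast phase in each frame carries no new data. The sketch below is an idealized timing model of that ratio, not the gear box circuit itself. It assumes both clocks are phase-aligned at the start of each frame, and the function name and time units are made up for the example.

```python
def slow_to_fast_enables(slow_per_frame=2, fast_per_frame=3):
    """For each fast clock edge in one frame, report whether new slow-domain data arrived.

    A frame is the shortest interval holding a whole number of cycles of both
    clocks (2 slow cycles and 3 fast cycles for 66 MHz to 100 MHz).
    """
    slow_period = fast_per_frame   # arbitrary time units; only the ratio matters
    fast_period = slow_per_frame
    launches = [i * slow_period for i in range(slow_per_frame)]
    enables = []
    for k in range(1, fast_per_frame + 1):
        edge = k * fast_period
        previous_edge = edge - fast_period
        # New data is captured if a slow edge fell in the preceding fast cycle.
        enables.append(any(previous_edge <= t < edge for t in launches))
    return enables


print(slow_to_fast_enables())       # 66 MHz -> 100 MHz
print(slow_to_fast_enables(1, 2))   # 66 MHz -> 133 MHz
```

A real design also has to guarantee setup and hold margins at each capture edge, which this model leaves out.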
16
Slow to Fast Clock Domain
[Figure: gear box module and timing diagram. The circuit shows a flop clocked by SlowCLK in the slow clock domain, combinational logic, a latch in the gear box module gated by LatchEn, a 1/0 select controlled by CtlMask, and a flop clocked by FastCLK in the fast clock domain. The timing diagram shows SlowCLK (66 MHz), FastCLK (100 MHz) with phases 1 to 3, Data (D0, D1, D2), LatchEn, Latched Data (D0, D1, D2) and CtlMask, with the sample points on the FastCLK domain marked.]
17
Fast to Slow Clock Domain
[Figure: gear box module and timing diagram. The circuit shows a flop clocked by FastCLK in the fast clock domain, combinational logic, a latch in the gear box module gated by LatchEn, a 0/1 select, and a flop clocked by SlowCLK in the slow clock domain. The timing diagram shows SlowCLK (66 MHz), FastCLK (100 MHz) with phases 1 to 3, Data (D0 to D4), LatchEn and Latched Data (D0, D1, D4), with the sample points on the FastCLK domain marked.]
18
Performance
• Bypassing:
– Can bypass MRO and BIU under low system loads for a 25% reduction in latency of CPU reads from main memory
– Overall performance boost equivalent to ~1/2 CPU speed bin on Ziff-Davis WinBench® benchmark
• SPECint95 and SPECfp95 benchmarks on AMD Athlon and Pentium® III
– 512K L2 cache, single PC100 128MB DIMM
– http://www.amd.com/products/cpg/athlon/benchmarks.html
19
Performance
[Figure: bar charts of Specfp_base95 and Specint_base95 for AMD Athlon™ 600MHz [512K], AMD Athlon 550MHz [512K] and Pentium® III 550MHz [512K], normalized so that Pentium III performance = 1 (100%). The two AMD Athlon bars read 146% and 153% on Specfp_base95, and 109% and 118% on Specint_base95.]
Benchmark System Configuration: Diamond 770 using nVidia TNT2 Ultra 150MHz core, 183MHz memory clock 32MB, Western Digital Expert 41800, Single PC-100 128MB DIMM, SoundBlaster Live (Value) Audio, LinkSys HPN100 Ethernet card, Toshiba 6X DVD SD-M1212, Dual Boot Windows® 98 & Windows NT® 4.0 using Norton System Commander. Windows NT 4.0 is installed with SP4, Windows 98 with DX 6.1A build 2150, and nVidia TNT2 Ultra Driver Rev 1.81 under Windows 98 and Windows NT.
AMD Athlon™ processor-based system: Reference Motherboard Rev. B*, Bios Rev AFTB00-2, Bus Mastering EIDE Driver v1.03, AGP miniport v4.41.
Pentium® III processor-based system: ASUS P2B Rev 1.02, BIOS Rev 1008 beta 4, EIDE-BM Driver 5/11/98, AGP miniport 5/11/98.
* This motherboard is not commercially available at this time.
20
Silicon Statistics
Chip Version         Tech & Voltage   Max Core Speed   Die Size (pad limited)   No. of Pins
SDRAM, 1P, 2xAGP     0.35 µ, 3.3V     100 MHz          107 mm2                  492
SDRAM, 2P, 2xAGP     0.35 µ, 3.3V     100 MHz          130 mm2                  656
DDR, 1P, 4xAGP       0.25 µ, 2.5V     133 MHz          133 mm2                  553
DDR, 2P, 4xAGP       0.25 µ, 2.5V     133 MHz
RDRAM, 1P, 4xAGP     0.25 µ, 2.5V     133 MHz          107 mm2                  492
21
Die photo: SDRAM, 2P, 2xAGP
[Figure: die photo with labeled blocks MRO, PCI/APC, AGP, CFG, MCT, BIU0, BIU1, FY0, FY1 and PLL.]
Approx. 500K gates
11.43 x 11.43 mm2
22
Conclusions
• Low cost, high performance system solution for the AMD Athlon processor - EV6 in a PC
• Multiprocessing architecture for workstation and server markets
• Design provides for a high degree of concurrency which optimizes throughput under heavy system loads
• Supports three memory sub-systems:
– PC-100 SDRAM
– Double Data Rate SDRAM
– Direct Rambus SDRAM