Xbox 360 System Architecture

Jeff Andrews
Nick Baker
Microsoft Corp.

This article covers the Xbox 360’s high-level technical requirements, a short system overview, and details of the CPU and the GPU. The authors describe their architectural trade-offs and summarize the system’s software programming support.

0272-1732/06/$20.00 © 2006 IEEE. Published by the IEEE Computer Society.

Microsoft’s Xbox 360 game console is the first of the latest generation of game consoles. Historically, game console architecture and design implementations have provided large discrete jumps in system performance, approximately at five-year intervals. Over the last several generations, game console systems have increasingly become graphics supercomputers in their own right, particularly at the launch of a given game console generation.

The Xbox 360, pictured in Figure 1, contains an aggressive hardware architecture and implementation targeted at game console workloads. The core silicon implements the product designers’ goal of providing game developers a hardware platform to implement their next-generation game ambitions. The core chips include the standard conceptual blocks of CPU, graphics processing unit (GPU), memory, and I/O. Each of these components and their interconnections are customized to provide a user-friendly game console product.

Design principles

One of the Xbox 360’s main design principles is the next-generation gaming principle; that is, a new game console must provide value to customers for five to seven years. Thus, as for any true next-generation game console hardware, the Xbox 360 delivers a huge discrete jump in hardware performance for gaming.

The Xbox 360 hardware design team had to translate the next-generation gaming principle into useful feature requirements and next-generation game workloads. For the game workloads, the designers’ direction came from interaction with game developers, including game engine developers, middleware developers, tool developers, API and driver developers, and game performance experts, both inside and outside Microsoft.

One key next-generation game feature requirement was that the Xbox 360 system must implement a 720p (progressive scan) pervasive high-definition (HD), 16:9 aspect ratio screen in all Xbox 360 games. This feature’s architectural implication was that the Xbox 360 required a huge, reliable fill rate.

Another design principle of the Xbox 360 architecture was that it must be flexible to suit the dynamic range of game engines and game developers. The Xbox 360 has a balanced hardware architecture for the software game pipeline, with homogeneous, reallocatable hardware resources that adapt to different game genres, different developer emphases, and even to varying workloads within a frame of a game. In contrast, heterogeneous hardware resources lock software game pipeline performance in each stage and are not reallocatable. Flexibility helps make the design “futureproof.” The Xbox 360’s three CPU cores, 48 unified shaders, and 512-Mbyte DRAM main memory will enable developers to create innovative games for the next five to seven years.

A third design principle was programmability; that is, the Xbox 360 architecture must be easy to program and develop software for. The silicon development team spent much time listening to software developers (we are hardware folks at a software company, after all). There was constant interaction and iteration with software developers at the very beginning of the project and all along the architecture and implementation phases.

This interaction had an interesting dynamic. The software developers weren’t shy about their hardware likes and dislikes. Likewise, the hardware team wasn’t shy about where next-generation hardware architecture and design were going as a result of changes in silicon processes, hardware architecture, and system design. What followed was further iteration on planned and potential workloads.

An important part of Xbox 360 programmability is that the hardware must present the simplest APIs and programming models to let game developers use hardware resources effectively. We extended programming models that developers liked. Because software developers liked the first Xbox, using it as a working model was natural for the teams. In listening to developers, we did not repackage or include hardware features that developers did not like, even though that may have simplified the hardware implementation. We considered the software tool chain from the very beginning of the project.

Another major design principle was that the Xbox 360 hardware be optimized for achievable performance. To that end, we designed a scalable architecture that provides the greatest usable performance per square millimeter while remaining within the console’s system power envelope.

As we continued to work with game developers, we scaled chip implementations to result in balanced hardware for the software game pipeline. Examples of higher-level implementation scalability include the number of CPU cores, the number of GPU shaders, CPU L2 size, bus bandwidths, and main memory size. Other scalable items represented smaller optimizations in each chip.

Hardware designed for games

Figure 2 shows a top-level diagram of the Xbox 360 system’s core silicon components. The three identical CPU cores share an 8-way set-associative, 1-Mbyte L2 cache and run at 3.2 GHz. Each core contains a complement of four-way single-instruction, multiple data (SIMD) vector units.1 The CPU L2 cache, cores, and vector units are customized for Xbox 360 game and 3D graphics workloads.

The front-side bus (FSB) runs at 5.4 Gbit/pin/s, with 16 logical pins in each direction, giving a 10.8-Gbyte/s read and a 10.8-Gbyte/s write bandwidth. The bus design and the CPU L2 provide added support that allows the GPU to read directly from the CPU L2 cache.
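The FSB figures quoted above follow from simple arithmetic. The sketch below reproduces them; the function and constant names are illustrative and do not come from the article.

```python
def link_bandwidth_gbytes(gbit_per_pin: float, pins: int) -> float:
    """Peak bandwidth in Gbytes/s for one direction of a parallel link."""
    return gbit_per_pin * pins / 8  # 8 bits per byte


FSB_GBIT_PER_PIN = 5.4        # per-pin signaling rate quoted in the article
FSB_PINS_PER_DIRECTION = 16   # logical pins in each direction, per the article

fsb_read = link_bandwidth_gbytes(FSB_GBIT_PER_PIN, FSB_PINS_PER_DIRECTION)
fsb_write = link_bandwidth_gbytes(FSB_GBIT_PER_PIN, FSB_PINS_PER_DIRECTION)

if __name__ == "__main__":
    print(f"FSB read:  {fsb_read:.1f} Gbytes/s")
    print(f"FSB write: {fsb_write:.1f} Gbytes/s")
```

Because the read and write paths are separate, the two 10.8-Gbyte/s figures are independent of each other.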

As Figure 2 shows, the I/O chip supports abundant I/O components. The Xbox media audio (XMA) decoder, custom-designed by Microsoft, provides on-the-fly decoding of a large number of compressed audio streams in hardware. Other custom I/O features include the NAND flash controller and the system management controller (SMC).

HOT CHIPS 17

IEEE MICRO

Figure 1. Xbox 360 game console and wireless controller.

The GPU 3D core has 48 parallel, unified shaders. The GPU also includes 10 Mbytes of embedded DRAM (EDRAM), which runs at 256 Gbytes/s for reliable frame and z-buffer bandwidth. The GPU includes interfaces between the CPU, I/O chip, and the GPU internals.

The 512-Mbyte unified main memory controlled by the GPU is a 700-MHz graphics-double-data-rate-3 (GDDR3) memory, which operates at 1.4 Gbit/pin/s and provides a total main memory bandwidth of 22.4 Gbytes/s.
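As a quick consistency check on these memory figures, the sketch below derives the per-pin rate from the 700-MHz clock and then works backward to the number of data pins the total bandwidth implies. The article does not state the bus width, so the 128-pin result is an inference from the quoted numbers, and the function names are illustrative.

```python
def ddr_rate_gbit_per_pin(clock_mhz: float) -> float:
    """Double data rate: two transfers per clock cycle on each data pin."""
    return clock_mhz * 2 / 1000


def implied_data_pins(total_gbytes_per_s: float, gbit_per_pin: float) -> float:
    """Data pins needed to reach a total bandwidth at a given per-pin rate."""
    return total_gbytes_per_s * 8 / gbit_per_pin


rate = ddr_rate_gbit_per_pin(700)     # 700 MHz quoted in the article
pins = implied_data_pins(22.4, rate)  # 22.4 Gbytes/s quoted in the article

if __name__ == "__main__":
    print(f"per-pin rate: {rate:.1f} Gbit/pin/s")
    print(f"implied data pins (inferred, not stated): {round(pins)}")
```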

The DVD and HDD ports are serial ATA (SATA) interfaces. The analog chip drives the HD video out.

CPU chip

Figure 3 shows the CPU chip in greater detail. Microsoft’s partner for the Xbox 360 CPU is IBM. The CPU implements the PowerPC instruction set architecture,2-4 with the VMX SIMD vector instruction set (VMX128) customized for graphics workloads.

The shared L2 allows fine-grained, dynamic allocation of cache lines between the six threads. Commonly, game workloads significantly vary in working-set size. For example, scene management requires walking larger, random-miss-dominated data structures, similar to database searches. At the same time, audio, Xbox procedural synthesis (described later), and many other game processes that require smaller working sets can run concurrently. The shared L2 allows workloads needing larger working sets to allocate significantly more of the L2 than would be available if the system used private L2s (of the same total L2 size) instead.

The CPU core has two-per-cycle, in-order instruction issuance. A separate vector/scalar issue queue (VIQ) decouples instruction issuance between integer and vector instructions for nondependent work. There are two symmetric multithreading (SMT),5 fine-grained hardware threads per core. The L1 caches include a two-way set-associative, 32-Kbyte L1 instruction cache and a four-way set-associative, 32-Kbyte L1 data cache. The write-through data cache does not allocate cache lines on writes.

MARCH–APRIL 2006

Figure 2. Xbox 360 system block diagram. [Diagram blocks: CPU (Core 0, Core 1, Core 2, each with L1D and L1I; 1-Mbyte L2); GPU (BIU/IO interface, 3D core, MC0, MC1, 10-Mbyte EDRAM, video out); 512-Mbyte DRAM memory; analog chip (video out); I/O chip (XMA decoder, SMC) with DVD (SATA), HDD port (SATA), front controllers (2 USB), wireless controllers, MU ports (2 USB), rear panel USB, Ethernet, IR, audio out, flash, and system control.]

Legend: BIU, bus interface unit; MC, memory controller; HDD, hard disk drive; MU, memory unit; IR, infrared receiver; SMC, system management controller; XMA, Xbox media audio.

The integer execution pipelines include branch, integer, and load/store units. In addition, each core contains an IEEE-754-compliant scalar floating-point unit (FPU), which includes single- and double-precision support at full hardware throughput of one operation per cycle for most operations. Each core also includes the four-way SIMD VMX128 units: floating-point (FP), permute, and simple. As the name implies, the VMX128 includes 128 registers, of 128 bits each, per hardware thread to maximize throughput.

The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.

Another addition we made to the VMX128 was Direct3D (D3D) compressed data formats,6-8 the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.
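To make the 50 percent figure concrete, here is a minimal sketch of one kind of compressed vertex format: components in the range [-1, 1] stored as signed 16-bit normalized integers instead of 32-bit floats. The article does not list the specific formats, so this particular encoding, the function names, and the byte order are assumptions chosen only for illustration.

```python
import struct


def pack_snorm16(values):
    """Quantize floats in [-1, 1] to signed 16-bit normalized integers."""
    ints = []
    for v in values:
        clamped = max(-1.0, min(1.0, v))
        ints.append(int(round(clamped * 32767)))
    return struct.pack(f">{len(ints)}h", *ints)


def unpack_snorm16(blob):
    """Expand signed 16-bit normalized integers back to floats."""
    count = len(blob) // 2
    return [i / 32767 for i in struct.unpack(f">{count}h", blob)]


normal = [0.0, 0.6, -0.8, 1.0]          # hypothetical 4-component vector
full = struct.pack(">4f", *normal)      # 16 bytes as 32-bit floats
packed = pack_snorm16(normal)           # 8 bytes as 16-bit normalized values
restored = unpack_snorm16(packed)

if __name__ == "__main__":
    print(len(full), "bytes uncompressed,", len(packed), "bytes packed")
    print("restored:", restored)
```

The trade-off is the one the article describes for these formats in general: fewer bits in exchange for a restricted range and reduced precision.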


Figure 3. Xbox 360 CPU block diagram. [Diagram blocks: three cores (Core 0, Core 1, Core 2), each with a 32-Kbyte L1I, a 32-Kbyte L1D, an instruction unit, branch, VIQ, Int, and load/store units, a VSU (VMX FP, VMX perm, VMX simp, FPU), and an MMU; node crossbar/queuing; L2 (L2 directories, L2 data, uncached units); PIC; bus interface to the front side bus (FSB); test, debug, clocks, and temperature sensor.]

Legend: VSU, vector/scalar unit; Perm, permute; Simp, simple; MMU, main-memory unit; Int, integer; PIC, programmable interrupt controller; FPU, floating point unit; VIQ, vector/scalar issue queue.

CPU data streaming

In the Xbox 360, we paid considerable attention to enabling data-streaming workloads, which are not typical PC or server workloads. We added features that allow a given CPU core to execute a high-bandwidth workload (both read and write, but particularly write), while avoiding thrashing its own cache and the shared L2.

First, some features shared among the CPU cores help data streaming. One of these is 128-byte cache line sizes in all the CPU L1 and L2 caches. Larger cache line sizes increase FSB and memory efficiency. The L2 includes a cache-set-locking functionality, common in embedded systems but not in PCs.

Specific features that improve streaming bandwidth for writes and reduce thrashing include the write-through L1 data caches. Also, there is no write allocation of L1 data cache lines when writes miss in the L1 data cache. This is important for write streaming because it keeps the L1 data cache from being thrashed by high-bandwidth transient write-only data streams.

We significantly upgraded write gathering in the L2. The shared L2 has an uncached unit for each CPU core. Each uncached unit has four noncached write-gathering buffers that allow multiple streams to concurrently gather and dump their gathered payloads to the FSB yet maintain very high uncached write-streaming bandwidth.

The cacheable write streams are gathered by eight nonsequential gathering buffers per CPU core. This allows programming flexibility in the write patterns of cacheable very high bandwidth write streams into the L2. The write streams can randomly write within a window of a few cache lines without the writes backing up and causing stalls. The cacheable write-gathering buffers effectively act as a bandwidth compression scheme for writes. This is because the L2 data arrays see a much lower bandwidth than the raw bandwidth required by a program’s store pattern, which would have low utilization of the L2 cache arrays. Data transformation workloads commonly don’t generate the data in a way that allows sequential write behavior. If the write-gathering buffers were not present, software would have to effectively gather write data in the register set before storing. This would put a large amount of pressure on the number of registers and increase latency (and thus reduce throughput) of inner loops of computation kernels.
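The bandwidth-compression effect of write gathering can be illustrated with a toy model. The sketch below counts how many writes reach the L2 data arrays when stores are merged in a small pool of gathering buffers. The eight buffers and the 64-byte gathering range come from the article; the oldest-first eviction policy, the absence of timeouts, and all names are simplifying assumptions, not a description of the actual hardware.

```python
from collections import OrderedDict


def count_l2_writes(store_addresses, num_buffers=8, range_bytes=64):
    """Count writes reaching the L2 when stores are merged in gathering buffers.

    Each buffer collects stores that fall in one aligned `range_bytes` block.
    When a store needs a new block and every buffer is busy, the oldest
    buffer is flushed to the L2 as a single write (assumed policy).
    """
    buffers = OrderedDict()  # block index -> set of byte offsets gathered
    l2_writes = 0
    for addr in store_addresses:
        block = addr // range_bytes
        if block not in buffers:
            if len(buffers) == num_buffers:
                buffers.popitem(last=False)  # flush the oldest buffer
                l2_writes += 1
            buffers[block] = set()
        buffers[block].add(addr % range_bytes)
    return l2_writes + len(buffers)  # remaining buffers drain at the end


# 32 sixteen-byte stores covering a 512-byte window (four 128-byte cache
# lines), issued in a scrambled, nonsequential order.
scrambled = [((i * 13) % 32) * 16 for i in range(32)]

gathered = count_l2_writes(scrambled)                 # eight buffers
ungathered = count_l2_writes(scrambled, num_buffers=1)

if __name__ == "__main__":
    print("stores issued:", len(scrambled))
    print("L2 writes with 8 buffers:", gathered)
    print("L2 writes with 1 buffer:", ungathered)
```

With eight buffers the scrambled stores collapse into one L2 write per 64-byte block, while a single buffer is flushed on nearly every store, which is the stall-prone behavior the gathering buffers are meant to avoid.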

We applied similar customization to read streaming. For each CPU core, there are eight outstanding loads/prefetches. A custom prefetch instruction, extended data cache block touch (xDCBT), prefetches data, but delivers to the requesting CPU core’s L1 data cache and never puts data in the L2 cache as regular prefetch instructions do. This modification seems minor, but it is very important because it allows higher bandwidth read streaming workloads to run on as many threads as desired without thrashing the L2 cache. Another option we considered for read streaming would be to lock a set of the L2 per thread for read streaming. In that case, if a user wanted to run four threads concurrently, half the L2 cache would be locked down, hurting workloads requiring a large L2 working-set size. Instead, read streaming occurs through the L1 data cache of the CPU core on which the given thread is operating, effectively giving a private read-streaming first in, first out (FIFO) area per thread.
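The "half the L2" figure for the rejected locking scheme can be checked with a few lines of arithmetic. This sketch assumes that a lockable "set" corresponds to one of the eight ways of the 1-Mbyte L2, which is consistent with the 128 Kbytes per set mentioned later in the article; the names are illustrative.

```python
L2_BYTES = 1024 * 1024   # 1-Mbyte shared L2, per the article
L2_WAYS = 8              # 8-way set-associative, per the article
WAY_BYTES = L2_BYTES // L2_WAYS  # size of one lockable set (assumed = one way)


def locked_fraction(threads: int, sets_per_thread: int = 1) -> float:
    """Fraction of the L2 locked if each streaming thread gets its own set."""
    locked = threads * sets_per_thread
    if locked > L2_WAYS:
        raise ValueError("cannot lock more sets than the cache has ways")
    return locked / L2_WAYS


if __name__ == "__main__":
    print("one lockable set:", WAY_BYTES // 1024, "Kbytes")
    print("fraction locked with four threads:", locked_fraction(4))
```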

A system feature planned early in the Xbox 360 project was to allow the GPU to directly read data produced by the CPU, with the data never going through the CPU cache’s backing store of main memory. In a specific case of this data streaming, called Xbox procedural synthesis (XPS), the CPU is effectively a data decompressor, procedurally generating geometry on-the-fly for consumption by the GPU 3D core. For 3D games, XPS allows a far greater amount of differentiated geometry than simple traditional instancing allows, which is very important for filling large HD screen worlds with highly detailed geometry.

We added two features specifically to support XPS. The first was support in the GPU and the FSB for a 128-byte GPU read from the CPU. The other was to directly lower communication latency from the GPU back to the CPU by extending the GPU’s tail pointer write-back feature.

Tail pointer write-back is a method of controlling communication from the GPU to the CPU by having the CPU poll on a cacheable location, which is updated when a GPU instruction writes an update to the pointer. The system coherency scheme then updates the polling read with the GPU’s updated pointer value. Tail write-backs reduce communication latency compared to using interrupts. We lowered GPU-to-CPU communication latency even further by implementing the tail pointer’s backing-store target on the CPU die. This avoids the round-trip from CPU to memory when the GPU pointer update causes a probe and castout of the CPU cache data, requiring the CPU to refetch the data all the way from memory. Instead the refetch never leaves the CPU die. This lower latency translates into smaller streaming FIFOs in the L2’s locked set.

A previously mentioned feature very important to XPS is the addition of D3D compressed formats that we implemented in both the CPU and the GPU. To get an idea of this feature’s usefulness, consider this: Given a typical average of 2:1 compression and an XPS-targeted 9 Gbytes/s FSB bandwidth, the CPU cores can generate up to 18 Gbytes/s of effective geometry and other graphics data and ship it to the GPU 3D core. Main memory sees none of this data traffic (or footprint).

CPU cached data-streaming example

Figure 4 illustrates an example of the Xbox 360 using its data-streaming features for an XPS workload. Consider the XPS workload, acting as a decompression kernel running on one or more CPU SMT hardware threads. First, the XPS kernel must fetch new, unique data from memory to enable generation of the given piece of geometry. This likely includes world space coordinate data and specific data to make each geometry instance unique. The XPS kernel prefetches this read data during a previous geometry generation iteration to cover the fetch’s memory latency. Because none of the per-instance read data is typically reused between threads, the XPS kernel fetches it using the xDCBT prefetch instruction around the L2, which puts it directly into the requesting CPU core’s L1 data cache. Prefetching around the L2 separates the read data stream from the write data stream, avoiding L2 cache thrashing. Figure 4 shows this step as a solid-line arc from memory to Core 0’s L1 data cache.

Figure 4. CPU cached data-streaming example. [The CPU block diagram of Figure 3, annotated with the data path: xDCBT 128-byte prefetch around L2, into L1 data cache (from memory); D3D compressed data, VMX stores to L2; nonsequential gathering, locked set in L2; GPU 128-byte read from L2 (to GPU).]

The XPS kernel then crunches the data, primarily using the VMX128 computation ability to generate far more geometry data than the amount read from memory. Before the data is written out, the XPS kernel compresses it, using the D3D compressed data formats, which offer simple trade-offs between number of bits, range, and precision. The XPS kernel stores these results as generated to the locked set in the L2, with only minimal attention to the write access pattern’s randomness (for example, the kernel places write accesses within a few cache lines of each other for efficient gathering). Furthermore, because of the write-through and no-write-allocate nature of the L1 data caches, none of the write data will thrash the L1 data cache of the CPU core. The diagram shows this step as a dashed-line arc from load/store in Core 0 to the locked set in L2.

Once the CPU core has issued the stores, the store data sits in the gathering buffers waiting for more data until timed out or forced out by incoming write data demanding new 64-byte ranges. The XPS output data is written to software-managed FIFOs in the L2 data arrays in a locked set in the L2 (the unshaded box in Figure 4). There are multiple FIFOs in one locked set, so multiple threads can share one L2 set. This is possible within 128 Kbytes of one set because tail pointer write-back communication frees completed FIFO area with lowered latency. Using the locked set is important; otherwise, high-bandwidth write streams would thrash the L2 working set.

Next, when more data is available to the GPU, the CPU notifies the GPU that the GPU can advance within the FIFO, and the GPU performs 128-byte reads to the FSB. This step is shown in the diagram as the dotted-line arc starting in the L2 and going to the GPU. The GPU design incorporates special features allowing it to read from the FSB, in contrast with the normal GPU read from main memory. The GPU also has an added 128-byte fetch, which enables maximum FSB and L2 data array utilization.

The two final steps are not shown in the diagram. First, the GPU uses the corresponding D3D compressed data format support to expand the compressed D3D formats into single-precision floating-point formats native to the 3D core. Then, the GPU commands tail pointer write-backs to the CPU to indicate that the GPU has finished reading data. This tells the streaming FIFOs’ CPU software control that the given FIFO space is now free to be written with new geometry or index data.
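The producer/consumer handshake described in this example can be sketched as a software-managed ring buffer. The model below is single-threaded and purely illustrative: the class and method names are hypothetical, the "GPU" is just a method call, and the tail pointer write-back is modeled as an ordinary field update that the producer checks before writing.

```python
class StreamFifo:
    """Single-producer, single-consumer ring buffer with a written-back tail."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0  # total items written by the producer (the CPU kernel)
        self.tail = 0  # total items the consumer (the GPU) reports as read

    def free_slots(self):
        return self.capacity - (self.head - self.tail)

    def try_write(self, item):
        """Producer side: check the tail and write only if space is free."""
        if self.free_slots() == 0:
            return False  # FIFO full; the producer would keep polling
        self.slots[self.head % self.capacity] = item
        self.head += 1
        return True

    def read_and_write_back(self, count):
        """Consumer side: read up to `count` items, then publish the new tail."""
        n = min(count, self.head - self.tail)
        items = [self.slots[(self.tail + i) % self.capacity] for i in range(n)]
        self.tail += n  # models the tail pointer write-back
        return items


if __name__ == "__main__":
    fifo = StreamFifo(capacity=2)
    print(fifo.try_write("batch-a"), fifo.try_write("batch-b"))
    print(fifo.try_write("batch-c"))   # no space until the tail advances
    print(fifo.read_and_write_back(1))
    print(fifo.try_write("batch-c"))   # space was freed by the write-back
```

The sooner the consumer publishes its tail, the sooner the producer can reuse a slot, which is why lower write-back latency allows smaller FIFOs in the locked set.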

Figure 5 shows a photo of the CPU die, which contains 165 million transistors in an IBM second-generation 90-nm silicon-on-insulator (SOI) enhanced transistor process.


Figure 5. Xbox 360 CPU die photo (courtesy of IBM).

Graphics processing unit

The GPU is the latest-generation graphics processor from ATI. It runs at 500 MHz and consists of 48 parallel, combined vector and scalar shader ALUs. Unlike earlier graphics engines, the shaders are dynamically allocated, meaning that there are no distinct vertex or pixel shader engines; the hardware automatically adjusts to the load on a fine-grained basis. The hardware is fully compatible with D3D 9.0 and High-Level Shader Language (HLSL) 3.0,9,10 with extensions.

The ALUs are 32-bit IEEE 754 floating-point ALUs, with relatively common graphics simplifications of rounding modes, denormalized numbers (flush to zero on reads), NaN handling, and exception handling. They are capable of vector (including dot product) and scalar operations with single-cycle throughput; that is, all operations issue every cycle. The superscalar instructions encode vector, scalar, texture load, and vertex fetch within one instruction. This allows peak processing of 96 shader calculations per cycle while fetching textures and vertices.
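The peak figure of 96 shader calculations per cycle is consistent with each of the 48 ALUs issuing one vector and one scalar operation per cycle. The sketch below spells out that reading; the two-operations-per-ALU factor is our interpretation of "combined vector and scalar" rather than a number stated in the article, and the constant names are illustrative.

```python
SHADER_ALUS = 48                # from the article
OPS_PER_ALU_PER_CYCLE = 2       # assumed: one vector plus one scalar operation
GPU_CLOCK_HZ = 500_000_000      # 500 MHz, from the article


def peak_ops_per_cycle() -> int:
    return SHADER_ALUS * OPS_PER_ALU_PER_CYCLE


def peak_ops_per_second() -> int:
    return peak_ops_per_cycle() * GPU_CLOCK_HZ


if __name__ == "__main__":
    print("peak shader calculations per cycle:", peak_ops_per_cycle())
    print("peak shader calculations per second:", peak_ops_per_second())
```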

Feeding the shaders are 16 texture fetch engines, each capable of producing a filtered result in each cycle. In addition, there are 16 programmable vertex fetch engines with built-in tessellation that the system can use instead of CPU geometry generation. Finally, there are 16 interpolators in dedicated hardware.

The render back end can sustain eight pixels per cycle or 16 pixels per cycle for depth and stencil-only rendering (used in z-prepass or shadow buffers). The dedicated z or blend logic and the EDRAM guarantee that eight pixels per cycle can be maintained even with 4× antialiasing and transparency. The z-prepass is a technique that performs a first-pass rendering of a command list, with no rendering features applied except occlusion determination. The z-prepass initializes the z-buffer so that on a subsequent rendering pass with full texturing and shaders applied, the GPU won’t spend shader and texturing resources on occluded pixels. With modern scene depth complexity, this technique significantly improves rendering performance, especially with complex shader programs.
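The benefit of a z-prepass can be shown with a small software model that counts how many fragments run the full shader. This is a conceptual sketch only: fragments are hypothetical (pixel, depth) pairs, a smaller depth is assumed to be nearer the viewer, and the z-test is assumed to reject fragments before shading.

```python
def shaded_without_prepass(fragments):
    """Draw in submission order with a normal z-test; count shaded fragments."""
    zbuf = {}
    shaded = 0
    for pixel, depth in fragments:
        if depth < zbuf.get(pixel, float("inf")):
            zbuf[pixel] = depth
            shaded += 1  # passed the z-test, so the full shader ran
    return shaded


def shaded_with_prepass(fragments):
    """Depth-only first pass, then shade only the visible fragments."""
    zbuf = {}
    for pixel, depth in fragments:  # pass 1: occlusion determination only
        if depth < zbuf.get(pixel, float("inf")):
            zbuf[pixel] = depth
    # Pass 2: a fragment is shaded only if it matches the final z-buffer.
    return sum(1 for pixel, depth in fragments if depth == zbuf[pixel])


# Worst case for a single pass: three layers drawn back to front at pixel 0.
back_to_front = [(0, 3.0), (0, 2.0), (0, 1.0)]

if __name__ == "__main__":
    print("shaded, single pass:", shaded_without_prepass(back_to_front))
    print("shaded, with z-prepass:", shaded_with_prepass(back_to_front))
```

The saving grows with scene depth complexity and with shader cost, which matches the article's point about complex shader programs.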

As an example benchmark, the GPU can render each pixel with 4× antialiasing, a z-buffer, six shader operations, and two texture fetches and can sustain this at eight pixels per cycle. This blazing fill rate enables the Xbox 360 to deliver HD-resolution rendering simultaneously with many state-of-the-art effects that traditionally would be mutually exclusive because of fill rate limitations. For example, games can mix particle, high-dynamic-range (HDR) lighting, fur, depth-of-field, motion blur, and other complex effects.

For next-generation geometric detail, shading, and fill rate, the pipeline’s front end can process one triangle or vertex per cycle. These are essentially full-featured vertices (rather than a single parameter), with the practical limitation of required memory bandwidth and storage. To overcome this limitation, several compressed formats are available for each data type. In addition, XPS can transiently generate data on the fly within the CPU and pass it efficiently to the GPU without a main memory pass.

The EDRAM removes the render target and z-buffer fill rate from the bandwidth equation. The EDRAM resides on a separate die from the main portion of GPU logic. The EDRAM die also contains dedicated alpha blend, z-test, and antialiasing logic. The interface to the EDRAM macro runs at 256 Gbytes/s: (8 pixels/cycle + 8 z-compares/cycle) × (read + write) × 32 bits/sample × 4 samples/pixel × 500 MHz.
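The 256-Gbytes/s figure can be reproduced directly from the factors in the formula above. The sketch below multiplies them out, taking 1 Gbyte as 10^9 bytes; the function and parameter names are illustrative.

```python
def edram_bandwidth_bytes_per_s(
    pixels_per_cycle=8,
    z_compares_per_cycle=8,
    accesses=2,            # one read plus one write
    bytes_per_sample=4,    # 32 bits per sample
    samples_per_pixel=4,   # 4x antialiasing
    clock_hz=500_000_000,  # 500 MHz
):
    """Multiply out the EDRAM interface bandwidth formula from the article."""
    per_cycle = (
        (pixels_per_cycle + z_compares_per_cycle)
        * accesses
        * bytes_per_sample
        * samples_per_pixel
    )
    return per_cycle * clock_hz


if __name__ == "__main__":
    total = edram_bandwidth_bytes_per_s()
    print(f"{total / 1e9:.0f} Gbytes/s")
```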

The GPU supports several pixel depths; 32 bits per pixel (bpp) and 64 bpp are the most common, but there is support for up to 128 bpp for multiple-render-target (MRT) or floating-point output. MRT is a graphics technique of outputting more than one piece of data per sample to the effective frame buffer, interleaved efficiently to minimize the performance impact of having more data. The data is used later for a variety of advanced graphics effects. To optimize space, the GPU supports 32-bpp and 64-bpp HDR lighting formats. The EDRAM only supports rendering operations to the render target and z-buffer. For render-to-texture, the GPU must “flush” the appropriate buffer to main memory before using the buffer as a texture.

Unlike a fine-grained tiler architecture, the GPU can achieve common HD resolutions and bit depths within a couple of EDRAM tiles. This simplifies the problem substantially. Traditional tiling architectures typically include a whole process inserted in the traditional graphics pipeline for binning the geometry into a large number of bins. Handling the bins in a high-performance manner is complicated (for example, overflow cases, memory footprint, and bandwidth). Because the GPU’s EDRAM usually requires only a couple of bins, bin handling is greatly simplified, allowing more-optimal hardware-software partitioning.
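A rough estimate shows why only a couple of EDRAM tiles are needed at HD resolution. The sketch below assumes that each sample stores 32 bits of color and 32 bits of z, that both live in the 10-Mbyte EDRAM, and that 10 Mbytes means 10 × 2^20 bytes; these are working assumptions for illustration, and real tile sizing may involve alignment constraints not modeled here.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # 10 Mbytes of EDRAM, per the article


def tiles_needed(width, height, samples, color_bytes=4, z_bytes=4):
    """Number of EDRAM-sized tiles needed to hold the color and z buffers."""
    footprint = width * height * samples * (color_bytes + z_bytes)
    return math.ceil(footprint / EDRAM_BYTES)


if __name__ == "__main__":
    for samples in (1, 2, 4):
        tiles = tiles_needed(1280, 720, samples)
        print(f"720p with {samples}x sampling: {tiles} tile(s)")
```

Under these assumptions, 720p fits in one tile without antialiasing and needs two or three tiles with multisampling, far fewer than the large bin counts of a fine-grained tiler.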

With a binning architecture, the full command list must be presented before rendering. The hardware uses a few tricks to speed this process up. Rendering increasingly relies on a z-prepass to prepare the z-buffer before executing complex pixel shader algorithms. We take advantage of this by collecting object extent information during this pass, as well as priming a full-resolution hierarchical z-buffer. We use the extent information to set flags to skip command list sections not needed within a tile. The full-resolution hi-z buffer retains its state between tiles.

In another interesting extension to normal D3D, the GPU supports a shader export feature that allows data to be output directly from the shader to a buffer in memory. This lets the GPU serve as a vector math engine if needed, as well as allowing multipass shaders. The latter can be useful for subdivision surfaces. In addition, the display pipeline includes an in-line scaler that resizes the frame buffer on the fly as it is output. This feature allows games to pick a rendering resolution to work with and then lets the display hardware make the best match to the display resolution.

As Figure 6 shows, the GPU consists of the following blocks:

• Bus interface unit. This interface to the FSB handles CPU-initiated transactions, as well as GPU-initiated transactions such as snoops and L2 cache reads.

• I/O controller. Handles all internal memory-mapped I/O accesses, as well as transactions to and from the I/O chip via the two-lane PCI-Express bus (PCI-E).

• Memory controllers (MC0, MC1). These 128-byte interleaved GDDR3 memory controllers contain aggressive address tiling for graphics and a fast path to minimize CPU latency.

• Memory interface. Memory crossbar and buffering for non-CPU initiators (such as graphics, I/O, and display).

• Graphics. This block, the largest on the chip, contains the rendering engine.

• High-speed I/O bus. This bus between the graphics core and the EDRAM die is a chip-to-chip bus (via substrate) operating at 1.8 GHz and 28.8 Gbytes/s. When multisample antialiasing is used, only pixel center data and coverage information is transferred and then expanded on the EDRAM die.

• Antialiasing and Alpha/Z (AA+AZ). Handles pixel-to-sample expansion, as well as z-test and alpha blend.

• Display.

Figure 6. GPU block diagram. [Diagram blocks: main die with bus interface unit (FSB), I/O controller (PCI-E), MC0 and MC1 (Mem 0, Mem 1), memory interface, display (video), and graphics (command processor, vertex assembly/tessellator, sequencer, interpolators, shader complex, shader export, vertex cache, texture cache, hi-Z, blending interface); DRAM die with AA+AZ and 10-Mbyte EDRAM; high-speed I/O bus between the two dies.]

Figures 7 and 8 show photos of the GPU “parent” and EDRAM (“daughter”) dies. The parent die contains 232 million transistors in a TSMC 90-nm GT process. The EDRAM die contains 100 million transistors in an NEC 90-nm process.

Architectural choices

The major choices we made in designing the Xbox 360 architecture were to use chip multiprocessing (CMP), in-order issuance cores, and EDRAM.

Chip multiprocessing

Our reasons for using multiple CPU cores on one chip in Xbox 360 were relatively straightforward. The combination of power consumption and diminishing returns from instruction-level parallelism (ILP) is driving the industry in general to multicore. CMP is a natural twist on traditional symmetric multiprocessing (SMP), in which all the CPU cores are symmetric and have a common view of main memory but are on the same die versus separate chips. Modern process geometries afford hardware designers the flexibility of CMP, which was usually too costly in die area previously. Having multiple cores on one chip is more cost-effective. It enables shared L2 implementation and minimizes communication latency between cores, resulting in higher overall performance for the same die area and power consumption.

In addition, we wanted to optimize the architecture for the workload, optimize in-game utilization of silicon area, and keep the system easy to program. These goals made CMP a good choice for several reasons:

First, for the game workload, both integer and floating-point performance are important. The high-level game code is generally a database management problem, with plenty of object-oriented code and pointer manipulation. Such a workload needs a large L2 and high integer performance. The CMP shared L2 with its fine-grained, dynamic allocation means this workload can use a large working set in the L2 while running. In addition, several sections of the application lend themselves well to vector floating-point acceleration.

Figure 7. Xbox 360 GPU “parent” die (courtesy of Taiwan Semiconductor Manufacturing Co.).

Figure 8. Xbox 360 GPU EDRAM (“daughter”) die (courtesy of NEC Electronics).

Second, to optimize silicon area, we can take advantage of two factors. To start with, we are presenting a stable platform for the product’s lifetime. This means tools and programming expertise will mature significantly, so we can rely more on generating code than optimizing performance at runtime. Moreover, all Xbox 360 games (as opposed to Xbox games from Microsoft’s first game console, which are emulated on Xbox 360) are compiled from scratch and optimized for the current microarchitecture. We don’t have the problem of running legacy, but compatible, instruction set architecture executables that were compiled and optimized for a completely different microarchitecture. This problem has significant implications for CPU microarchitectures in PC and server markets.

Third, although we knew multicore was the way to go, the tools and programming expertise for multithread programming are certainly not mature, presenting a problem for our goal of keeping programming easy. For the types of workloads present in a game engine, we could justify at most six to eight threads in the system. The solution was to adapt the “more-but-simpler” philosophy to the CPU core topology. The key was keeping the number of hardware threads limited, thus increasing the chance that they would be used effectively. We decided the best approach was to tightly couple dedicated vector math engines to integer cores rather than making them autonomous. This keeps the number of threads low and allows vector math routines to be optimized and run on separate threads if necessary.

In-order issuance cores

The Xbox 360 CPU contains three two-issue, in-order instruction issuance cores. Each core has two SMT hardware threads, which support fine-grained instruction issuance. The cores allow out-of-order execution in the common cases of loads and vector/floating-point versus integer instructions. Loads, which are treated as prefetches, don’t stall until a load dependency is present. Vector and floating-point operations have their own, decoupled vector/float issue queue (VIQ), which decouples vector/floating-point versus integer issuance in many cases.

We had several reasons for choosing in-order issuance. First, the die area required by in-order-issuance cores is less than that of out-of-order-issuance cores. In-order cores simplify issue logic considerably. Although not directly a big area user, out-of-order issue logic can consume extra area because it requires additional pipeline stages to meet clock period timing. Further, common implementations of out-of-order issuance and completion use rename registers and completion queues, which take significant die area.

Second, in-order implementation is more power efficient than out-of-order implementation. Keeping power levels manageable was a major issue for the design team. All the additional die area required for out-of-order issuance consumes power. Out-of-order cores commonly increase performance because their issuance, tracking, and completion enable deeper speculative instruction execution. This deeper speculation means wasted power since whole execution strings are often thrown away. Xbox 360 execution does speculate but to a lesser degree.

Third, the Xbox 360’s two SMT hardware threads per core keep the execution pipelines more fully utilized than they are in traditional in-order designs. This helps keep the execution pipelines busy without out-of-order issuance.

Finally, in-order design is simpler, aiding design and implementation. Simplicity also makes performance more predictable, simplifying programming and tool optimizations.
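The third reason above, that two SMT threads fill issue slots a single in-order thread would leave empty, can be illustrated with a toy model. The sketch below is not a model of the actual Xbox 360 pipeline. It assumes a single-issue core (the real cores are two-issue), a round-robin choice between ready threads, and made-up stall lengths standing in for load dependencies. The names `simulate` and `stalls` are hypothetical.

```python
def simulate(threads):
    """Return issue-slot utilization for a toy in-order core.

    threads: one list per hardware thread; each entry is the number of
    stall cycles that follow that instruction (an assumed stand-in for
    a load dependency). Assumes one issue slot per cycle.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle each thread may issue again
    cycle = issued = 0
    last = -1             # thread that issued most recently
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # Round-robin: start with the thread after the last issuer.
        for k in range(n):
            t = (last + 1 + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + threads[t][pc[t]]
                pc[t] += 1
                issued += 1
                last = t
                break
        cycle += 1
    return issued / cycle


# Illustrative instruction stream: every third instruction stalls 3 cycles.
stalls = [0, 0, 3, 0, 0, 3]
single = simulate([stalls])          # one thread, in order
dual = simulate([stalls, stalls])    # two SMT threads sharing the core
print(f"single thread: {single:.2f}, two threads: {dual:.2f}")
```

In this toy stream the two-thread run reaches a higher utilization than the single-thread run, because one thread issues while the other waits out part of its stall. The size of the gain depends entirely on the assumed stall pattern.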

EDRAM

HD, alpha blending, z-buffering, antialiasing, and HDR pixels take a heavy toll on memory bandwidth. Although more effects are being achieved in the shaders, postprocessing effects still require a large pixel-depth complexity. Also, as texture filtering improves, texel fetches can consume large amounts of memory bandwidth, even with complex shaders.

One approach to solving this problem is to use a wide external memory interface. This limits the ability to use higher-density memory technology as it becomes available, as well as requiring compression. Unfortunately, any compression technique must be lossless, which means unpredictable, and that is generally not good for game optimization. In addition, the required bandwidth would most likely require using a second memory controller in the CPU itself, rather than having a unified memory architecture, further reducing system flexibility.

EDRAM was the logical alternative. It has the advantage of completely removing the render target and the z-buffer bandwidth from the main-memory bandwidth equation. In addition, alpha blending and z-buffering are read-modify-write processes, which further reduce the efficiency of memory bandwidth consumption. Keeping these processes on-chip means that the remaining high-bandwidth consumers (namely, geometry and texture) are now primarily read processes. Changing the majority of main-memory bandwidth to read requests increases main-memory efficiency by reducing wasted memory bus cycles caused by turning around the bidirectional memory buses.
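A rough calculation shows how much read-modify-write traffic a render target and z-buffer can generate, which is the traffic EDRAM keeps off the main-memory bus. The sketch below is an estimate under stated assumptions, not a figure from the design team. The frame rate, depth complexity, sample count, and 4-byte color and z formats are all illustrative, and the model pessimistically assumes every covered sample is both blended and z-updated. The function name `render_target_bandwidth` is hypothetical.

```python
WIDTH, HEIGHT = 1280, 720      # 720p render target


def render_target_bandwidth(fps, depth_complexity, samples=1,
                            color_bytes=4, z_bytes=4):
    """Estimate bytes/s of color and z traffic for one render target.

    Each sample is read and written once per layer because blending
    and z-testing are read-modify-write. All defaults are assumptions.
    """
    bytes_per_touch = 2 * (color_bytes + z_bytes)   # read + write
    touches_per_frame = WIDTH * HEIGHT * samples * depth_complexity
    return touches_per_frame * bytes_per_touch * fps


base = render_target_bandwidth(fps=60, depth_complexity=4)
msaa = render_target_bandwidth(fps=60, depth_complexity=4, samples=4)
print(f"no antialiasing: {base / 1e9:.2f} GB/s")
print(f"4x multisampling: {msaa / 1e9:.2f} GB/s")
```

Under these assumptions the traffic scales linearly with sample count and depth complexity, so antialiasing and heavy overdraw multiply a load that would otherwise compete with texture and geometry reads.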

Software

By adopting SMP and SMT, we’re using standard parallel models, which keep things simple. Also, the unified memory architecture allows flexible use of memory resources.

Our operating system opens all three cores to game developers to program as they wish. For this, we provide standard APIs including Win32 and OpenMP, as well as D3D and HLSL. Developers can also bypass these and write their own CPU assembly and shader microcode, referred to in the game industry as “to the metal” programming.
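The standard parallel model mentioned above amounts to splitting a frame’s data-parallel work across the available hardware threads, much as an OpenMP parallel loop would. The sketch below shows only the decomposition. Python and its thread pool stand in for the C++ and OpenMP code a game would actually use, Python threads do not run this arithmetic in parallel, and the particle workload and the names `integrate` and `update_particles` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

HW_THREADS = 3 * 2   # three cores, two SMT hardware threads each


def integrate(chunk, dt):
    """Advance (position, velocity) pairs by one time step."""
    return [(p + v * dt, v) for p, v in chunk]


def update_particles(particles, dt, workers=HW_THREADS):
    """Split the particle list into contiguous chunks, one per worker."""
    size = max(1, -(-len(particles) // workers))   # ceiling division
    chunks = [particles[i:i + size] for i in range(0, len(particles), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(integrate, chunks, [dt] * len(chunks))
    # pool.map preserves chunk order, so the output matches the input order.
    return [p for chunk in results for p in chunk]


particles = [(float(i), 1.0) for i in range(12)]
print(update_particles(particles, dt=0.5)[:3])
```

Keeping the worker count equal to the number of hardware threads mirrors the design goal stated earlier of a small, fixed thread budget that developers can reason about.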

We provide standard tools including the XNA-based tools Performance Investigator (PIX) and Xbox Audio Creation Tool (XACT). XNA is Microsoft’s game development platform, which developers of PC and Xbox 360 games (as well as other platforms) can use to minimize cross-platform development costs.¹¹ PIX is the graphics profiler and debugger.¹² It uses performance counters embedded in the CPU and GPU and architectural simulators to provide performance feedback.

The Xbox 360 development environment is familiar to most programmers. Figure 9 shows a screen shot from the XNA Studio Integrated Development Environment (IDE), a version of Visual Studio with additional features for game developer teams. Programmers use the IDE for building projects and debugging, including debugging of multiple threads. When stepping through source code, programmers find the instruction set’s low-level details completely hidden, but when they open the disassembly window, they can see that PowerPC code is running.

Other powerful tools that help Xbox 360 developers maximize productivity and performance include CPU profilers, the Visual C++ 8.0 compiler, and audio libraries. These tools and libraries let programmers quickly exploit the power of Xbox 360 chips and then help them code to the metal when necessary.

Figure 9. Multithreaded debugging in the Xbox 360 development environment.

Xbox 360 was launched to customers in the US, Canada, and Puerto Rico on 22 November 2005. Between then and the end of 2005, it launched in Europe and Japan. During the first quarter of 2006, Xbox 360 has been launched in Central and South America, Southeast Asia, Australia, and New Zealand.

Xbox 360 implemented a number of firsts and/or raised the bar for performance for users of PC and game console machines for gaming. These include

• the first CMP implementation with more than 2 cores (3);

• the highest frequency and bandwidth CPU frontside bus (5.4 Gbps and 21.6 GB/s);

• the first CMP chip with shared L2;

• the first game console with SMT;

• the first game console with GPU-unified shader architecture; and

• the first game console with MCM GPU/EDRAM die implementation.
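The two frontside-bus figures in the list are related by simple arithmetic. The check below assumes 16 logical data pins in each direction, a width this section does not state, and assumes the 21.6 GB/s figure is the sum of read and write bandwidth.

```python
GBIT_PER_PIN = 5.4        # per-pin signaling rate quoted in the list
PINS_PER_DIRECTION = 16   # assumption: 16 logical data pins each way

per_direction_gbytes = GBIT_PER_PIN * PINS_PER_DIRECTION / 8  # bits -> bytes
total_gbytes = 2 * per_direction_gbytes                       # read + write

print(f"{per_direction_gbytes:.1f} GB/s per direction, "
      f"{total_gbytes:.1f} GB/s total")
```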

The Xbox 360 core chipset contains approximately 500 M transistors. It is the most complex, highest performance consumer electronic product shipping today and presents a large discrete jump in 3D graphics and gaming performance.

References

1. PowerPC Microprocessor Family: AltiVec Technology Programming Environments Manual, version 2.0, IBM Corp., 2003.

2. E. Silha et al., PowerPC User Instruction Set Architecture, Book I, version 2.02, IBM Corp., 2005.

3. E. Silha et al., PowerPC Virtual Environment Architecture, Book II, version 2.02, IBM Corp., 2005.

4. E. Silha et al., PowerPC Operating Environment Architecture, Book III, version 2.02, IBM Corp., 2005.

5. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., Morgan Kaufmann, 2002.

6. K. Gray, The Microsoft DirectX 9 Programmable Graphics Pipeline, Microsoft Press, 2003.

7. F. Luna, Introduction to 3D Game Programming with DirectX 9.0, 1st ed., Wordware, 2003.

8. MSDN DX9 SDK Documentation, Direct3D Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/dx9_graphics.asp.

9. S. St.-Laurent, The Complete Effect and HLSL Guide, Paradoxal Press, 2005.

10. MSDN DX9 SDK Documentation, HLSL Shaders Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/HLSL_Workshop.asp.

11. Microsoft XNA, 2006, http://www.microsoft.com/xna.

12. MSDN DX9 SDK Documentation, PIX Overview, 2005, http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/PIX.asp.

Jeff Andrews is a CPU architect and project leader in Microsoft’s Xbox Console Architecture Group, focusing on the Xbox 360 launch CPU. Andrews has a BS in computer engineering from the University of Illinois Urbana-Champaign.

Nick Baker is the director of Xbox console architecture at Microsoft. His responsibilities include managing the Xbox Console Architecture, System Verification, and Test Software teams. Baker has an MS in electrical engineering from Imperial College London.

Direct questions and comments about this article to Jeff Andrews, Microsoft, 1065 La Avenida St., Mountain View, CA 94043; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.

