cell broadband processor daniel bagley meng tan. agenda general intro history of development ...

Cell Broadband Processor

Daniel Bagley

Meng Tan

Agenda

General Intro
History of development
Technical overview of architecture
Detailed technical discussion of components
Design choices
Other processors like the cell
Programming for the cell

History of Development

Sony Playstation2
• Announced March 1999
• Released March 2000 in Japan
• 128-bit “Emotion Engine”
• 294 MHz MIPS CPU
• Single Precision FP Optimizations
• 6.2 GFLOPS

History Continued

Partnership between Sony, Toshiba, IBM
Summer of 2000 – High level development talks
Initial goal of 1000x PS2 Power
March 2001, Sony-IBM-Toshiba design center opened
$400m investment.

Overall Goals for Cell

High performance in multimedia apps
Real time performance
Power consumption
Cost
Available by 2005
Avoid memory latency issues associated with control structures

The Cell itself

Power PC based main core (PPE)
Multiple SPEs
On die memory controller
Inter-core transport bus
High speed IO

Cell Die Layout

Cell Implementation

Cell is an architecture
Preliminary PS3 Implementation
• 1 PPE
• 7 SPEs (1 of the 8 on die disabled for yield increase)
• 221 mm² die size on a 90 nm process
• Clocked at 3-4 GHz
• 256 GFLOPS Single Precision @ 4 GHz

Why a Cell Architecture

Follows a trend in computing architecture
Natural extension of dual and multi-core
Extremely low hardware overhead
Software controllable
Specialized hardware more useful for multimedia

Possible Uses

Playstation3 (Obviously)
Blade servers (IBM)
• Amazing single precision FP performance
• Scientific applications
Toshiba HDTV products

Power Processing Element

PowerPC instruction set with AltiVec
Used for general purpose computing and controlling SPEs
Simultaneous Multithreading
Separate 32 KB L1 Caches and unified 512 KB L2 Cache

PPE (cont.)

Slow but power efficient PowerPC instruction set implementation
Two issue in-order instruction fetch
Conspicuous lack of instruction window
Compare to conventional PowerPC implementations (G5)
Performance depends on SPE utilization

Synergistic Processing Element (SPE)

Specialized hardware
Meant to be used in parallel
• (7 on PS3 implementation)
On chip memory (256 KB)
No branch prediction
In-order execution
Dual issue

SPE Architecture

0.99 µm² on 90 nm process
128 registers (128 bits wide)
• Instructions assumed to be 4x 32bit
Variant of VMX instruction set
• Modified for 128 registers
On chip memory is NOT a cache

SPE Execution

Dual issue, in-order
Seven execution units
Vector logic
8 single precision operations per cycle
Significant performance hit for double precision
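
The 256 GFLOPS figure quoted for a 4 GHz part is consistent with 8 single precision operations per cycle per SPE, provided all eight SPEs on the die are counted and the PPE is ignored. Both of those are assumptions made for this sketch, and `peak_gflops` is a hypothetical helper name.

```python
# Peak single-precision throughput from the slide figures.
# Assumption: only the SPEs are counted, and the headline number uses 8 SPEs.

def peak_gflops(num_spes: int, sp_ops_per_cycle: int, clock_ghz: float) -> float:
    """Peak GFLOPS = SPEs x single-precision ops per cycle x clock in GHz."""
    return num_spes * sp_ops_per_cycle * clock_ghz


full_die = peak_gflops(num_spes=8, sp_ops_per_cycle=8, clock_ghz=4.0)
seven_spes = peak_gflops(num_spes=7, sp_ops_per_cycle=8, clock_ghz=4.0)

print(full_die)    # 256.0
print(seven_spes)  # 224.0
```

With one SPE disabled, the same arithmetic gives a lower peak for a seven-SPE configuration.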

SPE Execution Diagram

SPE Local Storage Area

NOT a cache
256 KB, 4 x 64 KB ECC single port SRAM
Completely private to each SPE
Directly addressable by software
Can be used as a cache, but only with software controls
No tag bits, or any extra hardware
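
Because the local store has no tag bits, any cache behaviour has to be built in software. The following is a minimal sketch of a direct-mapped software cache, assuming a 128-byte line and 64 lines. The class name, sizes and the slice used as a stand-in for a DMA transfer are all hypothetical and are not SPE APIs.

```python
# Hypothetical software-managed cache held in a local store.
# Line size and line count are illustrative assumptions.

LINE_SIZE = 128   # bytes per cache line (assumed)
NUM_LINES = 64    # lines reserved in the local store (assumed)


class SoftwareCache:
    def __init__(self, main_memory: bytes):
        self.main_memory = main_memory
        # Tags are kept in ordinary software-visible storage,
        # since the hardware provides none.
        self.tags = [None] * NUM_LINES
        self.lines = [b""] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read_byte(self, address: int) -> int:
        line_number = address // LINE_SIZE
        slot = line_number % NUM_LINES        # direct-mapped placement
        if self.tags[slot] != line_number:    # tag check done in software
            self.misses += 1
            start = line_number * LINE_SIZE
            # Stand-in for a DMA transfer from main memory to the local store.
            self.lines[slot] = self.main_memory[start:start + LINE_SIZE]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % LINE_SIZE]


memory = bytes(range(256)) * 64   # 16 KB of sample data
cache = SoftwareCache(memory)
values = [cache.read_byte(a) for a in (0, 1, 127, 128, 0)]
print(values, cache.hits, cache.misses)  # [0, 1, 127, 128, 0] 3 2
```

Every lookup pays for the tag comparison in instructions, which is the cost of leaving the extra hardware out.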

SPE LS Scheduling

Software controlled DMA
DMA to and from main memory
Scheduling a HUGE problem
• Done primarily in software
• IBM predicts 80-90% usage ideally
Request queue handles 16 simultaneous requests
• Up to 16 KB transfer each
• Priority: DMA, L/S, Fetch
Fetch / execute parallelism
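
One common way to get fetch / execute parallelism out of software controlled DMA is double buffering: the next chunk is requested while the current one is processed. The sketch below only simulates that ordering in sequential Python. The 16 KB limit and the 16-entry queue come from the slide, while the function names and the slice used in place of a DMA transfer are hypothetical.

```python
# Splitting a transfer into DMA-sized requests and double buffering them.

MAX_DMA_BYTES = 16 * 1024   # largest single transfer (from the slide)
QUEUE_DEPTH = 16            # simultaneous requests (from the slide)


def plan_dma_requests(total_bytes: int) -> list:
    """Return (offset, size) pairs, each no larger than MAX_DMA_BYTES."""
    requests = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        requests.append((offset, size))
        offset += size
    return requests


def queue_fills(num_requests: int) -> int:
    """How many times the request queue must be refilled (ceiling division)."""
    return -(-num_requests // QUEUE_DEPTH)


def process_double_buffered(data: bytes, compute) -> list:
    """Fetch chunk i+1 before computing on chunk i (sequential simulation)."""
    requests = plan_dma_requests(len(data))
    results = []
    if not requests:
        return results
    offset, size = requests[0]
    next_buffer = data[offset:offset + size]       # first "DMA get"
    for index in range(len(requests)):
        current_buffer = next_buffer
        if index + 1 < len(requests):
            # On real hardware this request would be issued here and
            # would overlap with the compute step below.
            offset, size = requests[index + 1]
            next_buffer = data[offset:offset + size]
        results.append(compute(current_buffer))
    return results


requests = plan_dma_requests(40 * 1024)
chunk_sums = process_double_buffered(bytes([1]) * (40 * 1024), sum)
print(requests)     # [(0, 16384), (16384, 16384), (32768, 8192)]
print(chunk_sums)   # [16384, 16384, 8192]
```

Two buffers in the local store are enough for this pattern, at the price of halving the space available to each chunk.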

SPE Control Logic

Very little in comparison
Represents shift in focus
Complete lack of branch prediction
• Software branch prediction
• Loop unrolling
• 18 cycle penalty
Software controlled DMA
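
A rough cost model shows why loop unrolling matters when a branch costs 18 cycles. It assumes that every taken loop back-edge pays the full penalty and that each original iteration costs a fixed number of cycles. The 4-cycle body and the helper name `loop_cycles` are hypothetical.

```python
# Estimated loop cost with and without unrolling.
# Assumption: each taken back-edge costs the full penalty, with no hinting.

BRANCH_PENALTY_CYCLES = 18  # from the slide


def loop_cycles(iterations: int, body_cycles: int, unroll: int) -> int:
    """Estimated cycles for a loop whose body is unrolled `unroll` times."""
    trips = -(-iterations // unroll)     # ceiling division
    # The final trip falls through, so it pays no taken-branch penalty.
    taken_branches = max(trips - 1, 0)
    return iterations * body_cycles + taken_branches * BRANCH_PENALTY_CYCLES


rolled = loop_cycles(iterations=1024, body_cycles=4, unroll=1)
unrolled = loop_cycles(iterations=1024, body_cycles=4, unroll=8)
print(rolled, unrolled)  # 22510 6382
```

Under these assumptions the branch penalty dominates the rolled loop, and unrolling by 8 removes most of it.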

SPE Pipeline

Little ILP, and thus little control logic
Dual issue
Simple commit unit (no reorder buffer or other complexities)
Same execution unit for FP/int

SPE Summary

Essentially small vector computer
Based on Altivec/VMX ISA
• Extensions for DMA and LS management
• Extended for 128x 128bit registerfile
Uniquely suited for real time applications
Extremely fast for certain FP operations
Offload a large amount on to compiler / software.

Element Interconnect Bus

4 concentric rings connecting all Cell elements
128-bit wide interconnects

EIB (cont.)

Designed to minimize coupling noise
Rings of data traveling in alternating directions
Buffers and repeaters at each SPE boundary
Architecture can be scaled up with increased bus latency

EIB (cont.)

Total bandwidth at ~200 GB/s
EIB controller located physically in center of chip between SPEs
Controller reserves channels for each individual data transfer request
Implementation allows for SPE extension horizontally
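
With rings running in both directions, a transfer can take whichever way round is shorter. The sketch below illustrates that idea only. It is not a description of the actual EIB controller, the element count of 12 is an assumption rather than a figure from these slides, and `choose_direction` is a hypothetical name.

```python
# Hypothetical model: elements sit at positions 0..n-1 on a ring and a
# transfer may travel clockwise or counter-clockwise.

def choose_direction(source: int, destination: int, num_elements: int = 12):
    """Return (direction, hops) for the shorter way round the ring."""
    clockwise_hops = (destination - source) % num_elements
    counter_hops = (source - destination) % num_elements
    if clockwise_hops <= counter_hops:
        return "clockwise", clockwise_hops
    return "counter-clockwise", counter_hops


print(choose_direction(0, 3))    # ('clockwise', 3)
print(choose_direction(1, 10))   # ('counter-clockwise', 3)
```

In this model no transfer needs more than half the ring, which is one reason adding elements raises latency only gradually.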

Memory Interface

Rambus XDR memory to keep Cell at full utilization
3.2 Gbps data bandwidth per pin of each device connected to XDR interface
Cell uses dual channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
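
The total memory bandwidth can be checked with a few lines of arithmetic. This sketch reads the 3.2 Gbps figure as a per-pin signalling rate, which is an assumption, and the helper name is hypothetical. Under that reading, four 16-bit devices work out to 25.6 GB/s.

```python
# Aggregate XDR bandwidth from per-pin rate, bus width and device count.
# Assumption: 3.2 Gbps is the rate of each data pin.

GBPS_PER_PIN = 3.2     # gigabits per second per data pin
BITS_PER_DEVICE = 16   # data bus width of each device
NUM_DEVICES = 4        # two channels, two devices each


def xdr_bandwidth_gbytes(gbps_per_pin: float, bits_per_device: int,
                         num_devices: int) -> float:
    """Total bandwidth in gigabytes per second (8 bits per byte)."""
    return gbps_per_pin * bits_per_device * num_devices / 8


per_device = xdr_bandwidth_gbytes(GBPS_PER_PIN, BITS_PER_DEVICE, 1)
total = xdr_bandwidth_gbytes(GBPS_PER_PIN, BITS_PER_DEVICE, NUM_DEVICES)
print(round(per_device, 1), round(total, 1))  # 6.4 25.6
```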

Input / Output Bus

Rambus FlexIO Bus
IO interface consists of 12 unidirectional byte lanes
Each lane supports 6.4 GB/s bandwidth
7 outbound lanes and 5 inbound lanes
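
Multiplying the per-lane figure by the lane counts gives the aggregate I/O bandwidth in each direction. The sketch takes the 6.4 GB/s per-lane number at face value as a peak rate, and the helper name is hypothetical.

```python
# Aggregate FlexIO bandwidth from the lane counts on the slide.

GBYTES_PER_LANE = 6.4   # peak per lane, from the slide
OUTBOUND_LANES = 7
INBOUND_LANES = 5


def lane_bandwidth(lanes: int, gbytes_per_lane: float = GBYTES_PER_LANE) -> float:
    """Peak bandwidth of a group of lanes, rounded to one decimal place."""
    return round(lanes * gbytes_per_lane, 1)


outbound = lane_bandwidth(OUTBOUND_LANES)
inbound = lane_bandwidth(INBOUND_LANES)
total = lane_bandwidth(OUTBOUND_LANES + INBOUND_LANES)
print(outbound, inbound, total)  # 44.8 32.0 76.8
```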

Design Choices

In-order execution
• Abandoning ILP
• ILP – 10-20% increase per generation
• Reducing control logic
• Real time responsiveness
Cache Design
• Software configuration on SPE
• Standard L2 cache on PPE

Cell Programming Issues

No Cell compiler in existence to manage utilization of SPEs at compile time
SPEs do not natively support context switching. Must be OS managed.
SPEs are vector processors. Not efficient for general-purpose computation.
PPEs and SPEs use different instruction sets.

Cell Programming (cont.)

Functional Offload Model
Simplest model for Cell programming
Optimize existing libraries for SPE computation
Requires no rebuild of main application logic which runs on PPE
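
The functional offload idea can be sketched as a library routine that keeps its signature while farming the work out internally. A thread pool stands in for the SPEs purely so the example runs anywhere. The function names and the chunk size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a pool of SPEs.
_worker_pool = ThreadPoolExecutor(max_workers=4)


def _dot_chunk(pair):
    """Partial dot product computed by one worker."""
    xs, ys = pair
    return sum(x * y for x, y in zip(xs, ys))


def dot(xs, ys, chunk=4):
    """Library routine with an unchanged interface, offloaded internally."""
    pieces = [(xs[i:i + chunk], ys[i:i + chunk])
              for i in range(0, len(xs), chunk)]
    return sum(_worker_pool.map(_dot_chunk, pieces))


# Application code calls dot() as before and never sees the workers.
result = dot([1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1])
print(result)  # 56
```

Only the library changes. The caller on the main core is left untouched, which is why this is the simplest model to adopt.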


Device Extension Model
Take advantage of SPE DMA
Use SPEs as interfaces to external devices


Computational Acceleration Model
Traditional super-computing methods using Cell
Shared memory or message passing paradigm for accelerating inherently parallel math operations
Can replace intensive math libraries without rewriting applications


Streaming model
Use Cell processor as one large programmable pipeline
Partition algorithms into logically sensible steps. Execute each separately, in serial, on separate processors.
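
A streaming pipeline of this kind can be sketched with one worker per step, connected by queues. Threads stand in for the separate processors so the example is runnable, and `stage`, `run_pipeline` and the three sample steps are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(function, inbox, outbox):
    """Run one pipeline step: read items, transform them, pass them on."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(function(item))


def run_pipeline(functions, items):
    """Chain the steps with queues, one worker thread per step."""
    queues = [queue.Queue() for _ in range(len(functions) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(functions)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
output = run_pipeline(steps, [1, 2, 3])
print(output)  # [1, 3, 5]
```

Each step sees the items in order, so throughput is bounded by the slowest step. That is why the partitioning into steps matters.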


Asymmetric Thread Runtime Model
Abstract Cell architecture away from programmer.
Use the OS to schedule a different thread onto each processor.

Sample Performance

Demonstration physics engine for real-time game
http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
182 Compute to DMA ratio on SPEs
For the right tasks, Cell architecture can be extremely efficient.
