Cell Broadband Processor
Daniel Bagley
Meng Tan
Agenda
• General intro
• History of development
• Technical overview of architecture
• Detailed technical discussion of components
• Design choices
• Other processors like the Cell
• Programming for the Cell
History of Development
• Sony PlayStation 2
    • Announced March 1999
    • Released March 2000 in Japan
    • 128-bit “Emotion Engine”
    • 294 MHz MIPS CPU
    • Single-precision FP optimizations
    • 6.2 GFLOPS
History Continued
• Partnership between Sony, Toshiba, and IBM
• Summer of 2000: high-level development talks
• Initial goal of 1000x PS2 power
• March 2001: Sony-IBM-Toshiba design center opened
• $400M investment
Overall Goals for Cell
• High performance in multimedia apps
• Real-time performance
• Power consumption
• Cost
• Available by 2005
• Avoid memory latency issues associated with control structures
The Cell Itself
• PowerPC-based main core (PPE)
• Multiple SPEs
• On-die memory controller
• Inter-core transport bus
• High-speed I/O
Cell Die Layout
Cell Implementation
• Cell is an architecture
• Preliminary PS3 implementation
    • 1 PPE
    • 7 usable SPEs (8 on die, 1 disabled to increase yield)
    • 221 mm² die size on a 90 nm process
    • Clocked at 3-4 GHz
    • 256 GFLOPS single precision @ 4 GHz
Why a Cell Architecture
• Follows a trend in computing architecture
• Natural extension of dual- and multi-core
• Extremely low hardware overhead
• Software controllable
• Specialized hardware more useful for multimedia
Possible Uses
• PlayStation 3 (obviously)
• Blade servers (IBM)
    • Amazing single-precision FP performance
    • Scientific applications
• Toshiba HDTV products
Power Processing Element
• PowerPC instruction set with AltiVec
• Used for general-purpose computing and controlling SPEs
• Simultaneous multithreading
• Separate 32 KB L1 caches and unified 512 KB L2 cache
PPE (cont.)
• Slow but power-efficient PowerPC instruction set implementation
• Two-issue, in-order instruction fetch
• Conspicuous lack of instruction window
• Compare to conventional PowerPC implementations (G5)
• Performance depends on SPE utilization
Synergistic Processing Element (SPE)
• Specialized hardware
• Meant to be used in parallel
    • (7 on PS3 implementation)
• On-chip memory (256 KB)
• No branch prediction
• In-order execution
• Dual issue
SPE Architecture
• 0.99 µm² on 90 nm process
• 128 registers (128 bits wide)
    • Instructions assumed to be 4 x 32-bit
• Variant of VMX instruction set
    • Modified for 128 registers
• On-chip memory is NOT a cache
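To make the "4 x 32-bit" register layout concrete, here is a minimal Python sketch that packs four single-precision values into a 16-byte (128-bit) register image and adds two such registers lane by lane. The helper names `pack_register` and `simd_add` are hypothetical and only illustrate the data layout; they are not SPE instructions.

```python
import struct

LANES = 4  # four 32-bit lanes fill one 128-bit register

def pack_register(lanes):
    """Pack four 32-bit floats into a 16-byte (128-bit) register image."""
    if len(lanes) != LANES:
        raise ValueError("a 128-bit register holds exactly four 32-bit lanes")
    # Big-endian byte order is used here purely for illustration.
    return struct.pack(">4f", *lanes)

def simd_add(reg_a, reg_b):
    """Lane-wise add of two packed registers (one result per lane)."""
    a = struct.unpack(">4f", reg_a)
    b = struct.unpack(">4f", reg_b)
    return struct.pack(">4f", *(x + y for x, y in zip(a, b)))

ra = pack_register([1.0, 2.0, 3.0, 4.0])
rb = pack_register([0.5, 0.5, 0.5, 0.5])
result = struct.unpack(">4f", simd_add(ra, rb))
```

A single vector operation touches all four lanes at once, which is why scalar, branch-heavy code gets little benefit from this register file.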
SPE Execution
• Dual issue, in-order
• Seven execution units
• Vector logic
• 8 single-precision operations per cycle
• Significant performance hit for double precision
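The 8 operations per cycle figure can be tied back to the 256 GFLOPS quoted for a 4 GHz part with a short calculation. The sketch below assumes the quoted peak counts eight SPEs and ignores any PPE contribution; both are assumptions, since the slides list seven SPEs for the PS3.

```python
def peak_gflops(num_spes, ops_per_cycle, clock_ghz):
    """Peak single-precision throughput in GFLOPS under ideal utilization."""
    return num_spes * ops_per_cycle * clock_ghz

# Assumption: the headline number counts all eight SPEs on the die.
all_spes = peak_gflops(num_spes=8, ops_per_cycle=8, clock_ghz=4.0)

# With the seven SPEs listed for the PS3 implementation.
ps3_spes = peak_gflops(num_spes=7, ops_per_cycle=8, clock_ghz=4.0)
```

Under these assumptions `all_spes` matches the 256 GFLOPS figure, while seven SPEs would give 224 GFLOPS at the same clock.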
SPE Execution Diagram
SPE Local Storage Area
• NOT a cache
• 256 KB, 4 x 64 KB ECC single-port SRAM
• Completely private to each SPE
• Directly addressable by software
• Can be used as a cache, but only with software controls
• No tag bits, or any extra hardware
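Since the local store has no tag bits, any cache behavior has to be built in software. The following is a minimal sketch of a direct-mapped software cache, written in Python as a model rather than SPE code. The line size, line count, and the `SoftwareCache` class are hypothetical choices, and the slice copy stands in for a DMA transfer.

```python
LINE_SIZE = 128   # bytes per cached line (hypothetical choice)
NUM_LINES = 16    # lines reserved in the local store (hypothetical choice)

class SoftwareCache:
    """Direct-mapped cache whose tags are ordinary software data."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.local_store = bytearray(LINE_SIZE * NUM_LINES)
        self.tags = [None] * NUM_LINES  # maintained by software, not hardware
        self.misses = 0

    def read_byte(self, address):
        line_addr = address // LINE_SIZE
        slot = line_addr % NUM_LINES
        if self.tags[slot] != line_addr:
            # Miss: copy the whole line in (stands in for a DMA transfer).
            start = line_addr * LINE_SIZE
            line = self.main_memory[start:start + LINE_SIZE]
            base = slot * LINE_SIZE
            self.local_store[base:base + len(line)] = line
            self.tags[slot] = line_addr
            self.misses += 1
        return self.local_store[slot * LINE_SIZE + address % LINE_SIZE]

main_memory = bytes(range(256)) * 16   # 4096 bytes of sample data
cache = SoftwareCache(main_memory)
first = cache.read_byte(0)       # miss, loads line 0
second = cache.read_byte(1)      # hit in the same line
evicting = cache.read_byte(2048) # maps to slot 0 and evicts line 0
```

Every lookup costs instructions that a hardware cache would hide, which is part of why the slides treat the local store as explicitly managed memory first and a cache only by choice.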
SPE LS Scheduling
• Software-controlled DMA
• DMA to and from main memory
• Scheduling a HUGE problem
    • Done primarily in software
    • IBM predicts 80-90% usage ideally
• Request queue handles 16 simultaneous requests
    • Up to 16 KB transfer each
    • Priority: DMA, L/S, Fetch
• Fetch / execute parallelism
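One common way to get fetch / execute parallelism out of software-controlled DMA is double buffering: while one buffer is being processed, the next chunk is fetched into the other. The sketch below models that schedule in Python. `dma_get` and `process_stream` are hypothetical stand-ins, and the copy here is synchronous, whereas a real DMA engine would run it in the background.

```python
MAX_DMA_BYTES = 16 * 1024  # per-transfer limit quoted on the slide

def dma_get(main_memory, offset, size):
    """Stand-in for a DMA read from main memory into the local store."""
    return main_memory[offset:offset + size]

def process_stream(main_memory, compute, chunk=MAX_DMA_BYTES):
    """Process main memory chunk by chunk using two alternating buffers."""
    if not 0 < chunk <= MAX_DMA_BYTES:
        raise ValueError("chunk must fit in a single DMA transfer")
    offsets = range(0, len(main_memory), chunk)
    results = []
    if len(offsets) == 0:
        return results
    buffers = [None, None]
    buffers[0] = dma_get(main_memory, offsets[0], chunk)  # prime buffer 0
    for i in range(len(offsets)):
        current = i % 2
        if i + 1 < len(offsets):
            # Issue the fetch for the next chunk into the idle buffer
            # before computing on the current one.
            buffers[1 - current] = dma_get(main_memory, offsets[i + 1], chunk)
        results.append(compute(buffers[current]))
    return results

data = bytes([1]) * (40 * 1024)      # 40 KB of sample data
sums = process_stream(data, sum)     # one result per chunk
```

The design choice is to spend twice the buffer space in the 256 KB local store in exchange for hiding transfer latency behind computation.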
SPE Control Logic
• Very little in comparison
• Represents shift in focus
• Complete lack of branch prediction
    • Software branch prediction
    • Loop unrolling
    • 18-cycle penalty
• Software-controlled DMA
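A rough cost model shows why loop unrolling matters when a branch costs 18 cycles. The sketch below is a deliberate simplification: it assumes every taken loop back-edge pays the full penalty, that no software branch hints are used, and that the loop body costs the same per iteration whether or not it is unrolled. The function `loop_cycles` and the sample numbers are hypothetical.

```python
BRANCH_PENALTY_CYCLES = 18  # penalty quoted on the slide

def loop_cycles(iterations, body_cycles, unroll=1):
    """Estimated cycles for a loop under the simplified penalty model."""
    if unroll < 1 or iterations % unroll != 0:
        raise ValueError("iterations must be a multiple of the unroll factor")
    trips = iterations // unroll
    # The back-edge is taken on every trip except the last one.
    taken_branches = max(trips - 1, 0)
    return iterations * body_cycles + taken_branches * BRANCH_PENALTY_CYCLES

rolled = loop_cycles(iterations=64, body_cycles=4)
unrolled = loop_cycles(iterations=64, body_cycles=4, unroll=8)
```

In this model the work stays at 256 cycles either way, but unrolling by 8 cuts the branch overhead from 63 penalties to 7.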
SPE Pipeline
• Little ILP, and thus little control logic
• Dual issue
• Simple commit unit (no reorder buffer or other complexities)
• Same execution unit for FP/int
SPE Summary
• Essentially a small vector computer
• Based on AltiVec/VMX ISA
    • Extensions for DMA and LS management
    • Extended for 128 x 128-bit register file
• Uniquely suited for real-time applications
• Extremely fast for certain FP operations
• Offloads a large amount onto compiler / software
Element Interconnect Bus
• 4 concentric rings connecting all Cell elements
• 128-bit wide interconnects
EIB (cont.)
• Designed to minimize coupling noise
• Rings of data traveling in alternating directions
• Buffers and repeaters at each SPE boundary
• Architecture can be scaled up with increased bus latency
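Having rings that run in both directions means a transfer can take whichever way around is shorter, and adding elements lengthens the worst-case path. The sketch below illustrates one plausible policy in Python; the slides do not describe the actual arbitration, and the element count of 12 and the function `pick_direction` are assumptions made for illustration.

```python
NUM_ELEMENTS = 12  # hypothetical number of stops on the ring

def pick_direction(src, dst, num_elements=NUM_ELEMENTS):
    """Return (direction, hops) for the shorter way around the ring."""
    clockwise = (dst - src) % num_elements
    counterclockwise = (src - dst) % num_elements
    if clockwise <= counterclockwise:
        return ("clockwise", clockwise)
    return ("counterclockwise", counterclockwise)

near = pick_direction(0, 3)       # 3 hops one way, 9 the other
far = pick_direction(0, 10)       # shorter going the other way
halfway = pick_direction(2, 8)    # tie, resolved as clockwise here
worst_case_small = max(pick_direction(0, d, 12)[1] for d in range(12))
worst_case_large = max(pick_direction(0, d, 24)[1] for d in range(24))
```

The worst-case hop count grows with the number of elements, which matches the note that scaling up the architecture increases bus latency.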
EIB (cont.)
• Total bandwidth at ~200 GB/s
• EIB controller located physically in center of chip between SPEs
• Controller reserves channels for each individual data transfer request
• Implementation allows for SPE extension horizontally
Memory Interface
• Rambus XDR memory to keep Cell at full utilization
• 3.2 Gbps data bandwidth per pin on each device connected to the XDR interface
• Cell uses dual-channel XDR with four devices and 16-bit wide buses to achieve 25.6 GB/s total memory bandwidth
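The total memory bandwidth follows from the per-pin rate, the bus width, and the device count. The sketch below works through that arithmetic in Python, reading the 3.2 Gbps figure as a per-pin data rate (an assumption about how the slide should be interpreted); the function name is hypothetical.

```python
def xdr_bandwidth_gbytes(gbps_per_pin, bus_width_bits, devices):
    """Aggregate bandwidth in GB/s: bits per second across all pins, over 8."""
    return gbps_per_pin * bus_width_bits * devices / 8

per_device = xdr_bandwidth_gbytes(3.2, 16, 1)   # one 16-bit device
total = xdr_bandwidth_gbytes(3.2, 16, 4)        # four devices
```

Under that reading each device contributes 6.4 GB/s and four devices give 25.6 GB/s.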
Input / Output Bus
• Rambus FlexIO bus
• I/O interface consists of 12 unidirectional byte lanes
• Each lane supports 6.4 GB/s bandwidth
• 7 outbound lanes and 5 inbound lanes
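The lane counts and per-lane rate above imply aggregate I/O figures that can be worked out directly. This is a small Python sketch of that arithmetic, with constant names chosen for illustration.

```python
LANE_GBYTES_PER_S = 6.4   # per-lane bandwidth from the slide
OUTBOUND_LANES = 7
INBOUND_LANES = 5

outbound = OUTBOUND_LANES * LANE_GBYTES_PER_S   # aggregate outbound GB/s
inbound = INBOUND_LANES * LANE_GBYTES_PER_S     # aggregate inbound GB/s
total_io = outbound + inbound                   # all 12 lanes together
```

That works out to roughly 44.8 GB/s outbound and 32 GB/s inbound, or about 76.8 GB/s across all 12 lanes.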
Design Choices
• In-order execution
    • Abandoning ILP
    • ILP: 10-20% increase per generation
    • Reducing control logic
    • Real-time responsiveness
• Cache design
    • Software configuration on SPE
    • Standard L2 cache on PPE
Cell Programming Issues
• No Cell compiler in existence to manage utilization of SPEs at compile time
• SPEs do not natively support context switching; it must be OS managed.
• SPEs are vector processors and are not efficient for general-purpose computation.
• PPEs and SPEs use different instruction sets.
Cell Programming (cont.)
• Functional Offload Model
• Simplest model for Cell programming
• Optimize existing libraries for SPE computation
• Requires no rebuild of main application logic, which runs on PPE
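The idea of the functional offload model is that a library routine keeps its interface while its internals farm the work out. The sketch below shows that shape in Python, with a thread pool standing in for the SPEs. The routine `dot`, the helper `_dot_chunk`, and the worker count are hypothetical, and real SPE offload would involve separate SPE binaries and DMA rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 7  # stand-in for the SPEs available to the library

def _dot_chunk(pair):
    """Work done by one worker: partial dot product of two slices."""
    a, b = pair
    return sum(x * y for x, y in zip(a, b))

def dot(a, b, workers=NUM_WORKERS):
    """Library routine; callers see the same interface as a serial version."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    step = max(1, -(-len(a) // workers))  # ceiling division
    chunks = [(a[i:i + step], b[i:i + step]) for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(_dot_chunk, chunks))

# Application code on the "PPE" side is unchanged by the offload.
value = dot(list(range(100)), [2] * 100)
```

Because only the library body changes, the main application logic does not need to be rebuilt around the accelerators.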
Cell Programming (cont.)
• Device Extension Model
• Take advantage of SPE DMA
• Use SPEs as interfaces to external devices
Cell Programming (cont.)
• Computational Acceleration Model
• Traditional supercomputing methods using Cell
• Shared memory or message passing paradigm for accelerating inherently parallel math operations
• Can replace intensive math libraries without rewriting applications
Cell Programming (cont.)
• Streaming Model
• Use Cell processor as one large programmable pipeline
• Partition algorithms into logically sensible steps. Execute each separately, in serial, on separate processors.
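The streaming model can be pictured as a chain of stages, each consuming the previous stage's output. Here is a minimal Python sketch using generators. The three stages (`decode`, `scale`, `clamp`) and the sample data are hypothetical, and on Cell each stage would run on its own SPE rather than interleaving on one processor as generators do.

```python
def decode(samples):
    """Stage 1: convert unsigned 8-bit samples to signed values."""
    for s in samples:
        yield s - 128

def scale(samples, gain):
    """Stage 2: apply a gain to each sample."""
    for s in samples:
        yield s * gain

def clamp(samples, lo, hi):
    """Stage 3: limit each sample to the range [lo, hi]."""
    for s in samples:
        yield max(lo, min(hi, s))

raw = [0, 64, 128, 192, 255]
pipeline = clamp(scale(decode(raw), 2), -200, 200)
output = list(pipeline)
```

Each stage only needs its input stream and its own parameters, which is what makes the partitioning across separate processors straightforward.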
Cell Programming (cont.)
• Asymmetric Thread Runtime Model
• Abstract Cell architecture away from programmer.
• Use the OS to schedule a different thread on each processor.
Sample Performance
• Demonstration physics engine for real-time game
• http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
• 182 compute-to-DMA ratio on SPEs
• For the right tasks, Cell architecture can be extremely efficient.