Kevin Eady Ben Plunkett Prateeksha Satyamoorthy

Page 1

Kevin Eady
Ben Plunkett
Prateeksha Satyamoorthy

Page 2

Page 3

History

- Jointly designed by Sony, Toshiba, and IBM (STI)
- Design began March 2001
- First used in Sony's PlayStation 3
- IBM's Roadrunner cluster contains over 12,000 Cell processors

[Figure: IBM Roadrunner cluster]

Page 4

Cell Broadband Engine

- Nine cores:
  - One Power Processing Element (PPE): the main processor
  - Eight Synergistic Processing Elements (SPE): fully functional co-processors, each comprised of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
- Stream processing
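To make the PPE/SPE division concrete, here is a minimal sketch of PPE-side host code using IBM's libspe2 library to load and run a program on a single SPE. The spu_kernel handle is a hypothetical name for an SPU binary embedded by the SPU toolchain; real applications typically call spe_context_run from a dedicated thread per SPE.

    #include <stdio.h>
    #include <libspe2.h>

    /* Hypothetical handle to an embedded SPU program, produced by the
     * SPU toolchain and linked into this PPE binary. */
    extern spe_program_handle_t spu_kernel;

    int main(void)
    {
        /* Create a context on one of the eight SPEs. */
        spe_context_ptr_t spe = spe_context_create(0, NULL);
        if (spe == NULL) { perror("spe_context_create"); return 1; }

        /* Copy the SPU ELF image into the SPE's 256 KB local store. */
        if (spe_program_load(spe, &spu_kernel) != 0) {
            perror("spe_program_load"); return 1;
        }

        /* Run the SPE program; this call blocks until the SPU halts. */
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;
        if (spe_context_run(spe, &entry, 0, NULL, NULL, &stop) < 0) {
            perror("spe_context_run"); return 1;
        }

        spe_context_destroy(spe);
        return 0;
    }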

Page 5

Power Processing Element

- In-order, dual-issue design
- 64-bit Power Architecture
- Two 32 KB L1 caches (instruction, data), one 512 KB L2 cache
- Instruction Unit: instruction fetch, decode, branch, issue, completion
  - Fetches 4 instructions per cycle per thread into a buffer
  - Dispatches instructions from the buffer, dual-issuing them to the Execution Unit
- Branch prediction: 4-KB x 2-bit branch history table
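For readers unfamiliar with the structure, the sketch below shows how a table of 2-bit saturating counters behaves, in software terms. It is purely illustrative, not the PPE's hardware logic, and reading "4-KB x 2-bit" as 4096 two-bit entries is an assumption.

    #include <stdint.h>

    #define BHT_ENTRIES 4096   /* assumption: "4-KB x 2-bit" = 4096 two-bit entries */

    static uint8_t bht[BHT_ENTRIES];   /* each entry holds a counter value 0..3 */

    /* Counter values 0-1 predict not-taken; 2-3 predict taken.
     * pc >> 2 because Power instructions are 4-byte aligned. */
    static int predict_taken(uint32_t pc)
    {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* After the branch resolves, nudge the counter toward the outcome,
     * saturating at 0 and 3. */
    static void train(uint32_t pc, int taken)
    {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken && *c < 3)
            (*c)++;
        else if (!taken && *c > 0)
            (*c)--;
    }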

Page 6

Pipeline depth: 23 stages

Page 7

Synergistic Processing Element

- Implements a new instruction-set architecture
- Each SPU contains a dedicated DMA management queue
- 256 KB local store memory
  - Stores instructions and data
  - Data is transferred via DMA between local store and system memory
- No hardware data-load or branch prediction
  - Relies on "prepare-to-branch" instructions to pre-fetch instructions at the branch target
  - Loads at least 17 instructions at the branch target address
- Two instructions per cycle
  - 128-bit SIMD
  - In-order, dual-issue, statically scheduled
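The local-store-plus-DMA model is easiest to see in SPU-side code. The following is a minimal sketch assuming the Cell SDK's spu_mfcio.h and spu_intrinsics.h interfaces: it DMAs a buffer from an effective address in system memory into local store, waits on the DMA tag, then processes it with 128-bit SIMD. The buffer size and the scaling factor are illustrative.

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define TAG 1

    /* Local-store buffer; DMA source and target should be 128-byte aligned. */
    static vec_float4 buf[256] __attribute__((aligned(128)));

    void process(unsigned long long ea)   /* effective address in system memory */
    {
        int i;

        /* Queue an asynchronous DMA get: system memory -> local store. */
        mfc_get(buf, ea, sizeof(buf), TAG, 0, 0);

        /* Block until every DMA tagged TAG has completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();

        /* 128-bit SIMD: each spu_mul multiplies four floats at once. */
        for (i = 0; i < 256; i++)
            buf[i] = spu_mul(buf[i], spu_splats(2.0f));
    }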

Page 8

Page 9

On-chip Interconnect: Element Interconnect Bus (EIB)

- Provides internal connection for 12 'units':
  - The PPE
  - 8 SPEs
  - The Memory Interface Controller (MIC)
  - 2 off-chip I/O interfaces
- Each 'unit' has one 16 B read port and one 16 B write port
- Circular ring: four 16-byte-wide unidirectional channels which counter-rotate in pairs

Page 10

Page 11

- Includes an arbitration unit which functions as a set of traffic lights
- Runs at half the system clock rate
- Peak instantaneous EIB bandwidth is 96 B per clock: 12 concurrent transactions x 16 bytes wide / 2 system clocks per transfer
- An EIB channel is not permitted to convey data requiring more than six steps
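The six-step limit follows from the ring geometry: with 12 units and channel pairs running in opposite directions, the arbiter can always route a transfer the shorter way around, which is at most 12 / 2 = 6 hops. A small sketch, with illustrative unit numbering rather than the chip's physical ordering:

    #include <stdio.h>

    #define UNITS 12

    /* Shorter of the two ring directions between src and dst. */
    static int eib_hops(int src, int dst)
    {
        int cw  = (dst - src + UNITS) % UNITS;   /* clockwise distance */
        int ccw = UNITS - cw;                    /* counter-clockwise distance */
        return cw < ccw ? cw : ccw;
    }

    int main(void)
    {
        int s, d, worst = 0;
        for (s = 0; s < UNITS; s++)
            for (d = 0; d < UNITS; d++)
                if (eib_hops(s, d) > worst)
                    worst = eib_hops(s, d);
        printf("worst-case hop count: %d\n", worst);   /* prints 6 */
        return 0;
    }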

Page 12

- Each unit on the EIB can simultaneously send and receive 16 B of data every bus cycle
- Maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system
- Theoretical peak data bandwidth on the EIB at 3.2 GHz: 128 B x 1.6 GHz = 204.8 GB/s
- Actual peak data bandwidth achieved: 197 GB/s
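The arithmetic behind these figures, spelled out below; reading the slide's 128 B per bus cycle as one snooped, cache-line-sized transfer per cycle is an interpretation on our part.

    #include <stdio.h>

    int main(void)
    {
        double bus_ghz    = 3.2 / 2.0;       /* EIB runs at half the 3.2 GHz system clock */
        double inst_bytes = 12 * 16 / 2.0;   /* 12 transfers x 16 B / 2 system clocks */
        double peak_gbs   = 128 * bus_ghz;   /* 128 B moved per snooped address per bus cycle */

        printf("peak instantaneous bandwidth: %.0f B per clock\n", inst_bytes);  /* 96 */
        printf("theoretical peak:             %.1f GB/s\n", peak_gbs);           /* 204.8 */
        printf("achieved peak (per slide):    ~197 GB/s\n");
        return 0;
    }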

Page 13

David Krolak explains: “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”

Page 14

Multi-threading Organization

- The PPE is an in-order, 2-way Simultaneous Multi-Threading (SMT) core
- Each SPU is a vector accelerator targeted at the execution of SIMD code
- All architectural state is duplicated to perform interleaved instruction issuing
- Asynchronous DMA transfers: setting up a DMA takes the SPE only a few cycles, whereas a cache miss on a conventional system can stall the CPU for up to thousands of cycles
- SPEs can perform other calculations while waiting for data
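The standard way SPE code exploits this overlap is double buffering: compute on one local-store buffer while the MFC fills the other. A minimal sketch, again assuming the SDK's spu_mfcio.h interface; CHUNK and the compute() callback are illustrative.

    #include <spu_mfcio.h>

    #define CHUNK 4096

    /* Two local-store buffers; while the SPU computes on one, the MFC
     * fills the other. */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(char *data, int n);   /* hypothetical per-chunk work */

    void stream(unsigned long long ea, int nchunks)
    {
        int i, cur = 0;

        /* Prime the pipeline: start fetching the first chunk. */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;

            /* Kick off the next transfer before touching this chunk. */
            if (i + 1 < nchunks)
                mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);

            /* Wait only on the current buffer's tag (0 or 1). */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            compute(buf[cur], CHUNK);   /* overlaps with the in-flight DMA */
            cur = nxt;
        }
    }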

Page 15

Page 16

Scheduling Policy

- Two classes of threads are defined:
  - PPU threads: run on the PPU
  - SPU tasks: run on the SPUs
- PPU threads are managed by the Completely Fair Scheduler (CFS)
- The SPU scheduler supports time-sharing in multi-programmed workloads and allows preemption of SPU tasks
- Cell-based systems allow only one active application to run at a time, to avoid performance degradation

Page 17

Completely Fair Scheduler

- Runnable tasks are ranked by virtual runtime (vruntime); the task that has received the least weighted CPU time runs next

Consider an example with two users, A and B, who are running jobs on a machine. User A has just two jobs running, while user B has 48 jobs running. Group scheduling enables CFS to be fair to users A and B, rather than being fair to all 50 jobs running in the system. Both users get a 50-50 share. B would use his 50% share to run his 48 jobs and would not be able to encroach on A's 50% share.
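The arithmetic in that example, made explicit; a trivial sketch of the per-job shares with and without group scheduling:

    #include <stdio.h>

    int main(void)
    {
        double user_share = 100.0 / 2;   /* two users -> 50% each */

        printf("user A, per job:  %6.2f%%\n", user_share / 2);     /* 25.00% */
        printf("user B, per job:  %6.2f%%\n", user_share / 48);    /*  1.04% */
        printf("flat (no groups): %6.2f%% per job\n", 100.0 / 50); /*  2.00% */
        return 0;
    }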