Cell Broadband Engine © 2006 IBM Corporation Chip Multiprocessing and the Cell Broadband Engine Michael Gschwind ACM Computing Frontiers 2006


Cell Broadband Engine

© 2006 IBM Corporation

Chip Multiprocessing and the Cell Broadband Engine

Michael Gschwind

ACM Computing Frontiers 2006


Computer Architecture at the turn of the millennium

“The end of architecture” was being proclaimed
– Frequency scaling as performance driver
– State of the art microprocessors
  • multiple instruction issue
  • out-of-order architecture
  • register renaming
  • deep pipelines

Little or no focus on compilers
– Questions about the need for compiler research

Academic papers focused on microarchitecture tweaks


The age of frequency scaling

SCALING:
– Voltage: V/α
– Oxide: tox/α
– Wire width: W/α
– Gate width: L/α
– Diffusion: xd/α
– Substrate: α * NA

RESULTS:
– Higher density: ~α^2
– Higher speed: ~α
– Power/ckt: ~1/α^2
– Power density: ~constant

[Figure: scaled device cross-section with gate, n+ source, n+ drain and wiring on a p substrate with doping α*NA; dimensions L/α, xd/α, W/α, tox/α; voltage V/α]

Source: Dennard et al., JSSC 1974.
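The scaling rules on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that capacitance per circuit scales as 1/α and that switching power per circuit follows C·V²·f, which is the usual reading of constant-field scaling. The function name `dennard_scale` is hypothetical.

```python
def dennard_scale(alpha):
    """Relative changes after classic scaling by a factor alpha (> 1).

    Assumption for this sketch: power per circuit ~ C * V^2 * f.
    """
    capacitance = 1.0 / alpha   # smaller devices, so less capacitance per circuit
    voltage = 1.0 / alpha       # V/alpha, as listed on the slide
    frequency = alpha           # "Higher speed: ~alpha"
    density = alpha ** 2        # "Higher density: ~alpha^2"

    power_per_circuit = capacitance * voltage ** 2 * frequency
    power_density = power_per_circuit * density
    return {
        "density": density,
        "frequency": frequency,
        "power_per_circuit": power_per_circuit,
        "power_density": power_density,
    }


result = dennard_scale(2.0)
print(result)
```

With α = 2 the model gives one quarter of the power per circuit and an unchanged power density, matching the RESULTS column.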

[Figure: performance (bips) vs. total FO4 per stage, with the deeper-pipeline direction marked]


Frequency scaling

A trusted standby
– Frequency as the tide that raises all boats
– Increased performance across all applications
– Kept the industry going for a decade

Massive investment to keep going
– Equipment cost
– Power increase
– Manufacturing variability
– New materials


… but reality was closing in!

[Figure: bips and bips^3/W relative to optimal FO4, vs. total FO4 per stage]

[Figure: active power and leakage power, logarithmic axes]


Power crisis

[Figure: Tox (Å), Vdd and Vt vs. gate length Lgate (µm), compared with classic scaling]

The power crisis is not “natural”
– Created by deviating from ideal scaling theory
– Vdd and Vt not scaled by α
  • additional performance with increased voltage

Laws of physics
– Pchip = ntransistors * Ptransistor

Marginal performance gain per transistor low
– significant power increase
– decreased power/performance efficiency


The power inefficiency of deep pipelining

[Figure: bips and bips^3/W relative to optimal FO4, vs. total FO4 per stage; the power-performance optimal and performance optimal design points and the deeper-pipeline direction are marked]

Deep pipelining increases number of latches and switching rate
⇒ power increases with at least f^2

Latch insertion delay limits gains of pipelining

Source: Srinivasan et al., MICRO 2002


Hitting the memory wall

Latency gap
– Memory speedup lags behind processor speedup
– limits ILP

Chip I/O bandwidth gap
– Less bandwidth per MIPS

Latency gap as application bandwidth gap
– Typically (much) less than chip I/O bandwidth
– usable bandwidth = avg. request size × no. of inflight requests / roundtrip latency

[Figure: normalized MFLOPS and memory latency, 1990 vs. 2003]

Source: McKee, Computing Frontiers 2004

Source: Burger et al., ISCA 1996
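The terms on this slide (usable bandwidth, average request size, round-trip latency, number of in-flight requests) can be read as an instance of Little's law. The sketch below assumes that reading. The function names and all numbers are hypothetical and are not measurements from the talk.

```python
import math


def usable_bandwidth(avg_request_bytes, inflight_requests, roundtrip_latency_s):
    """Bytes per second delivered when `inflight_requests` are kept outstanding."""
    return avg_request_bytes * inflight_requests / roundtrip_latency_s


def required_inflight(target_bytes_per_s, roundtrip_latency_s, avg_request_bytes):
    """Smallest number of outstanding requests that reaches the target bandwidth."""
    return math.ceil(target_bytes_per_s * roundtrip_latency_s / avg_request_bytes)


# Illustrative values only: 128-byte requests, 8 in flight, 0.5 s round trip.
print(usable_bandwidth(128, 8, 0.5))        # 2048.0 bytes/s
print(required_inflight(1000, 1.0, 128))    # 8 requests
```

Under this model the only ways to raise usable bandwidth at a fixed latency are larger requests or more of them in flight.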


Our Y2K challenge

10× performance of desktop systems

1 TeraFlop / second with a four-node configuration

1 Byte bandwidth per 1 Flop – “golden rule for balanced supercomputer design”

scalable design across a range of design points

mass-produced and low cost


Cell Design Goals

Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005

Computing density as main challenge
– Dramatically increase performance per X
  • X = Area, Power, Volume, Cost,…

Single core designs offer diminishing returns on investment
– In power, area, design complexity and verification cost


Necessity as the mother of invention

Increase power-performance efficiency
– Simple designs are more efficient in terms of power and area

Increase memory subsystem efficiency
– Increasing data transaction size
– Increase number of concurrently outstanding transactions
⇒ Exploit larger fraction of chip I/O bandwidth

Use CMOS density scaling
– Exploit density instead of frequency scaling to deliver increased aggregate performance

Use compilers to extract parallelism from application
– Exploit application parallelism to translate aggregate performance to application performance


Exploit application-level parallelism

Data-level parallelism
– SIMD parallelism improves performance with little overhead

Thread-level parallelism
– Exploit application threads with multi-core design approach

Memory-level parallelism
– Improve memory access efficiency by increasing number of parallel memory transactions

Compute-transfer parallelism
– Transfer data in parallel to execution by exploiting application knowledge


Concept phase ideas

[Diagram: ideas grouped around a chip multiprocessor]
– efficient use of memory interface
– large architected register set
– high frequency and low power!
– reverse Vdd scaling for low power
– modular design, design reuse
– control cost of coherency


Cell Architecture

Heterogeneous multi-core system architecture
– Power Processor Element for control tasks
– Synergistic Processor Elements for data-intensive processing

Synergistic Processor Element (SPE) consists of
– Synergistic Processor Unit (SPU)
– Synergistic Memory Flow Control (SMF)

[Block diagram: one PPE (64-bit Power Architecture with VMX; PPU with PXU, L1 and L2) and eight SPEs (each with SPU/SXU, LS and SMF) attached to the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and the BIC (FlexIO™); link labels: 16B/cycle, 16B/cycle (2x), 32B/cycle]


Shifting the Balance of Power with Cell Broadband Engine

Data processor instead of control system
– Control-centric code stable over time
– Big growth in data processing needs
  • Modeling
  • Games
  • Digital media
  • Scientific applications

Today’s architectures are built on a 40 year old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Parallelism added as an after-thought


Powering Cell – the Synergistic Processor Unit

[Block diagram: SPU with VRF, PERM, LSU, VFPU, VFXU, Local Store (single-port SRAM), Issue/Branch, Fetch, ILB and SMF; labels: 2 instructions, 16B x 2, 16B x 3 x 2, 64B, 16B, 128B]

Source: Gschwind et al., Hot Chips 17, 2005


Density Computing in SPEs

Today, execution units only fraction of core area and power
– Bigger fraction goes to other functions
  • Address translation and privilege levels
  • Instruction reordering
  • Register renaming
  • Cache hierarchy

Cell changes this ratio to increase performance per area and power
– Architectural focus on data processing
  • Wide datapaths
  • More and wide architectural registers
  • Data privatization and single level processor-local store
  • All code executes in a single (user) privilege level
  • Static scheduling


Streamlined Architecture

Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research

Focus on statically scheduled data parallelism
– Focus on data parallel instructions
  • No separate scalar execution units
  • Scalar operations mapped onto data parallel dataflow
– Exploit wide data paths
  • data processing
  • instruction fetch
– Address impediments to static scheduling
  • Large register set
  • Reduce latencies by eliminating non-essential functionality
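To illustrate what mapping scalar operations onto a data-parallel dataflow can look like, here is a minimal sketch. It models a register as four 32-bit lanes and performs a scalar add as a full vector add, reading the result from one lane. The lane chosen (`SCALAR_SLOT = 0`) and all function names are assumptions made for this example, not details taken from the slide.

```python
LANES = 4             # four 32-bit lanes in one 128-bit register (model only)
MASK32 = 0xFFFFFFFF   # wrap results to 32 bits
SCALAR_SLOT = 0       # assumption: the lane that carries scalar values


def to_register(scalar):
    """Place a scalar in the scalar slot; the other lanes are don't-care (0 here)."""
    reg = [0] * LANES
    reg[SCALAR_SLOT] = scalar & MASK32
    return reg


def vector_add(a, b):
    """Lane-wise 32-bit add, the only adder this model has."""
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def scalar_add(x, y):
    """A scalar add expressed with the vector adder: no separate scalar unit."""
    return vector_add(to_register(x), to_register(y))[SCALAR_SLOT]


print(scalar_add(2, 3))                               # 5
print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))     # [11, 22, 33, 44]
```

The same `vector_add` serves both scalar and SIMD code, which is the point of not having separate scalar execution units.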


SPE Highlights

RISC organization
– 32 bit fixed instructions
– Load/store architecture
– Unified register file

User-mode architecture
– No page translation within SPU

SIMD dataflow
– Broad set of operations (8, 16, 32, 64 Bit)
– Graphics SP-Float
– IEEE DP-Float

256KB Local Store
– Combined I & D

DMA block transfer
– using Power Architecture memory translation

[Die photo: SPE floorplan, 14.5 mm^2 (90nm SOI), with blocks labeled LS (four blocks), GPR, FXU ODD, FXU EVN, SFP, DP, CONTROL, CHANNEL, DMA, SMM, ATO, SBI, RTB, BEB, FWD]

Source: Kahle, Spring Processor Forum 2005


Synergistic Processing

[Diagram: interrelated design features]
– data parallel select
– static prediction
– simpler μarch
– sequential fetch
– local store
– high frequency
– shorter pipeline
– large basic blocks
– determ. latency
– large register file
– instruction bundling
– DLP
– wide data paths
– compute density
– single port
– static scheduling
– scalar layering
– ILP opt.
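The "data parallel select" item in this diagram refers to replacing branches with a compare that produces a mask, followed by a bitwise select. The sketch below models that idea on four 32-bit lanes. It is a plain-Python illustration with hypothetical function names, not SPU code. The conditional inside `compare_gt` stands in for a hardware compare and is not a branch in the modeled program.

```python
MASK32 = 0xFFFFFFFF


def compare_gt(a, b):
    """Per-lane compare: all-ones mask where a > b, zero elsewhere."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]


def select(a, b, mask):
    """Per-lane bitwise select: bits of b where mask is 1, bits of a where it is 0."""
    return [(x & ~m & MASK32) | (y & m) for x, y, m in zip(a, b, mask)]


def vector_max(a, b):
    """Branch-free maximum: take a where a > b, otherwise b."""
    return select(b, a, compare_gt(a, b))


print(vector_max([1, 5, 3, 7], [4, 2, 3, 9]))   # [4, 5, 3, 9]
```

Because both alternatives are computed and merged, an if/else collapses into straight-line code, which is one way to obtain the larger basic blocks the diagram also lists.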


Efficient data-sharing between scalar and SIMD processing

Legacy architectures separate scalar and SIMD processing
– Data sharing between SIMD and scalar processing units expensive
  • Transfer penalty between register files
– Defeats data-parallel performance improvement in many scenarios

Unified register file facilitates data sharing for efficient exploitation of data parallelism
– Allow exploitation of data parallelism without data transfer penalty
  • Data-parallelism always an improvement


SPU Pipeline

SPU PIPELINE FRONT END
– IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2

SPU PIPELINE BACK END
[Pipeline diagram: back-end paths for branch, permute, load/store, fixed point and floating point instructions; stage sequences shown: RF1 RF2; EX1 EX2 WB; EX1 EX2 EX3 EX4 (permute); EX1 EX2 EX3 EX4 EX5 EX6 WB]

Legend
– IF Instruction Fetch
– IB Instruction Buffer
– ID Instruction Decode
– IS Instruction Issue
– RF Register File Access
– EX Execution
– WB Write Back


SPE Block Diagram

[Block diagram: Permute Unit, Load-Store Unit, Floating-Point Unit, Fixed-Point Unit, Branch Unit, Channel Unit, Result Forwarding and Staging, Register File, Instruction Issue Unit / Instruction Line Buffer, Local Store (256kB, single-port SRAM, 128B read, 128B write), DMA Unit and On-Chip Coherent Bus; path-width legend: 8, 16, 64 and 128 Byte/Cycle]


Synergistic Memory Flow Control

[Block diagram: SPE with SPU (SXU, LS) and SMF; the SMF contains DMA Engine, DMA Queue, RMT, MMU, Atomic Facility, Bus I/F Control and MMIO; connections: Data Bus, Snoop Bus, Control Bus, Translate Ld/St, MMIO]

SMF implements memory management and mapping

SMF operates in parallel to SPU
– Independent compute and transfer
– Command interface from SPU
  • DMA queue decouples SMF & SPU
  • MMIO-interface for remote nodes

Block transfer between system memory and local store

SPE programs reference system memory using user-level effective address space
– Ease of data sharing
– Local store to local store transfers
– Protection


Synergistic Memory Flow Control

Bus Interface
– Up to 16 outstanding DMA requests
– Requests up to 16KByte
– Token-based Bus Access Management

Element Interconnect Bus
– Four 16 byte data rings
– Multiple concurrent transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
– 200+ GByte/s @ 3.2+ GHz

[Diagram: EIB with central data arbiter and 16B ports connecting SPE0 through SPE7, PPE, MIC, BIF/IOIF0 and IOIF1]

Source: Clark et al., Hot Chips 17, 2005
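The two bus-interface limits on this slide (requests up to 16 KByte, up to 16 outstanding) suggest how a large block transfer has to be broken up. The following is a minimal sketch of that bookkeeping only. The function names are hypothetical, and real DMA commands have further size and alignment rules that are not modeled here.

```python
MAX_REQUEST_BYTES = 16 * 1024   # "Requests up to 16KByte"
MAX_OUTSTANDING = 16            # "Up to 16 outstanding DMA requests"


def split_transfer(total_bytes, max_request=MAX_REQUEST_BYTES):
    """Split one large transfer into (offset, size) requests within the size limit."""
    requests = []
    offset = 0
    while offset < total_bytes:
        size = min(max_request, total_bytes - offset)
        requests.append((offset, size))
        offset += size
    return requests


def group_into_waves(requests, max_outstanding=MAX_OUTSTANDING):
    """Group requests so that no more than `max_outstanding` are in flight at once."""
    return [requests[i:i + max_outstanding]
            for i in range(0, len(requests), max_outstanding)]


reqs = split_transfer(40 * 1024)
print(reqs)                                                  # three requests, last one 8 KiB
print(len(group_into_waves(split_transfer(1024 * 1024))))    # 64 requests in 4 waves
```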


Memory efficiency as key to application performance

Greatest challenge in translating peak into app performance
– Peak Flops useless without way to feed data

Cache miss provides too little data too late
– Inefficient for streaming / bulk data processing
– Initiates transfer when transfer results are already needed
– Application-controlled data fetch avoids not-on-time data delivery

SMF is a better way to look at an old problem
– Fetch data blocks based on algorithm
– Blocks reflect application data structures


Traditional memory usage

Long latency memory access operations exposed

Cannot overlap 500+ cycles with computation

Memory access latency severely impacts application performance

[Timeline figure; legend: computation, mem protocol, mem idle, mem contention]


Exploiting memory-level parallelism

Reduce performance impact of memory accesses with concurrent access

Carefully scheduled memory accesses in numeric code

Out-of-order execution increases chance to discover more concurrent accesses
– Overlapping 500 cycle latency with computation using OoO illusory

Bandwidth limited by queue size and roundtrip latency


Exploiting MLP with Chip Multiprocessing

More threads, more memory-level parallelism
– overlap accesses from multiple cores


A new form of parallelism: CTP

Compute-transfer parallelism
– Concurrent execution of compute and transfer increases efficiency
  • Avoid costly and unnecessary serialization
– Application thread has two threads of control
  • SPU computation thread
  • SMF transfer thread

Optimize memory access and data transfer at the application level
– Exploit programmer and application knowledge about data access patterns
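A simple timing model shows why running compute and transfer concurrently pays off. The sketch assumes equally sized blocks, a fixed transfer time and compute time per block, and two buffers so that the next block can be fetched while the current one is processed. The function names and numbers are hypothetical.

```python
def serial_time(n_blocks, t_transfer, t_compute):
    """Fetch a block, then compute on it, one block after the other."""
    return n_blocks * (t_transfer + t_compute)


def double_buffered_time(n_blocks, t_transfer, t_compute):
    """Overlap the transfer of block i+1 with the computation on block i.

    Only the first transfer and the last computation are fully exposed;
    every step in between costs the slower of the two activities.
    """
    if n_blocks == 0:
        return 0
    return t_transfer + (n_blocks - 1) * max(t_transfer, t_compute) + t_compute


# Illustrative units: 10 blocks, 4 time units to transfer, 6 to compute.
print(serial_time(10, 4, 6))            # 100
print(double_buffered_time(10, 4, 6))   # 64
```

In this model the transfer time disappears from the steady state whenever the computation per block takes at least as long as the transfer.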


Memory access in Cell with MLP and CTP

Super-linear performance gains observed
– decouple data fetch and use


Synergistic Memory Management

[Diagram: interrelated design features]
– multi core
– many outstanding transfers
– sequential fetch
– high throughput
– timely data delivery
– shared I & D
– reduce miss complexity
– parallel compute & transfer
– improved code gen
– decouple fetch and access
– explicit data transfer
– storage density
– single port
– application control
– reduced coherence cost
– low latency access
– block fetch
– deeper fetch queue
– local store


Cell BE Implementation Characteristics

241M transistors

235 mm^2

Design operates across wide frequency range
– Optimize for power & yield

> 200 GFlops (SP) @3.2GHz

> 20 GFlops (DP) @3.2GHz

Up to 25.6 GB/s memory bandwidth

Up to 75 GB/s I/O bandwidth

100+ simultaneous bus transactions
– 16+8 entry DMA queue per SPE

[Figure: frequency increase vs. power consumption (relative) as a function of voltage]

Source: Kahle, Spring Processor Forum 2005
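The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below makes several assumptions that are not stated on the slide: only the eight SPEs are counted, each SPE operates on four single-precision lanes per cycle, and a fused multiply-add is counted as two floating-point operations. The function name is hypothetical.

```python
def peak_gflops(n_cores, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak throughput in GFlops for a simple cores x lanes x clock model."""
    return n_cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumptions: 8 SPEs, 4 single-precision lanes, multiply-add = 2 flops, 3.2 GHz.
sp_peak = peak_gflops(8, 4, 2, 3.2)

# Memory bandwidth from the slide (25.6 GB/s) relative to that peak.
bytes_per_flop = 25.6 / sp_peak

# DMA queue entries across 8 SPEs, using "16+8 entry DMA queue per SPE".
dma_entries = 8 * (16 + 8)

print(sp_peak, bytes_per_flop, dma_entries)
```

Under these assumptions the model gives about 204.8 GFlops, consistent with "> 200 GFlops (SP)", roughly 0.125 bytes of memory bandwidth per peak flop, and 192 DMA queue entries, consistent with "100+ simultaneous bus transactions".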


Cell Broadband Engine


System configurations

[Diagrams: system configurations built from Cell BE processors, each with two XDR™ memory channels]
– Single processor with IOIF0 and IOIF1
– Two processors connected through BIF, each with IOIF
– Four processors connected through BIF and a switch, each with IOIF

[Diagram: Cell design targets]
– Game console systems
– Blades
– HDTV
– Home media servers
– Supercomputers


Compiling for the Cell Broadband Engine

The lesson of “RISC computing”
– Architecture provides fast, streamlined primitives to compiler
– Compiler uses primitives to implement higher-level idioms
– If the compiler can’t target it, do not include it in the architecture

Compiler focus throughout project
– Prototype compiler soon after first proposal
– Cell compiler team has made significant advances in
  • Automatic SIMD code generation
  • Automatic parallelization
  • Data privatization

[Diagram: Cell design positioned between programmability, programmer productivity and raw hardware performance]


Single source automatic parallelization

Single source compiler
– Address legacy code and programmer productivity
– Partition a single source program into SPE and PPE functions

Automatic parallelization
– Across multiple threads
– Vectorization to exploit SIMD units

Compiler managed local store
– Data and code blocking and tiling
– “Software cache” managed by compiler

Range of programmer input
– Fully automatic
– Programmer-guided
  • pragmas, intrinsics, OpenMP directives

Prototype developed in IBM Research
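As an illustration of the “software cache” idea mentioned on this slide, the sketch below shows a small direct-mapped, read-only cache implemented entirely in software: a tag check on every load and a block copy from backing memory on a miss. It is not the scheme used by the Cell compiler. The class name, line size and line count are hypothetical.

```python
class SoftwareCache:
    """Direct-mapped, read-only cache kept in a small local buffer."""

    def __init__(self, memory, n_lines=4, line_size=8):
        self.memory = memory            # stands in for system memory
        self.n_lines = n_lines
        self.line_size = line_size
        self.tags = [None] * n_lines    # which memory line each slot holds
        self.lines = [None] * n_lines   # the cached data, one list per slot
        self.misses = 0

    def load(self, addr):
        line_no = addr // self.line_size
        slot = line_no % self.n_lines
        if self.tags[slot] != line_no:
            # Miss: copy the whole line in one block transfer.
            start = line_no * self.line_size
            self.lines[slot] = self.memory[start:start + self.line_size]
            self.tags[slot] = line_no
            self.misses += 1
        return self.lines[slot][addr % self.line_size]


memory = list(range(64))
cache = SoftwareCache(memory)
values = [cache.load(i) for i in range(16)]
print(values == memory[:16], cache.misses)   # True 2
```

Sequential accesses miss once per line and hit afterwards; a compiler-managed version would insert the equivalent of `load` around memory references it cannot block or tile statically.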


Compiler-driven performance

Code partitioning
– Thread-level parallelism
– Handle local store size
– “Outlining” into SPE overlays

Data partitioning
– Data access sequencing
– Double buffering
– “Software cache”

Data parallelization
– SIMD vectorization
– Handling of data with unknown alignment

Source: Eichenberger et al., PACT 2005


Raising the bar with parallelism…

Data-parallelism and static ILP in both core types
– Results in low overhead per operation

Multithreaded programming is key to great Cell BE performance
– Exploit application parallelism with 9 cores
– Regardless of whether code exploits DLP & ILP
– Challenge regardless of homogeneity/heterogeneity

Leverage parallelism between data processing and data transfer
– A new level of parallelism exploiting bulk data transfer
– Simultaneous processing on SPUs and data transfer on SMFs
– Offers superlinear gains beyond MIPS-scaling


… while understanding the trade-offs

Uniprocessor efficiency is actually low
– Gelsinger’s Law captures historic (performance) efficiency
  • 1.4x performance for 2x transistors
– Marginal uni-processor (performance) efficiency is 40% (or lower!)
  • And power efficiency is even worse

The “true” bar is marginal uniprocessor efficiency
– A multiprocessor “only” has to beat a uniprocessor to be the better solution
– Many low-hanging fruit to be picked in multithreading applications
  • Embarrassing application parallelism which has not been exploited
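The 40% figure follows directly from "1.4x performance for 2x transistors". The sketch below reproduces that arithmetic and then asks how parallel an application must be before two cores beat a single core built from twice the transistors. Using Amdahl's law for the two-core case is an assumption added for this illustration and is not part of the slide. The function names are hypothetical.

```python
def marginal_efficiency(perf_gain, transistor_gain):
    """Extra performance obtained per unit of extra transistors."""
    return (perf_gain - 1.0) / (transistor_gain - 1.0)


def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup of n cores over one core for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


def breakeven_parallel_fraction(target_speedup, n_cores):
    """Parallel fraction at which n cores match the target speedup."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n_cores)


print(marginal_efficiency(1.4, 2.0))          # about 0.4
print(breakeven_parallel_fraction(1.4, 2))    # about 0.571
```

In this model a two-core design matches the 1.4x uniprocessor once roughly 57% of the work can run in parallel, which is one way to read "the true bar is marginal uniprocessor efficiency".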


Cell: a Synergistic System Architecture

Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model

SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– SMF implements interface to memory
  • Copy in/copy out to local storage

Power Architecture provides system functions
– Virtualization
– Address translation and protection
– External exception handling

Element Interconnect Bus integrates system as data transport hub


A new wave of innovation

CMPs everywhere
– same ISA CMPs
– different ISA CMPs
– homogeneous CMP
– heterogeneous CMP
– thread-rich CMPs
– GPUs as compute engine

Compilers and software must and will follow…

Novel systems
– Power4
– BOA
– Piranha
– Cyclops
– Niagara
– BlueGene
– Power5
– Xbox360
– Cell BE


Conclusion

Established performance drivers are running out of steam

Need new approach to build high performance system
– Application and system-centric optimization
– Chip multiprocessing
– Understand and manage memory access
– Need compilers to orchestrate data and instruction flow

Need new push on understanding workload behavior

Need renewed focus on compilation technology

An exciting journey is ahead of us!


© Copyright International Business Machines Corporation 2005,6. All Rights Reserved. Printed in the United States May 2006.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
– IBM
– IBM Logo
– Power Architecture

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.