Multithreaded Processors
Dezső Sima
Spring 2007
(Ver. 2.1) Dezső Sima, 2007
Overview
1. Introduction
2. Overview of multithreaded cores
3. Thread scheduling
4. Case examples
4.1. Coarse grained multithreaded cores
4.2. Fine grained multithreaded cores
4.3. SMT cores
1. Introduction
Aim of multithreading
To raise performance (beyond superscalar or EPIC execution) by introducing and utilizing finer grained parallelism than multitasking at execution.
Thread
A flow of control, i.e. a dynamic sequence of instructions to be executed.
1. Introduction (1)
(Process/thread management example: in sequential programming, processes P1, P2, P3 run one after another; in multitasked programming, processes P1-P3 are created and managed with calls such as CreateProcess(), fork(), exec() and join(); in multithreaded programming, a process spawns threads T1-T6 with calls such as CreateThread().)
1. Introduction (2)
Figure 1.1: Principle of sequential, multitasked and multithreaded programming
Main features of multithreading
Threads
• belong to the same process,
• usually share a common address space (otherwise multiple address translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously, i.e. overlapped by time sharing, or in parallel), depending on the implementation of multithreading.
Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.
Basic thread states
• thread program state (state of the ISA), including the PC, FX/FP architectural registers and state registers,
• thread microstate (supplementary state of the microarchitecture), including the rename register mappings, branch history, ROB etc.
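The states listed above can be pictured as a per-thread data structure, together with the copy work a software context switch has to do (a hypothetical sketch; field names are illustrative, not taken from a real ISA):

```c
/* Hypothetical per-thread program state. A multithreaded core keeps one such
 * structure per thread in hardware; in software multithreading the OS must
 * copy it on every context switch. */
#include <stdint.h>

typedef struct {
    uint64_t pc;        /* program counter */
    uint64_t gpr[32];   /* FX architectural registers */
    double   fpr[32];   /* FP architectural registers */
    uint64_t sr;        /* state/status register(s) */
} thread_state;

/* Software context switch: save the state of the suspended thread and load
 * the state of the thread to be executed next into the (single) set of
 * architectural registers. */
void context_switch(thread_state *arch, thread_state *save, const thread_state *next)
{
    *save = *arch;   /* save suspended thread */
    *arch = *next;   /* load next thread */
}
```

The thread microstate (rename mappings, branch history, ROB) is not copied this way; the OS cannot reach it, which is why it is either flushed or tagged in hardware multithreading.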
1. Introduction (3)
Implementation of multithreading
Software multithreading: execution of multithreaded apps/OSs on a single threaded processor; the threads run simultaneously (i.e. by time sharing). Fast context switching between threads is required. Multiple threads are maintained simultaneously by the OS (while executing multithreaded apps/OSs) → multithreaded OSs.
Hardware multithreading: execution of multithreaded apps/OSs on a multithreaded processor; the threads run concurrently. Multiple threads are maintained concurrently by the processor → multithreaded processors.
1. Introduction (4)
Multithreaded processors
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing): two or more cores on a chip, sharing the L2/L3 caches and the L3/memory interface.
• Multithreaded cores: a single MT core on the chip, with its L2/L3 caches and L3/memory interface.
1. Introduction (5)
Requirement of software multithreading
Maintaining multiple thread program states concurrently by the OS, including the PC, the FX/FP architectural registers and the state registers.
1. Introduction (6)
Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including the PC, the FX/FP architectural registers and the state registers.
• Maintaining multiple thread microstates, pertaining to the rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer and the store queue; in case of merged architectural and rename registers, appropriately large register file sizes (FX/FP), etc.
Options to provide multiple states
• Implementing individual per-thread structures, like 2 or 4 sets of FX registers.
• Implementing tagged structures, like a tagged ROB, a tagged buffer etc.
Multicore processors vs. multithreaded cores
• Additional complexity: ~(60-80) % (multicore processors) vs. ~(2-10) % (multithreaded cores)
• Additional gain (in general purpose apps): ~(60-80) % vs. ~(0-30) %
1. Introduction (7)
Multithreaded OSs
• Windows NT
• OS/2
• Unix w/Posix
• most OSs developed from the 90’s on
1. Introduction (8)
Sequential programs
• Description: a single process on a single processor.
• Key features: no issues with parallel programs.
• Key issue: the sequential bottleneck.
Multitasked programs
• Description: multiple processes on a single processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; private address spaces.
• Key issue: solutions for fast context switching.
Multithreaded programs, software multithreading
• Description: multithreaded software on a single threaded processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; shared process address spaces; thread context switches needed.
• Key issue: thread state management and context switching.
Multithreaded programs, hardware multithreading on a multithreaded core
• Description: multithreaded software on a multithreaded core.
• Key features: simultaneous execution of threads; threads share address space; no thread context switches needed (except coarse grained MT).
• Key issues: thread scheduling; intra-core communication.
Multithreaded programs, hardware multithreading on a multicore processor
• Description: multithreaded software on a multicore processor.
• Key features: true parallel execution of threads; threads share address space; no thread context switches needed.
• Key issues: thread scheduling; intra-core communication.
1. Introduction (9)
Figure 1.2: Contrasting sequential, multitasked and multithreaded execution (1)
Sequential programs
• OS support: legacy OSs.
• Software development: no API level support.
• Performance level: low.
Multitasked programs
• OS support: traditional Unix; most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process life cycle management API.
• Performance level: low-medium.
Multithreaded programs, software multithreading
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: high.
Multithreaded programs, hardware multithreading on a multithreaded core
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: higher.
Multithreaded programs, hardware multithreading on a multicore processor
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: highest.
1. Introduction (10)
Figure 1.3: Contrasting sequential, multitasked and multithreaded execution (2)
2. Overview of multithreaded cores
Figure 2.1: Intel’s multithreaded desktop families
• SCMT: Pentium 4 (Northwood B), 11/02; 55 mtrs./82 W; 130 nm/146 mm2; 2-way MT
• SCMT: Pentium 4 (Prescott), 02/04; 125 mtrs./103 W; 90 nm/112 mm2; 2-way MT
• DCMT: Pentium EE 840 (Smithfield), 5/05; 230 mtrs./130 W; 90 nm/2*103 mm2; 2-way MT/core
• DCMT: Pentium EE 955/965 (Presler), 1/06; 2*188 mtrs./130 W; 65 nm/2*81 mm2; 2-way MT/core
2. Overview of multithreaded cores (1)
Figure 2.2: Intel’s multithreaded Xeon DP-families
• SCMT: Pentium 4 (Prestonia-A), 2/02; 55 mtrs./55 W; 130 nm/146 mm2; 2-way MT
• SCMT: Pentium 4 (Irwindale-A), 11/03; 169 mtrs./110 W; 130 nm/135 mm2; 2-way MT
• SCMT: Pentium 4 (Nocona), 6/04; 125 mtrs./103 W; 90 nm/112 mm2; 2-way MT
• DCMT: Xeon DP 2.8 (Paxville DP), 10/05; 2*169 mtrs./135 W; 90 nm/2*135 mm2; 2-way MT/core
• DCMT: Xeon 5000 (Dempsey), 6/06; 2*188 mtrs./95/130 W; 65 nm/2*81 mm2; 2-way MT/core
2. Overview of multithreaded cores (2)
Figure 2.3: Intel’s multithreaded Xeon MP-families
• SCMT: Pentium 4 (Foster-MP), 3/02; 108 mtrs./64 W; 180 nm/n/a; 2-way MT
• SCMT: Pentium 4 (Gallatin), 3/04; 178/286 mtrs./77 W; 130 nm/310 mm2; 2-way MT
• SCMT: Pentium 4 (Potomac), 3/05; 675 mtrs./95/129 W; 90 nm/339 mm2; 2-way MT
• DCMT: Xeon 7000 (Paxville MP), 11/05; 2*169 mtrs./95/150 W; 90 nm/2*135 mm2; 2-way MT/core
• DCMT: Xeon 7100 (Tulsa), 8/06; 1328 mtrs./95/150 W; 65 nm/435 mm2; 2-way MT/core
2. Overview of multithreaded cores (3)
Figure 2.4: Intel’s multithreaded EPIC based server family
• DCMT: 9x00 (Montecito), 7/06; 1720 mtrs./104 W; 90 nm/596 mm2; 2-way MT/core
2. Overview of multithreaded cores (4)
Figure 2.5: IBM’s multithreaded server families
• SCMT: RS64 IV (Sstar), 2000; 44 mtrs./n/a; 180 nm/n/a; 2-way MT
• DCMT: POWER5, 5/04; 276 mtrs./80 W (est.); 130 nm/389 mm2; 2-way MT/core
• DCMT: POWER5+, 10/05; 276 mtrs./70 W; 90 nm/230 mm2; 2-way MT/core
• SCMT: Cell BE PPE, 2006; 234* mtrs./95* W; 90 nm/221* mm2; 2-way MT (*: entire processor)
• DCMT: POWER6, 2007; 750 mtrs./~100 W; 65 nm/341 mm2; 2-way MT/core
2. Overview of multithreaded cores (5)
Figure 2.6: Sun’s and Fujitsu’s multithreaded server families
• 8CMT: UltraSPARC T1 (Niagara), 11/2005; 279 mtrs./63 W; 90 nm/379 mm2; 4-way MT/core
• 8CMT: UltraSPARC T2 (Niagara II), 2007; 72 W (est.); 65 nm/342 mm2; 8-way MT/core
• DCMT: APL SPARC64 VI (Olympus), 2007; 540 mtrs./120 W; 90 nm/421 mm2; 2-way MT/core
• QCMT: APL SPARC64 VII (Jupiter), 2008; ~120 W; 65 nm/464 mm2; 2-way MT/core
2. Overview of multithreaded cores (6)
Figure 2.7: RMI’s multithreaded XLR family (scalar RISC)
• 8CMT: XLR 5xx, 5/05; 333 mtrs./10-50 W; 90 nm/~220 mm2; 4-way MT/core
2. Overview of multithreaded cores (7)
Figure 2.8: DEC’s/Compaq’s multithreaded processor
• SCMT: Alpha 21464 (V8), planned for 2003; 250 mtrs./10-50 W; 130 nm/n/a; 4-way MT; cancelled 6/2001
2. Overview of multithreaded cores (8)
Underlying core(s)
• Scalar core(s):
SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
RMI XLR 5xx (2005), 8 cores/4T
IBM RS64 IV (2000) (SStar), single-core/2T
• Superscalar core(s):
Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
DEC 21464 (2003), single-core/4T
IBM POWER5 (2005), dual-core/2T
PPE of Cell BE (2006), single-core/2T
Fujitsu SPARC64 VI / VII, dual-core/quad-core/2T
• VLIW core(s):
SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
Intel Montecito (2006), dual-core/2T
2. Overview of multithreaded cores (9)
3. Thread scheduling
Thread scheduling in software multithreading on a traditional superscalar processor
The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).
Figure 3.1: Thread scheduling assuming software multithreading on a 4-way superscalar processor
(The figure plots dispatch slots against clock cycles: Thread1 runs, a context switch follows, then Thread2 runs.)
3. Thread scheduling (1)
Thread scheduling in multicore processors (CMPs)
Both superscalar cores execute different threads independently.
Figure 3.2: Thread scheduling in a dual core processor
(The figure plots the dispatch slots of both cores against clock cycles: Thread1 and Thread2 run in parallel.)
3. Thread scheduling (2)
Coarse grained MT
Thread scheduling in multithreaded cores
3. Thread scheduling (3)
Figure 3.3: Thread scheduling in a 4-way coarse grained multithreaded processor
Threads are switched by means of rapid, HW-supported context switches.
3. Thread scheduling (4)
(The figure plots dispatch/issue slots against clock cycles: Thread1 runs, a HW-supported context switch follows, then Thread2 runs.)
Coarse grained MT
• Scalar based:
IBM RS64 IV (2000) (SStar), single-core/2T
• Superscalar based: (none)
• VLIW based:
SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
Intel Montecito (2006), dual-core/2T
3. Thread scheduling (5)
Coarse grained MT Fine grained MT
Thread scheduling in multithreaded cores
3. Thread scheduling (6)
Figure 3.4: Thread scheduling in a 4-way fine grained multithreaded processor
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.
3. Thread scheduling (7)
(The figure plots dispatch/issue slots against clock cycles: in each cycle all slots are filled from one of Thread1-Thread4.)
Fine grained MT
Selection policy: round robin or priority based.
• Scalar based:
SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• Superscalar based:
PPE of Cell BE (2006), single-core/2T
• VLIW based: (none)
3. Thread scheduling (8)
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
3. Thread scheduling (9)
Figure 3.5: Thread scheduling in a 4-way simultaneous multithreaded processor
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: Proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
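The difference between fine grained MT and SMT can be put into a toy issue-slot model (illustrative only; ready[i] is the number of issueable instructions thread i has in a given cycle):

```c
/* Toy model contrasting fine grained MT (all issue slots of a cycle go to
 * one thread) with SMT (slots may be filled from several threads in the
 * same cycle). Both functions return how many of the `width` slots are
 * filled in one cycle. */

static int min(int a, int b) { return a < b ? a : b; }

/* Fine grained MT: the scheduler picks one thread for the whole cycle. */
int slots_fine_grained(const int ready[], int nthreads, int width)
{
    int best = 0;
    for (int i = 0; i < nthreads; i++)
        if (ready[i] > best) best = ready[i];
    return min(best, width);
}

/* SMT: slots are filled from all threads until the width is exhausted. */
int slots_smt(const int ready[], int nthreads, int width)
{
    int used = 0;
    for (int i = 0; i < nthreads && used < width; i++)
        used += min(ready[i], width - used);
    return used;
}
```

For example, with ready instruction counts {2, 1, 1, 0} on a 4-wide core, fine grained MT fills only 2 slots in the cycle, while SMT fills all 4; this is exactly the horizontal waste SMT was proposed to remove.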
3. Thread scheduling (10)
(The figure plots dispatch/issue slots against clock cycles: slots within the same cycle are filled with instructions from Thread1-Thread4.)
SMT cores
• Scalar based: (none)
• Superscalar based:
Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
DEC 21464 (2003), single-core/4T (cancelled in 2001)
IBM POWER5 (2005), dual-core/2T
• VLIW based: (none)
3. Thread scheduling (11)
4. Case examples
4.1. Coarse grained multithreading
4.2. Fine grained multithreading
4.3. SMT multithreading
4.1 Coarse grained multithreaded processors
4.1.1. IBM RS64 IV
4.1.2. SUN MAJC 5200
4.1.3. Intel Montecito
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
4.1. Coarse grained multithreaded processors
Optimized for commercial server workloads, such as on-line transaction processing, Web serving and ERP (Enterprise Resource Planning).
Used in IBM’s iSeries and pSeries commercial servers.
Characteristics of server workloads:
• large working sets,
• poor locality of references and
• frequently occurring task switches,
resulting in high cache miss rates; memory bandwidth and latency strongly limit performance. Consequences: need for high instruction and data fetch bandwidth, need for large L1 $s, and using multithreading to hide memory latency.
Microarchitecture
4-way superscalar, dual-threaded.
4.1.1. IBM RS 64 IV (1)
Main microarchitectural features of the RS64 IV to support commercial workloads:
• 128 KB L1 D$ and L1 I$,
• instruction fetch width: 8 instr./cycle,
• dual-threaded core.
4.1.1. IBM RS 64 IV (2)
Figure 4.1.1: Microarchitecture of IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
6XX bus. IERAT: effective-to-real address translation cache (2x64 entries).
4.1.1. IBM RS 64 IV (3)
Multithreading policy (strongly simplified)
Coarse grained MT with two Ts: a foreground T and a background T. The foreground T executes until a long latency event, such as a cache miss or an IERAT miss, occurs. Subsequently, a T switch is performed and the background T begins to execute. After the long latency event is serviced, a T switch occurs back to the foreground T.
Implementation of multithreading
Additional die area needed for multithreading: ~ +5 %.
Dual architectural states are maintained for:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode registers, such as the MSR (machine state reg.),
• status and control registers, such as the T priority.
Each T executes in its own effective address space (an unusual feature of multithreaded cores), so units used for address translation need to be duplicated, such as the SRs (Segment Address Regs).
Both single threaded and multithreaded modes of execution are supported.
The Thread Switch Buffer holds up to 8 instructions from the background T, to shorten context switching by eliminating the latency of the I$.
Threads can be allocated different priorities by explicit instructions.
4.1.1. IBM RS 64 IV (4)
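The switch-on-miss policy described above can be put into a toy model (illustrative numbers only; it assumes the switch away and the switch back each cost the 8-cycle penalty shown in Figure 4.1.2):

```c
/* Toy model of the gain from coarse grained, switch-on-miss multithreading:
 * while the foreground thread waits miss_latency cycles for memory, the
 * background thread can do useful work in all but 2*penalty of them
 * (one switch away, one switch back). */

int useful_cycles_single(int compute, int miss_latency)
{
    (void)miss_latency;   /* miss cycles are pure stall in the single-T case */
    return compute;
}

int useful_cycles_dual(int compute, int miss_latency, int penalty)
{
    int hidden = miss_latency - 2 * penalty;   /* work done by the background T */
    if (hidden < 0) hidden = 0;                /* penalty can eat the whole miss */
    return compute + hidden;
}
```

With 20 compute cycles per 75-cycle miss window and an 8-cycle switch penalty, dual threading raises the useful cycles per window from 20 to 79; note that the gain vanishes if the penalty approaches half the miss latency, which is why the Thread Switch Buffer matters.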
Figure 4.1.2: Thread switch on data cache miss in IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
4.1.1. IBM RS 64 IV (5)
(Thread switch penalty: 8 cycles; execution rate: 2 instructions/cycle.)
Aim:
Dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture:
Each processor is a 4-wide VLIW and can be 4-way multithreaded.
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.),
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
4.1.2. SUN MAJC 5200 (1)
Figure 4.1.3: General view of SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (2)
Figure 4.1.4: The principle of private, unified register files associated with each FU
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (3)
Threading
Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun): each T is executed by one of the 4 FUs.
Thread switch
Following a cache miss, the processor saves the T state and begins to process the next T.
Example
Comparison of program execution without and with multithreading on a 4-wide VLIW.
Considered program:
• it consists of 100 instructions,
• 2.5 instrs./cycle are executed on average,
• a cache miss occurs after every 20 instructions,
• latency of serving a cache miss: 75 cycles.
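The example’s single threaded numbers can be checked with a short sketch:

```c
/* Single threaded execution time of the considered program:
 * 100 instructions at 2.5 instr./cycle, with a 75-cycle cache miss
 * after every 20 instructions. */
int single_thread_cycles(int instrs, double ipc, int miss_interval, int miss_latency)
{
    int compute = (int)(instrs / ipc);        /* 100/2.5 = 40 compute cycles */
    int misses  = instrs / miss_interval;     /* 100/20  = 5 misses          */
    return compute + misses * miss_latency;   /* 40 + 5*75 = 415 cycles      */
}
```

So only 40 of the 415 cycles (under 10 %) do useful work; vertical multithreading aims to fill the 375 stall cycles with the other threads, as the following two figures contrast.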
4.1.2. SUN MAJC 5200 (4)
Figure 4.1.5: Execution for subsequent cache misses in a single threaded processor
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (5)
Figure 4.1.6: Execution for subsequent cache misses in SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (6)
Aim:
High end servers.
Main enhancements of Montecito over Itanium 2:
• split L2 caches for data and instructions,
• larger unified L3 cache (for each core),
• duplicated architectural states maintained for the FX/FP registers, branch and predicate registers and the next address register,
• (Foxton technology for power management/frequency boost, planned but not implemented).
Additional support for dual-threading (duplicated microarchitectural states):
• the branch prediction structures provide T tagging,
• per-thread return address stacks,
• per-thread ALATs (Advanced Load Address Table).
Additional core area needed for multithreading: ~ 2 %.
4.1.3. Intel Montecito (1)
Figure 4.1.7: Microarchitecture of Intel’s Itanium 2
Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55
4.1.3. Intel Montecito (2)
Figure 4.1.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table)
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.1.3. Intel Montecito (3)
Thread switches
Five event types cause thread switches, such as L3 cache misses and programmed switch hints. If the control logic detects that a thread does not make progress, a thread switch is also initiated.
Total switch penalty: 15 cycles.
Example of thread switching
4.1.3. Intel Montecito (4)
Figure 4.1.9: Thread switch in Intel’s Montecito vs single thread execution
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.1.3. Intel Montecito (5)
4.2 Fine grained multithreaded processors
4.2.1. SUN Ultrasparc T1
4.2.2. PPE of Cell BE
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
4.2. Fine grained multithreaded processors
Aim
Commercial server applications, such as
• web serving,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).
Characteristics of commercial server applications
• large working sets,
• poor locality of memory references,
resulting in high cache miss rates and low prediction accuracy for data dependent branches. Memory latency strongly limits performance, hence multithreading is used to hide it.
4.2.1. SUN UltraSPARC T1 (1)
4.2.1. SUN UltraSPARC T1 (2)
Figure 4.2.1: Block diagram of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (3)
Structure
• 8 scalar cores, 4-way multithreaded each.
• All 32 threads share an L2 cache of 3 MB, built up of 4 banks.
• 4 memory channels with on-chip DDR2 memory controllers.
It runs under Solaris.
4.2.1. SUN UltraSPARC T1 (4)
Figure 4.2.2: SUN’s UltraSPARC T1 chip
Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf
4.2.1. SUN UltraSPARC T1 (5)
Processor Elements (SPARC pipes):
• Scalar FX-units, 6-stage pipeline• all Processor Elements share a single FP-unit
4.2.1. SUN UltraSPARC T1 (6)
Figure 4.2.3: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (7)
4.2.1. SUN UltraSPARC T1 (8)
Figure 4.2.4: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (9)
Processor Elements (SPARC pipes):
• scalar FX-units, 6-stage pipeline,
• all Processor Elements share a single FP-unit.
Each thread of a processor element has its private:
• PC-logic,
• register file,
• instruction buffer,
• store buffer.
No thread switch penalty!
4.2.1. SUN UltraSPARC T1 (10)
Thread switch:
Threads are switched on a per cycle basis.
Selection of threads:
In the thread select pipeline stage
• the thread select multiplexer selects a thread from the set of available threads in each clock cycle and issues the subsequent instr. of this thread from the instruction buffer into the pipeline for execution, and
• fetches the following instr. of the same thread into the instruction buffer.
4.2.1. SUN UltraSPARC T1 (11)
Figure 4.2.5: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (12)
Threads become unavailable due to:
• long-latency instructions, such as loads, branches, multiplies and divides,
• pipeline stalls because of cache misses, traps or resource conflicts.
Thread selection policy: least recently used.
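The least-recently-used selection can be sketched as follows (a toy model of the thread-select stage; the real logic also handles the speculation shown in the next examples):

```c
/* Least-recently-used thread select among the available threads.
 * last_used[i] is the cycle in which thread i was last selected;
 * available[i] is nonzero if thread i is not blocked on a long-latency
 * instruction or a stall. Returns the selected thread, or -1 if none. */
int select_thread(const int available[], const int last_used[], int nthreads)
{
    int pick = -1;
    for (int i = 0; i < nthreads; i++) {
        if (!available[i]) continue;
        if (pick < 0 || last_used[i] < last_used[pick])
            pick = i;   /* prefer the least recently selected thread */
    }
    return pick;
}
```

With all threads available and equal history, this degenerates to round robin, which matches the all-threads-available example that follows.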
1. Example:
• all 4 threads are available.
4.2.1. SUN UltraSPARC T1 (13)
Figure 4.2.6: Thread switch in the SUN’s UltraSPARC T1 when all threads are available
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (14)
2. Example:
• only 2 threads are available,
• speculative execution of instructions following a load.
(Data referenced by a load instruction arrives in the third cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the two subsequent cycles.)
4.2.1. SUN UltraSPARC T1 (15)
Figure 4.2.7: Thread switch in the SUN’s UltraSPARC T1 when only two threads are available
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
(Thread t0 issues a ld instruction and becomes unavailable for two cycles. The add instruction from thread t0 is speculatively switched into the pipeline, assuming a cache hit.)
4.2.1. SUN UltraSPARC T1 (16)
4.2.2. Cell BE
Overview of the Cell BE
Processor components
Multithreading the PPE
Programming models
Implementation of the Cell BE
Cell BE
A collaborative effort from Sony, IBM and Toshiba.
Objective: speeding up game/multimedia apps.
Goal: 100 times the PS2 performance.
Used in the PlayStation 3 (PS3) and in the QS20 Blade Server.
History
Summer 2000: High level architectural discussions
End 2000: Architectural concept
March 2001: Design Center opened in Austin, TX
Spring 2004: Single Cell BE operational
Summer 2004: 2-way SMP operational
Febr. 2005: First technical disclosures
Oct. 2005: Mercury announces Cell Blade
Nov. 2005: Open Source SDK & Simulator published
Febr. 2006: IBM announced Cell Blade QS20
Cell BE at NIK
May 2007: QS20 arrives at NIK within IBM’s loan program
Overview of the Cell BE (1)
• 9 cores:
the PPE (Power Processing Element), a dual threaded, dual issue 64-bit PowerPC compliant processor, and
8 SPEs (Synergistic Processing Elements), single threaded, dual issue 128-bit SIMD processors,
• the EIB (Element Interconnect Bus), an on-chip interconnection network,
• the MIC (Memory Interface Controller), a memory controller supporting dual Rambus XDR channels, and
• the BIC (Bus Interface Controller), which interfaces the Rambus Flex IO bus.
Overview of the Cell BE (2)
Main functional units of the Cell BE
EIB: Element Interconnect Bus
Figure 4.2.8: Block diagram of the Cell BE [4.2.2.1]
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store of 256 KB
SMF: Synergistic Mem. Flow Unit
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Contr.
BIC: Bus Interface Contr.
XDR: Rambus DRAM
Overview of the Cell BE (3)
Unique features of the Cell BE
a) Heterogeneous MCP rather than a symmetrical MCP (as usual implementations)
The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.
The SPEs
• are optimized to run compute intensive SIMD apps.,
• usually operate under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory, including the memory mapped I/O-space,
• can be programmed in C/C++.
Contrasting the PPE and the SPEs
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute intensive tasks and slower at task switching.
Overview of the Cell BE (4)
b) The SPEs have an unusual storage architecture, as
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
o they fetch instructions from their private LS and
o their Load/Store instructions access their LS rather than the main store,
• SPEs access main memory (effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS; DMA commands can be batched (up to 16 commands),
• the LS has no associated cache.
Overview of the Cell BE (5)
Although the PPE and the SPEs have coherent access to main memory, the Cell BE is not a traditional shared-memory multiprocessor, as the SPEs operate on their LS rather than on the main memory.
Overview of the Cell BE (6)
PPE (Power Processing Element) [4.2.2.2]
• Fully compliant 64-bit Power processor (Architecture Specification 2.02).
• fc = 3.2 GHz (11 FO4 design, 23 pipeline stages).
• Dual-issue, in-order superscalar, two-way (fine grained) multithreaded core.
• Conventional cache architecture: 32 KB I$, 32 KB D$, 512 KB unified L2.
Processor components of the Cell BE (1)
Figure 4.2.9: Main functional units of the PPE [4.2.2.3]
Processor components of the Cell BE (2)
Main components of the PPE
• IU (Instruction Unit)
predecodes instructions while loading them from the L2 cache into the L1 cache,
fetches 4 instructions per cycle, alternating between the two threads, from the L1 instr. cache into two instruction buffers (one for each thread),
dispatches instructions from the two instruction buffers to the shared decode, dependency checking and issue pipeline according to the thread scheduling rules.
• Microcode Engine
Instructions that are either difficult to implement in hardware or rarely used are split into a few simple PowerPC instructions, which are stored in a ROM (such as Load String or several Condition Register (CR) instructions).
Most microcoded instructions are split into two or three microcoded instructions.
The Microcode Engine inserts microcoded instructions from one thread into the instruction flow with a delay of 11 clock cycles. It stalls dispatching from the instruction buffers until the last microcode of the microcoded instruction is dispatched. The next dispatch cycle belongs to the thread that did not invoke the Microcode Engine.
• Shared decode, dependency checking and issue pipeline
Receives dispatched instructions (up to two in each cycle, from the same thread), decodes them, checks for dependencies, and issues instructions for execution according to the issue rules.
Processor components of the Cell BE (3)
• XU (FX Execution Unit)
32x64-bit register file per thread.
FXU (FX Unit)
LSU (L/S Unit)
BRU (Branch Unit): per-thread branch prediction (6-bit global history, 4 K x 2-bit history table)
• VSU (Vector Scalar Unit)
FPU (FP Unit):
o 32x64-bit register file per thread,
o 10-stage double precision pipeline.
VMX (Vector-Media Execution Unit), also called the VXU (Vector Execution Unit):
o 32x128-bit vector register file per thread,
o simple, complex, permute and single-precision FP subunits,
o 128-bit SIMD instructions with varying data width (2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit, 128x1-bit).
VMX/FPU issue queue, also called the VSU (Vector-Scalar Unit) issue queue (two entries).
Processor components of the Cell BE (4)
Basic operation of the PPE
Instruction fetch
• Instruction fetch operates autonomously in order to keep each thread’s instruction buffer full with useful instructions that are likely to be needed.
• 4 instr./cycle are fetched, strictly alternating between the two threads, from the L1 I$ into the private instruction buffers of the threads.
• The fetch address is taken from the Instruction Fetch Address Registers associated with each thread (IFAR0, IFAR1). The IFARs are distinct from the Program Counters (PC) associated with both threads; the PCs track the actual program flow while the IFARs track the predicted instruction execution flow.
• Accessing the taken path after a predicted-taken branch requires 8 cycles.
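The strictly alternating fetch policy can be sketched as a toy model (illustrative only; it ignores buffer limits, branches and stalls):

```c
/* Strictly alternating fetch between two threads, 4 instructions per fetch
 * cycle, starting with thread 0. Returns how many instructions thread `t`
 * (0 or 1) has accumulated in its instruction buffer after `cycles` cycles. */
int fetched_for_thread(int t, int cycles)
{
    int turns = cycles / 2 + ((cycles % 2) && t == 0 ? 1 : 0);
    return 4 * turns;   /* 4 instructions per fetch cycle */
}
```

After 5 cycles, thread 0 has had 3 fetch turns (12 instructions buffered) and thread 1 has had 2 (8 instructions): the alternation gives each thread half the fetch bandwidth regardless of its progress.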
Processor components of the Cell BE (5)
Instruction dispatch
• Moves up to two instructions either from one of the Instruction Buffers or the Microcode Engine (complex instructions) to the shared decode, dependency check and issue pipeline.
• Instruction dispatch is governed by the dispatch rules (thread scheduling rules).
• The dispatch rules take into account thread priority and stall conditions(see slide 115).
• Each pipeline stage beyond the dispath point contains instructions from one thread only.
Instruction decode and dependency checking
Decoding of up to two instructions from the same tread in each cycleand checking for dependencies
Processor components of the Cell BE (6)
Figure 4.2.10: Instruction flow in the PPE [4.2.2.4]
IU: Instruction Unit
VSU: Vector Scalar Unit
XU: FX Execution Unit
VXU: Vector Execution Unit
FPU: FP Execution Unit
BRU: Branch Unit
FXU: FX Execution Unit
LSU: L/S Execution Unit
(IFAR: Instr. Fetch Addr.)
ibuf: Instr. Buffer
IC: Instruction cache
IB: Instruction buffer
ID: Instruction decode
IS: Instruction issue
Processor components of the Cell BE (7)
Instruction issue at the pipeline stage IS2
• Up to two PowerPC or vector/SIMD multimedia extension instructions per cycle are forwarded from the IS2 pipeline stage for execution to
the VSU (VMX/FPU) issue queue (up to two instr./cycle) or
the BRU, LSU and FXU execution units (up to one instr./cycle per execution unit).
• Any issue combinations are allowed, except two instructions to the same unit, with a few restrictions (see Figure 4.2.11 for the valid issue combinations). Note that valid resp. invalid issue combinations result from the underlying microarchitecture, as shown in Figure 4.2.13.
• Instructions are issued in each cycle from the same thread.
• Instruction issue can be stalled at the IS2 pipeline stage for various reasons, like invalid issue combinations or a full VSU issue queue.
Instruction issue from the VSU (VMX/FPU) issue queue
Up to two VMX or FPU instructions are forwarded to the respective execution units. Note that instructions kept in the issue queue are already prearranged for execution, i.e. they obey the issue restrictions summarized in Figure 4.2.11.
Processor components of the Cell BE (8)
Figure 4.2.11: Valid issue combinations (designated as pink squares) [4.2.2.4]
Type 1 instructions: VXU simple, VXU complex, VXU FP and FPU arithmetic instructions,
Type 2 instructions: VXU load, VXU store, VXU permute, FPU load and FPU store instructions.
Processor components of the Cell BE (9)
Figure 4.2.12: Pipeline stages of the PPE [4.2.2.3]
Processor components of the Cell BE (10)
• Four 16 byte data rings, supporting multiple transfers
• 96B/cycle peak bandwidth
• Over 100 outstanding requests
• 300+ GByte/sec @ 3.2 GHz
EIB data ring for internal communication [4.2.2.2]
Processor components of the Cell BE (11)
SPE [4.2.2.2]
SPE
Main Components
a) SPU (Synergistic Processing Unit)
b) MFC (Memory Flow Controller)
c) LS (Local Store)
d) AUC (Atomic Unit and Cache)
• SPEs are not intended to run an OS
• SPEs optimized for data-rich operation
• are allocated by the PPE
Processor components of the Cell BE (12)
a) SPU
• Dual-issue superscalar RISC core supporting basically a 128-bit SIMD ISA.
• The SIMD ISA provides FX, FP and logical operations on 2x64-bit, 4x32-bit, 8x16-bit,
16x8-bit and 128x1-bit data.
• In connection with the MFC the SPU also supports a set of commands for
- performing DMA transfers,
- interprocessor messaging and
- synchronization.
• The SPU executes instructions from the LS (256 KB),
• Instructions reference data from the 128x128-bit unified register file,
• The Register file fetches/delivers data from/to the LS by L/S instructions,
• The SPU moves instructions and data between the main memory and the local store by requesting DMA transfers from its MFC. (Up to 16 outstanding DMA requests allowed).
Processor components of the Cell BE (13)
SPU, LS
MFC
Even pipe Odd pipe
Figure 4.2.13: Block diagram of the SPU [4.2.2.3]
Processor components of the Cell BE (14)
• Instruction issue unit – instruction line buffer
Fetches 32 instructions per LS request from the LS into the Instruction line buffer.
Main components of the SPU
• Register file Unified Register file of 128 registers each 128-bit wide.
• Result forwarding and staging
Instructions are staged in an operand staging network for up to 6 additional cycles, so that all execution units write their results into the Register file in the same pipeline stage. (See Figure ccc.)
Instruction fetching is supported by hardware prefetching. Prefetching requires 15 cycles to fill the instruction line buffer. Fetched instructions are decoded and issued (up to two instructions per cycle) according to the issue rules.
Processor components of the Cell BE (15)
The odd pipeline includes
o the Channel unit,
o the Branch unit,
o the Load/Store unit and
o the Permute unit.
• Execution units
Execution units are organised into two pipelines.
The even pipeline includes
o the Fixed-point unit and
o the Floating-point unit.
Processor components of the Cell BE (16)
Instruction issue
• The SPU issues up to two instructions per cycle from a two-instruction wide issue window, called the fetch group.
• Fetch groups are aligned to doubleword boundaries, i.e. the first instruction is at an even and the second one at an odd word address. (Words are 4 bytes long.)
• An instruction becomes issueable when no register dependencies or resource conflicts, e.g. busy execution units, exist.
• Instructions are issued in program order, that is
- if the first instruction of a fetch group can be issued to the even pipeline and the second instruction to the odd pipeline, both instructions are issued in the same cycle,
- in all other cases instruction issue needs two cycles such that instructions are issued in program order to the pertaining pipeline (see Figure 4.2.14).
• Register or resource conflicts stall instruction issue.
• A new fetch group is loaded after both instructions of the current fetch group are issued.
Basic operation of the SPU
Processor components of the Cell BE (17)
Figure 4.2.14: Instruction issue example [4.2.2.4]
(Assuming that instruction issue is not constrained by register or resource conflicts)
Processor components of the Cell BE (18)
Figure 4.2.15: The channel interface between the SPU and the MFC [4.2.2.4]
SPU channels
• An SPU communicates with its associated MFC as well as (via its MFC) with the PPE,
other SPEs and devices (such as a decrementer) through its channels.
MMIO: Memory-Mapped I/O Registers
SLC: SPU Load and Store Unit
SSC: SPU Channel and DMA Unit
Processor components of the Cell BE (19)
• SPU channels are unidirectional interfaces for
sending commands (such as DMA commands) to the MFC owned by the SPU, or
sending/receiving up to 32-bit long messages between the SPU and the PPE or other SPEs.
• Each channel has
- a corresponding capacity (maximum message entries) and
- count (remaining available message entries).
The channel count
- decrements whenever a channel instruction (rdch or wrch) is issued, and
- increments whenever an operation associated with the channel completes.
The channel count of „0” means
- empty for read only channels and
- full for write only channels.
• SPU channels are implemented in and managed by the MFC.
Processor components of the Cell BE (20)
Figure 4.2.16: Assembler instruction mnemonics and their corresponding C-language intrinsics of the channel instructions available for the SPU [4.2.2.4]
• The SPU can read or write its channels by three instructions:
the read channel (rdch),
write channel (wrch) and
read channel count (rchcnt)
instructions.
(Intrinsics represent in-line assembly code segments in the form of C-language function calls).
Processor components of the Cell BE (21)
• The channel instructions, or the DMA commands invoked by channel instructions, are enqueued for execution in the MFC for purposes like
initiating DMA transfers between the SPE’s LS and the main storage,
querying DMA and SPU status,
sending or receiving up to 32-bit long mailbox messages primarily between
the SPU and the PPE or
sending or receiving up to 32-bit long signal-notification messages
between the SPU and the PPE or other SPEs.
• The PPE and other devices in the system including other SPEs, can also access the channels through the MFC’s memory mapped I/O (MMIO) registers and queues, which are visible to software in the main storage space.
Processor components of the Cell BE (22)
Figure 4.2.17: SPE channels and associated MMIO registers (1) [4.2.2.4]
Processor components of the Cell BE (23)
Figure 4.2.18: SPE channels
and associated MMIO registers (2) [4.2.2.4]
Processor components of the Cell BE (24)
Figure 4.2.19: Pipeline stages of the SPUs [4.2.2.1]
Processor components of the Cell BE (25)
Processor components of the Cell BE (26)
b) Memory Flow Controller (MFC) [4.2.2.2]
• acts as a specialized co-processor for its associated SPU by autonomously executing its own command set and
• serves as the SPU’s interface, via the EIB to main storage and other processor elements, such as other SPEs or system devices.
Figure 4.2.20: Block diagram of the MFC [4.2.2.4]
MMIO: Memory-Mapped I/O Registers
SLC: SPU Load and Store Unit
SSC: SPU Channel and DMA Unit
Processor components of the Cell BE (27)
The MFC as a specialized co-processor
• DMA commands
• DMA List commands and
• synchronization commands.
• DMA commands (put, get)
• can be initiated by both the PPE and the SPU,
• move up to 16 KByte of data between the LS and the main storage,
• supports transfer sizes of 1, 2, 4, 8, 16 bytes and multiples of 16 bytes,
• access main store by using main storage effective addresses,
• can be tagged with a 5-bit tag (tag group ID) to allow special handling within the tag group, such as to enforce ordering of the DMA commands.
• DMA list commands (put, get commands with the command modifier l)
• can be initiated only by the SPU,
• consist of up to 2 K 8-byte long list elements,
• each list element specifies a DMA transfer
• used to move data between a contiguous area in the LS and a possibly noncontiguous area in the effective address space, implementing scatter-gather functions between main storage and the LS.
It executes three types of commands
Processor components of the Cell BE (28)
• Synchronization commands
• used basically to control the order of storage accesses,
• include atomic commands (a form of semaphores), send signal commands and barrier commands.
• The MFC maintains two separate command queues
- the 16-entry SPU command queue for commands from the SPU associated with the MFC, and
- the 8-entry proxy command queue for commands from the PPE, other SPEs and devices.
Operation of the MFC
• The MFC supports out-of-order execution of DMA commands.
Processor components of the Cell BE (29)
• supports storage protection on the main storage side while performing DMA transfers,
• maintains synchronization between main storage and the LS,
• performs intercore communication functions, such as mailbox and signal-notification messaging with the PPE, other SPEs and devices.
The MFC as the interface between the SPU and the main storage, the PPE and other devices
Processor components of the Cell BE (30)
Intercore communication tools of the MFC
• three mailboxes, primarily intended for holding up to 32-bit long messages from/to the SPE:
- one four-deep mailbox for receiving mailbox messages and
- two one-deep mailboxes for sending mailbox messages.
• two signal notification channels for receiving signals sent basically by the PPE.
Processor components of the Cell BE (31)
Figure 4.2.21: Contrasting mailboxes and signals [4.2.2.4]
Processor components of the Cell BE (32)
c) Local Store [4.2.2.2]
SPE
Processor components of the Cell BE (33)
• Single-port SRAM cell.
• Executes DMA reads/writes and instruction prefetches via 128-Byte wide read/write ports
• Executes instruction fetches and load/stores via 128-bit read/write ports.
• Asynchronous, coherent DMA commands are used to move instructions and data between the local store and system memory.
• DMA transfers between the LS and the main storage are executed by the MFC’s DMA unit
• A 128-Byte long DMA read or write requires 16 processor cycles to forward data on the EIB.
d) The Atomic Update and Cache unit [4.2.2.2]
SPE
Processor components of the Cell BE (34)
The Atomic Unit
• executes atomic operations (a form of mutual-exclusion (mutex) operations) invoked by the MFC,
• supports Page Table lookups and
• maintains cache coherency
by supporting snoop operations.
The Atomic Cache
six 128-byte cache lines of data to support atomic operations and Page Table accesses.
Broadband Interface Controller (BIC) [4.2.2.2]
• Provides a wide connection to external devices
• Two configurable interfaces (50+GB/s @ 5Gbps)
Configurable number of bytes
Coherent (BIF) and/or I/O (IOIFx) protocols
• Supports two virtual channels per interface
• Supports multiple system configurations
Memory Interface Controller (MIC)
• Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
• ECC support
• Suspend to DRAM support
Processor components of the Cell BE (35)
Thread scheduling
depends both on
• thread states
• thread priorities
• single threaded or dual threaded mode of execution
Multithreading the PPE (1)
Scheduling of PPE threads
1. Thread states
• Privilege states
• Suspended/enabled state
• Blocked/not blocked state
a) Privilege States
• Hypervisor state
• Supervisor state
• Problem state (user state)
• most privileged
• allows running a meta-OS that manages logical partitions in which multiple OS instances can run
• some system operations require the initiating thread to be in hypervisor state
• is the state in which an OS instance is intended to run
• is the state in which an application is intended to run
Multithreading the PPE (2)
Figure 4.2.22: Bits of the Machine State Register (MSR) defining the privilege state of a thread [4.2.2.4]
(HV: Hypervisor, PR: Problem)
Multithreading the PPE (3)
b) Suspended/enabled State
• a thread in the hypervisor state can change its state from enabled to suspended.
• Two bits of the Control Register (CTRL[TE0], [TE1]) define whether a thread is in the suspended or enabled state.
c) Blocked/stalled State
• Blocking
- occurs at the instruction dispatch stage if the thread selection rule favours the other thread, or due to a special „nop” instruction,
- stops only one of the two threads.
• Stalling
- occurs at the instruction issue stage due to dependencies
- stops both threads.
- for very long latency conditions, such as L1 cache misses or divide instructions, stalling both threads is avoided by
-- flushing the instructions younger than the stalled instruction,
-- refetching the instructions starting with the stalled instruction, and
-- stalling the thread at the dispatch stage until the stall condition is removed, while the other thread can continue to dispatch.
Multithreading the PPE (4)
2. Thread priorities
• determines dispatch priority
• four priority levels:
thread disabled
low priority
medium priority
high priority
• priority levels are specified by a 2-bit field (TP field) of the TSRL register (Thread Status Register Local)
• Software, in particular OS software, sets thread priorities (according to the throughput requirements of the programs running in the threads.) E.g. a foreground/background thread priority scheme can be set, to favor one thread over the other when allocating instruction dispatch slots.
• A thread must be in the hypervisor or supervisor state to set its priority to high.
Multithreading the PPE (5)
Figure 4.2.23: Usual thread priority combinations [4.2.2.4]
The combination high priority thread/low priority thread is not expected to be used, as in this case the PPE would never dispatch instructions from the low priority thread unless the high priority thread was unable to dispatch.
Usual thread priority combinations
Multithreading the PPE (6)
Figure 4.2.24: Thread scheduling when both priorities are set to medium [4.2.2.4]
• The PPE attempts to utilize all available dispatch slots.
• Thread scheduling is fair (round robin scheduling).
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Note:The same scheduling applies when both threads are set to high priority.
Basic scheduling rules
Example (1): Scheduling in case of the medium priority/medium priority setting
Multithreading the PPE (7)
• The PPE attempts to utilize most available dispatch slots for the medium priority thread (this setting is appropriate to run a low-priority program in the background).
• Assuming a duty cycle of 5 (TSRL[DISP_COUNT] = 5), instructions from thread 1 are dispatched on four out of five cycles, while instructions from thread 0 are dispatched only on one out of five cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Basic scheduling rules
Figure 4.2.25: Thread scheduling when one thread runs at medium priority while the other at low priority [4.2.2.4]
Example (2): Scheduling in case of the low priority/medium priority setting
Multithreading the PPE (8)
Basic scheduling rules
• Assuming a duty cycle of 5, both threads are scheduled only once every 5 cycles.
• The PPE attempts to dispatch only once every duty cycle (TSCR[DISP_COUNT]) cycles. (With high values of DISP_COUNT the PPE will mostly idle, which will reduce power consumption and heat production while keeping both threads alive.)
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
• Thread scheduling is fair (round robin scheduling)
Figure 4.2.26: Thread scheduling when both priorities are set to low [4.2.2.4]
Example (3): Scheduling in case of the low priority/low priority setting
Multithreading the PPE (9)
3. Single threaded/dual threaded mode of execution
• In single threaded mode all resources are allocated to a single thread, which reduces the turnaround time of the thread.
• Software can change the operating mode of the PPE between single threaded and dual threaded mode only in the hypervisor state.
Multithreading the PPE (10)
Software controlled thread behaviour
Software can use various schemes to control thread behaviour, including
• enabling and suspending a thread,
• setting thread priorities to control the instruction dispatch policy,
• executing a special nop instruction to cause temporary dispatch blocking,
• switching the state of the PPE between single threaded and multithreaded mode.
Multithreading the PPE (11)
• Duplicated architectural states for
32 GPRs
32 FPRs
32 Vector Registers (VRs)
Condition Register (CR)
Count Register (CTR)
Link Register (LR)
FX Exception Register (XER)
FP Status and Control Register (FPSCR)
Vector Status and Control Register (VSCR)
Decrementer (DEC)
Multithreading the PPE (12)
Core enhancements for multithreading
• Duplicated microarchitectural states for
Branch History Table (BHT) with global branch history
(to allow independent and simultaneous branch prediction for both threads)
Internal registers associated with exceptions and interrupt handling, such as
Machine State Register (MSR)
Machine Status Save/Restore Registers (SRR0, SRR1)
Hypervisor Machine Status Save/Restore Registers (HSRR0, HSRR1)
FP Status and Control Register (FPSCR) etc.
(to allow concurrent exception and interrupt handling)
Multithreading the PPE (13)
• Duplicated queues and arrays
Segment lookaside buffer (SLB)
Instruction buffer queue (Ibuf) (to allow each thread to dispatch regardless of any dispatch stall in the other thread)
Link stack queue
The instruction fetch control (because the I$ has only one read port and so fetching must alternate between threads every cycle).
• Shared resources
Hardware execution units
Virtual memory mapping (as both threads always execute in the same logical partitioning context)
Most large arrays and queues, such as caches, that consume a significant amount of chip area
Multithreading the PPE (14)
• Application specific SPU accelerators,
• Multi-stage pipeline SPU configuration or
• Parallel-stages SPU configuration
Basic SPU configurations
assumes the choice of an appropriate SPU configuration
Multithreading the PPE (15)
The programming model
Application specific SPU accelerators [4.2.2.5]
Multithreading the PPE (16)
Multi-stage SPU pipeline configuration [4.2.2.5]
Multithreading the PPE (17)
Parallel-stages SPU configuration [4.2.2.5]
Programming models (1)
• Programmer writes/uses SPU „libraries” either for
Application specific SPU accelerators,
Multi-stage pipeline SPU configuration or
Parallel-stages SPU configuration.
Basic approach for creating an application
• Programmer chooses the appropriate SPU configuration according to the features of an application, such as
Graphics processing
Audio processing
MPEG Encoding/Decoding
Encryption/Decryption
• Main application in PPE, invokes SPU bound services by
creating SPU threads,
RPC-like function calls,
I/O device-like interfaces (FIFO/command queue)
• One or more SPUs cooperate in the presumed SPU configuration to execute the required tasks.
Programming models (2)
• Acceleration
provided by OS or application libraries
• Application portability
maintained with platform specific libraries
Programming models (3)
Programming models (4)
Aim
• showing the cooperation between PPE and SPE
Program
• Actual goal: To calculate distance travelled in a car
• It asks for: elapsed time and speed
Program structure
• There are two program codes, one for the PPE and one for the SPE.
• The PPE handles the user input, then it calls the SPE executable, which calculates the distance and returns with the result.
• The result is then given to the user by the PPE.
Example
PPE Main Store
SPE 1
SPE n
notifying SPE of work to be done
(create_spu_thread)
Loading program
and data to main store
Local Store 1
Local Store n
copying data from MS to LS
(mfc_get)
Accessing data
Example
Programming models (5)
PPE Main Store
SPE 1
SPE n
Execution of the SPEthread
Local Store 1
Local Store n
Programming models (6)
PPE Main Store
SPE 1
SPE n
SPE notifies PPE„job is finished”
by sending a message
Loading results
from main store
Local Store 1
Local Store n
copying data from LS to MS
(mfc_put)
Updating
results
Programming models (7)
#include <stdio.h>
#include <libspe.h>

extern spe_program_handle_t calculate_distance_handle;

typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main() {
    program_data pd __attribute__((aligned(16)));  //aligned for transfer

    printf("Enter the speed in miles/hr: ");
    scanf("%f", &pd.speed);
    printf("Enter the number of hours you have been driving: ");
    scanf("%f", &pd.num_hours);

    speid_t spe_id = spe_create_thread(0, &calculate_distance_handle, &pd,
                                       NULL, -1, 0);
    spe_wait(spe_id, NULL, 0);

    printf("The distance travelled is %f miles.\n", pd.distance);
    return 0;
}
External SPE program (next slide)
Define the data structure passed to the SPE task
Data input
Create the thread and wait for it to finish
Data output
Programming models (8)
#include <spu_mfcio.h>

typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main(unsigned long long spe_id, unsigned long long program_data_ea,
         unsigned long long env) {
    program_data pd __attribute__((aligned(16)));
    int tag_id = 0;

    mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();

    pd.distance = pd.speed * pd.num_hours;

    mfc_put(&pd, program_data_ea, sizeof(program_data), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();
    return 0;
}
Define the data structure to communicate with the SPE
Copy data from MS to LS; wait for completion
Calculate the result
Copy data from LS to MS; wait for completion
Programming models (9)
Implementation of the Cell BE (1)
Figure 4.2.27: Cell system configuration options [4.2.2.3]
Implementation alternatives
Source: Brochard L., „A Cell History,” Cell Workshop, April 2006, http://www.irisa.fr/orap/Constructeurs/Cell/Cell%20Short%20Intro%20Luigi.pdf
Figure: Cell BE Blade Roadmap
Implementation of the Cell BE (2)
Implementation of the Cell BE (3)
Figure 4.2.28: Motherboard of the Cell Blade (QS20) [4.2.2.5]
Motherboard of the Cell Blade (QS20)
Source: Hofstee H. P., „Real-time Supercomputing and Technology for Games and Entertainment,” 2006, http://www.cercs.gatech.edu/docs/SC06_Cell_111606.pdf
Figure: Roadmap of the Cell BE
Cell BE roadmap (1)
References
Cell BE
[4.2.2.1] Gschwind M., „Chip Multiprocessing and the Cell BE,” ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[4.2.2.2] Hofstee P., „Tutorial: Hardware and Software Architectures for the CELL BROADBAND ENGINE processor”, IBM Corp., September 2005http://www.crest.gatech.edu/conferences/cases2005/pdf/Cell-tutorial.pdf
[4.2.2.3] Kahle J.A., „Introduction to the Cell multiprocessor”, IBM J. Res & Dev Vol. 49, 2005, pp. 584-604 http://www.research.ibm.com/journal/rd/494/kahle.pdf
[4.2.2.4]: Cell Broadband Engine Programming Handbook Vers. 1.1, Apr. 2007, IBM Corp.
[4.2.2.5] Cell BE Overview, Course code: L1T1H1-02, May 2006, IBM Corp.
4.3 SMT multithreaded processors
4.3.1. Intel Pentium 4
4.3.2. Alpha 21464 (V8)
4.3.3. IBM Power5
Coarse grained MT cores
Fine grained MT cores SMT cores
Thread scheduling in multithreaded cores
4.3. Simultaneously multithreaded processors
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores),
followed by the Northwood core intended for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (1)
Figure 4.3.1: Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous pointers and control logic, but these are too small to be pointed out.
Source: Koufaty D. and Marr D.T., „Hyperthreading Technology in the Netburst Microarchitecture,” IEEE Micro, Vol. 23, No. 2, March-April 2003, pp. 56-65.
4.3.1. Intel Pentium 4 (2)
• Further enhancements to support MT (thread microstate):
• TC-entries (Trace cache) are tagged,
• BHB (Branch History Buffer) is duplicated,
• Global History Table is tagged,
• RAS (Return Address Stack) is duplicated,
• Rename tables are duplicated,
• ROB is tagged.
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores),
followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (3)
Figure 4.3.2: SMT pipeline in Intel’s Pentium 4/HT
Source: Marr D.T. et al., „Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16.
4.3.1. Intel Pentium 4 (4)
Additional die area required for MT: less than 5 %.
Single thread/dual thread modes: To prevent single thread performance degradation,
in single thread mode partitioned resources are recombined.
• Further enhancements to support MT (thread microstate):
• TC-entries (Trace cache) are tagged,
• BHB (Branch History Buffer) is duplicated,
• Global History Table is tagged,
• RAS (Return Address Stack) is duplicated,
• Rename tables are duplicated,
• ROB is tagged.
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (5)
Alpha 21264 Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (1)
Figure 4.3.3: SMT pipeline in the Alpha 21464 (V8)
(Pipeline stages shown: Fetch, Decode/Map, Queue, RegRead, Execute, Dcache/Store Buffer, RegWrite, Retire; with the Icache, Dcache, PC, Register Map and register files.)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com
4.3.2. Alpha 21464 (V8) (2)
Alpha 21264Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing replicated (4 x) thread microstates for:
Register Maps,
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (3)
Figure 4.3.4: SMT pipeline in the Alpha 21464 (V8)
(Pipeline stages shown: Fetch, Decode/Map, Queue, RegRead, Execute, Dcache/Store Buffer, RegWrite, Retire; with the Icache, Dcache, PC, Register Map and register files.)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com
4.3.2. Alpha 21464 (V8) (4)
Alpha 21264Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing replicated (4 x) thread microstates for:
Register Maps,
Additional core area needed for SMT: ~ 6 %
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (5)
POWER5 enhancements vs the POWER4:
• on-chip memory control,
4.3.3. IBM POWER5 (1)
Figure 4.3.6: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47.
Fabric Controller
4.3.3. IBM POWER5 (4)
POWER5 enhancements vs the POWER4:
• on-chip memory control,• exclusive L3 cache,
4.3.3. IBM POWER5 (3)
Figure 4.3.5: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47.
Fabric Controller
4.3.3. IBM POWER5 (2)
Inclusive L3 cache
Exclusive L3 cache
POWER5 enhancements vs the POWER4:
• on-chip memory control,• exclusive L3 cache,• dual threaded.
4.3.3. IBM POWER5 (5)
Figure 4.3.7: Microarchitecture of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (6)
Figure 4.3.8: IBM POWER5 Chip
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (7)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
4.3.3. IBM POWER5 (8)
Figure 4.3.9: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (9)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated architectural states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
4.3.3. IBM POWER5 (10)
Figure 4.3.10: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (11)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
• Providing increased (in fact duplicated) sizes for scarce or sensitive resources, such as:
Instruction Buffer, Store Queue
4.3.3. IBM POWER5 (12)
Figure 4.3.11: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (13)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
Additional core area needed for SMT: ~ 10 %
• Providing increased (duplicated) size for scarce or sensitive resources, such as:
Instruction Buffer, Store Queue
4.3.3. IBM POWER5 (14)
Unbalanced execution of threads:
(an enhancement of the single mode/dual mode thread execution model)
• Threads have 8 priority levels (0...7) controlled by HW/SW,
• the decode rate of each thread is controlled according to the associated priority.
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
Figure 4.3.12: Unbalanced execution of threads in IBM’s POWER5
4.3.3. IBM POWER5 (15)
Difference inthread priority
Development effort:
• Concept phase: ~ 10 persons / 4 months
• High level design phase: ~ 50 persons / 6 months
• Implementation phase: ~ 200 persons / 12-18 months
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (16)