Multithreaded Processors
Dezső Sima
Spring 2007
(Ver. 2.1) Dezső Sima, 2007
Overview
1. Introduction
2. Overview of multithreaded cores
3. Thread scheduling
4. Case examples
4.1. Coarse grained multithreaded cores
4.2. Fine grained multithreaded cores
4.3. SMT cores
1. Introduction
Aim of multithreading
To raise performance (beyond superscalar or EPIC execution) by introducing and utilizing finer grained parallelism than multitasking at execution.
Thread
A flow of control, i.e. a dynamic sequence of instructions to be executed.
1. Introduction (1)
(Process/thread management example: in sequential programming, processes P1, P2, P3 run one after another; in multitasked programming, processes P1-P3 are created and managed with calls such as CreateProcess(), fork(), exec() and join(); in multithreaded programming, a process spawns threads T1-T6 with calls such as CreateThread().)
1. Introduction (2)
Figure 1.1: Principle of sequential, multitasked and multithreaded programming
Main features of multithreading
Threads
• belong to the same process,
• usually share a common address space (otherwise multiple address translation paths (virtual to real) need to be maintained concurrently),
• are executed concurrently (simultaneously, i.e. overlapped by time sharing, or in parallel), depending on the implementation of multithreading.
Main tasks of thread management
• creation, control and termination of individual threads,
• context switching between threads,
• maintaining multiple sets of thread states.
Basic thread states
• thread program state (state of the ISA), including the PC, FX/FP architectural registers and state registers,
• thread microstate (supplementary state of the microarchitecture), including the rename register mappings, branch history, ROB etc.
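The states listed above can be pictured as a per-thread data structure, together with the copy work a software context switch has to do (a hypothetical sketch; field names are illustrative, not taken from a real ISA):

```c
/* Hypothetical per-thread program state. A multithreaded core keeps one such
 * structure per thread in hardware; in software multithreading the OS must
 * copy it on every context switch. */
#include <stdint.h>

typedef struct {
    uint64_t pc;        /* program counter */
    uint64_t gpr[32];   /* FX architectural registers */
    double   fpr[32];   /* FP architectural registers */
    uint64_t sr;        /* state/status register(s) */
} thread_state;

/* Software context switch: save the state of the suspended thread and load
 * the state of the thread to be executed next into the (single) set of
 * architectural registers. */
void context_switch(thread_state *arch, thread_state *save, const thread_state *next)
{
    *save = *arch;   /* save suspended thread */
    *arch = *next;   /* load next thread */
}
```

The thread microstate (rename mappings, branch history, ROB) is not copied this way; the OS cannot reach it, which is why it is either flushed or tagged in hardware multithreading.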
1. Introduction (3)
Implementation of multithreading
Software multithreading: execution of multithreaded apps/OSs on a single threaded processor; the threads run simultaneously (i.e. by time sharing). Fast context switching between threads is required. Multiple threads are maintained simultaneously by the OS (while executing multithreaded apps/OSs) → multithreaded OSs.
Hardware multithreading: execution of multithreaded apps/OSs on a multithreaded processor; the threads run concurrently. Multiple threads are maintained concurrently by the processor → multithreaded processors.
1. Introduction (4)
Multithreaded processors
• Multicore processors (SMP: Symmetric Multiprocessing, CMP: Chip Multiprocessing): two or more cores on a chip, sharing the L2/L3 caches and the L3/memory interface.
• Multithreaded cores: a single MT core on the chip, with its L2/L3 caches and L3/memory interface.
1. Introduction (5)
Requirement of software multithreading
Maintaining multiple thread program states concurrently by the OS, including the PC, the FX/FP architectural registers and the state registers.
1. Introduction (6)
Core enhancements needed in multithreaded cores
• Maintaining multiple thread program states concurrently by the processor, including the PC, the FX/FP architectural registers and the state registers.
• Maintaining multiple thread microstates, pertaining to the rename register mappings, the RAS (Return Address Stack), the ROB, etc.
• Providing increased sizes for scarce or sensitive resources, such as the instruction buffer and the store queue; in case of merged architectural and rename registers, appropriately large register file sizes (FX/FP), etc.
Options to provide multiple states
• Implementing individual per-thread structures, like 2 or 4 sets of FX registers.
• Implementing tagged structures, like a tagged ROB, a tagged buffer etc.
Multicore processors vs. multithreaded cores
• Additional complexity: ~(60-80) % (multicore processors) vs. ~(2-10) % (multithreaded cores)
• Additional gain (in general purpose apps): ~(60-80) % vs. ~(0-30) %
1. Introduction (7)
Multithreaded OSs
• Windows NT
• OS/2
• Unix w/Posix
• most OSs developed from the 90’s on
1. Introduction (8)
Sequential programs
• Description: a single process on a single processor.
• Key features: no issues with parallel programs.
• Key issue: the sequential bottleneck.
Multitasked programs
• Description: multiple processes on a single processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; private address spaces.
• Key issue: solutions for fast context switching.
Multithreaded programs, software multithreading
• Description: multithreaded software on a single threaded processor using time sharing.
• Key features: multiple programs with quasi-parallel execution; shared process address spaces; thread context switches needed.
• Key issue: thread state management and context switching.
Multithreaded programs, hardware multithreading on a multithreaded core
• Description: multithreaded software on a multithreaded core.
• Key features: simultaneous execution of threads; threads share address space; no thread context switches needed (except coarse grained MT).
• Key issues: thread scheduling; intra-core communication.
Multithreaded programs, hardware multithreading on a multicore processor
• Description: multithreaded software on a multicore processor.
• Key features: true parallel execution of threads; threads share address space; no thread context switches needed.
• Key issues: thread scheduling; intra-core communication.
1. Introduction (9)
Figure 1.2: Contrasting sequential, multitasked and multithreaded execution (1)
Sequential programs
• OS support: legacy OSs.
• Software development: no API level support.
• Performance level: low.
Multitasked programs
• OS support: traditional Unix; most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process life cycle management API.
• Performance level: low-medium.
Multithreaded programs, software multithreading
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: high.
Multithreaded programs, hardware multithreading on a multithreaded core
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: higher.
Multithreaded programs, hardware multithreading on a multicore processor
• OS support: most modern OSs (Windows NT/2000, OS/2, Unix w/Posix).
• Software development: process and thread management API; explicit threading API; OpenMP.
• Performance level: highest.
1. Introduction (10)
Figure 1.3: Contrasting sequential, multitasked and multithreaded execution (2)
2. Overview of multithreaded cores
Figure 2.1: Intel’s multithreaded desktop families
• SCMT: Pentium 4 (Northwood B), 11/02; 55 mtrs./82 W; 130 nm/146 mm2; 2-way MT
• SCMT: Pentium 4 (Prescott), 02/04; 125 mtrs./103 W; 90 nm/112 mm2; 2-way MT
• DCMT: Pentium EE 840 (Smithfield), 5/05; 230 mtrs./130 W; 90 nm/2*103 mm2; 2-way MT/core
• DCMT: Pentium EE 955/965 (Presler), 1/06; 2*188 mtrs./130 W; 65 nm/2*81 mm2; 2-way MT/core
2. Overview of multithreaded cores (1)
Figure 2.2: Intel’s multithreaded Xeon DP-families
• SCMT: Pentium 4 (Prestonia-A), 2/02; 55 mtrs./55 W; 130 nm/146 mm2; 2-way MT
• SCMT: Pentium 4 (Irwindale-A), 11/03; 169 mtrs./110 W; 130 nm/135 mm2; 2-way MT
• SCMT: Pentium 4 (Nocona), 6/04; 125 mtrs./103 W; 90 nm/112 mm2; 2-way MT
• DCMT: Xeon DP 2.8 (Paxville DP), 10/05; 2*169 mtrs./135 W; 90 nm/2*135 mm2; 2-way MT/core
• DCMT: Xeon 5000 (Dempsey), 6/06; 2*188 mtrs./95/130 W; 65 nm/2*81 mm2; 2-way MT/core
2. Overview of multithreaded cores (2)
Figure 2.3: Intel’s multithreaded Xeon MP-families
• SCMT: Pentium 4 (Foster-MP), 3/02; 108 mtrs./64 W; 180 nm/n/a; 2-way MT
• SCMT: Pentium 4 (Gallatin), 3/04; 178/286 mtrs./77 W; 130 nm/310 mm2; 2-way MT
• SCMT: Pentium 4 (Potomac), 3/05; 675 mtrs./95/129 W; 90 nm/339 mm2; 2-way MT
• DCMT: Xeon 7000 (Paxville MP), 11/05; 2*169 mtrs./95/150 W; 90 nm/2*135 mm2; 2-way MT/core
• DCMT: Xeon 7100 (Tulsa), 8/06; 1328 mtrs./95/150 W; 65 nm/435 mm2; 2-way MT/core
2. Overview of multithreaded cores (3)
Figure 2.4: Intel’s multithreaded EPIC based server family
• DCMT: 9x00 (Montecito), 7/06; 1720 mtrs./104 W; 90 nm/596 mm2; 2-way MT/core
2. Overview of multithreaded cores (4)
Figure 2.5: IBM’s multithreaded server families
• SCMT: RS64 IV (Sstar), 2000; 44 mtrs./n/a; 180 nm/n/a; 2-way MT
• DCMT: POWER5, 5/04; 276 mtrs./80 W (est.); 130 nm/389 mm2; 2-way MT/core
• DCMT: POWER5+, 10/05; 276 mtrs./70 W; 90 nm/230 mm2; 2-way MT/core
• SCMT: Cell BE PPE, 2006; 234* mtrs./95* W; 90 nm/221* mm2; 2-way MT (*: entire processor)
• DCMT: POWER6, 2007; 750 mtrs./~100 W; 65 nm/341 mm2; 2-way MT/core
2. Overview of multithreaded cores (5)
Figure 2.6: Sun’s and Fujitsu’s multithreaded server families
• 8CMT: UltraSPARC T1 (Niagara), 11/2005; 279 mtrs./63 W; 90 nm/379 mm2; 4-way MT/core
• 8CMT: UltraSPARC T2 (Niagara II), 2007; 72 W (est.); 65 nm/342 mm2; 8-way MT/core
• DCMT: APL SPARC64 VI (Olympus), 2007; 540 mtrs./120 W; 90 nm/421 mm2; 2-way MT/core
• QCMT: APL SPARC64 VII (Jupiter), 2008; ~120 W; 65 nm/464 mm2; 2-way MT/core
2. Overview of multithreaded cores (6)
Figure 2.7: RMI’s multithreaded XLR family (scalar RISC)
• 8CMT: XLR 5xx, 5/05; 333 mtrs./10-50 W; 90 nm/~220 mm2; 4-way MT/core
2. Overview of multithreaded cores (7)
Figure 2.8: DEC’s/Compaq’s multithreaded processor
• SCMT: Alpha 21464 (V8), planned for 2003; 250 mtrs./10-50 W; 130 nm/n/a; 4-way MT; cancelled 6/2001
2. Overview of multithreaded cores (8)
Underlying core(s)
• Scalar core(s):
SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
RMI XLR 5xx (2005), 8 cores/4T
IBM RS64 IV (2000) (SStar), single-core/2T
• Superscalar core(s):
Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
DEC 21464 (2003), single-core/4T
IBM POWER5 (2005), dual-core/2T
PPE of Cell BE (2006), single-core/2T
Fujitsu SPARC64 VI / VII, dual-core/quad-core/2T
• VLIW core(s):
SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
Intel Montecito (2006), dual-core/2T
2. Overview of multithreaded cores (9)
3. Thread scheduling
Thread scheduling in software multithreading on a traditional superscalar processor
The execution of a new thread is initiated by a context switch (needed to save the state of the suspended thread and to load the state of the thread to be executed next).
Figure 3.1: Thread scheduling assuming software multithreading on a 4-way superscalar processor
(The figure plots dispatch slots against clock cycles: Thread1 runs, a context switch follows, then Thread2 runs.)
3. Thread scheduling (1)
Thread scheduling in multicore processors (CMPs)
Both superscalar cores execute different threads independently.
Figure 3.2: Thread scheduling in a dual core processor
(The figure plots the dispatch slots of both cores against clock cycles: Thread1 and Thread2 run in parallel.)
3. Thread scheduling (2)
Coarse grained MT
Thread scheduling in multithreaded cores
3. Thread scheduling (3)
Figure 3.3: Thread scheduling in a 4-way coarse grained multithreaded processor
Threads are switched by means of rapid, HW-supported context switches.
3. Thread scheduling (4)
(The figure plots dispatch/issue slots against clock cycles: Thread1 runs, a HW-supported context switch follows, then Thread2 runs.)
Coarse grained MT
• Scalar based:
IBM RS64 IV (2000) (SStar), single-core/2T
• Superscalar based: (none)
• VLIW based:
SUN MAJC 5200 (2000), quad-core/4T (dedicated use)
Intel Montecito (2006), dual-core/2T
3. Thread scheduling (5)
Coarse grained MT Fine grained MT
Thread scheduling in multithreaded cores
3. Thread scheduling (6)
Figure 3.4: Thread scheduling in a 4-way fine grained multithreaded processor
The hardware thread scheduler chooses a thread in each cycle, and instructions from this thread are dispatched/issued in this cycle.
3. Thread scheduling (7)
(The figure plots dispatch/issue slots against clock cycles: in each cycle all slots are filled from one of Thread1-Thread4.)
Fine grained MT
Selection policy: round robin or priority based.
• Scalar based:
SUN UltraSPARC T1 (2005) (Niagara), up to 8 cores/4T
• Superscalar based:
PPE of Cell BE (2006), single-core/2T
• VLIW based: (none)
3. Thread scheduling (8)
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
3. Thread scheduling (9)
Figure 3.5: Thread scheduling in a 4-way simultaneous multithreaded processor
Available instructions (chosen according to an appropriate selection policy, such as the priority of the threads) are dispatched/issued for execution in each cycle.
SMT: Proposed by Tullsen, Eggers and Levy in 1995 (U. of Washington).
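The difference between fine grained MT and SMT can be put into a toy issue-slot model (illustrative only; ready[i] is the number of issueable instructions thread i has in a given cycle):

```c
/* Toy model contrasting fine grained MT (all issue slots of a cycle go to
 * one thread) with SMT (slots may be filled from several threads in the
 * same cycle). Both functions return how many of the `width` slots are
 * filled in one cycle. */

static int min(int a, int b) { return a < b ? a : b; }

/* Fine grained MT: the scheduler picks one thread for the whole cycle. */
int slots_fine_grained(const int ready[], int nthreads, int width)
{
    int best = 0;
    for (int i = 0; i < nthreads; i++)
        if (ready[i] > best) best = ready[i];
    return min(best, width);
}

/* SMT: slots are filled from all threads until the width is exhausted. */
int slots_smt(const int ready[], int nthreads, int width)
{
    int used = 0;
    for (int i = 0; i < nthreads && used < width; i++)
        used += min(ready[i], width - used);
    return used;
}
```

For example, with ready instruction counts {2, 1, 1, 0} on a 4-wide core, fine grained MT fills only 2 slots in the cycle, while SMT fills all 4; this is exactly the horizontal waste SMT was proposed to remove.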
3. Thread scheduling (10)
(The figure plots dispatch/issue slots against clock cycles: slots within the same cycle are filled with instructions from Thread1-Thread4.)
SMT cores
• Scalar based: (none)
• Superscalar based:
Pentium 4 based processors, single-core/2T (2002-), dual-core/2T (2005-)
DEC 21464 (2003), single-core/4T (cancelled in 2001)
IBM POWER5 (2005), dual-core/2T
• VLIW based: (none)
3. Thread scheduling (11)
4. Case examples
4.1. Coarse grained multithreading
4.2. Fine grained multithreading
4.3. SMT multithreading
4.1 Coarse grained multithreaded processors
4.1.1. IBM RS64 IV
4.1.2. SUN MAJC 5200
4.1.3. Intel Montecito
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
4.1. Coarse grained multithreaded processors
Optimized for commercial server workloads, such as on-line transaction processing, Web serving and ERP (Enterprise Resource Planning).
Used in IBM’s iSeries and pSeries commercial servers.
Characteristics of server workloads:
• large working sets,
• poor locality of references and
• frequently occurring task switches,
resulting in high cache miss rates; memory bandwidth and latency strongly limit performance. Consequences: need for high instruction and data fetch bandwidth, need for large L1 $s, and using multithreading to hide memory latency.
Microarchitecture
4-way superscalar, dual-threaded.
4.1.1. IBM RS 64 IV (1)
Main microarchitectural features of the RS64 IV to support commercial workloads:
• 128 KB L1 D$ and L1 I$,
• instruction fetch width: 8 instr./cycle,
• dual-threaded core.
4.1.1. IBM RS 64 IV (2)
Figure 4.1.1: Microarchitecture of IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
6XX bus. IERAT: effective-to-real address translation cache (2x64 entries).
4.1.1. IBM RS 64 IV (3)
Multithreading policy (strongly simplified)
Coarse grained MT with two Ts: a foreground T and a background T. The foreground T executes until a long latency event, such as a cache miss or an IERAT miss, occurs. Subsequently, a T switch is performed and the background T begins to execute. After the long latency event is serviced, a T switch occurs back to the foreground T.
Implementation of multithreading
Additional die area needed for multithreading: ~ +5 %.
Dual architectural states are maintained for:
• GPRs, FPRs, CR (condition reg.), CTR (count reg.),
• special purpose privileged mode registers, such as the MSR (machine state reg.),
• status and control registers, such as the T priority.
Each T executes in its own effective address space (an unusual feature of multithreaded cores), so units used for address translation need to be duplicated, such as the SRs (Segment Address Regs).
Both single threaded and multithreaded modes of execution are supported.
The Thread Switch Buffer holds up to 8 instructions from the background T, to shorten context switching by eliminating the latency of the I$.
Threads can be allocated different priorities by explicit instructions.
4.1.1. IBM RS 64 IV (4)
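The switch-on-miss policy described above can be put into a toy model (illustrative numbers only; it assumes the switch away and the switch back each cost the 8-cycle penalty shown in Figure 4.1.2):

```c
/* Toy model of the gain from coarse grained, switch-on-miss multithreading:
 * while the foreground thread waits miss_latency cycles for memory, the
 * background thread can do useful work in all but 2*penalty of them
 * (one switch away, one switch back). */

int useful_cycles_single(int compute, int miss_latency)
{
    (void)miss_latency;   /* miss cycles are pure stall in the single-T case */
    return compute;
}

int useful_cycles_dual(int compute, int miss_latency, int penalty)
{
    int hidden = miss_latency - 2 * penalty;   /* work done by the background T */
    if (hidden < 0) hidden = 0;                /* penalty can eat the whole miss */
    return compute + hidden;
}
```

With 20 compute cycles per 75-cycle miss window and an 8-cycle switch penalty, dual threading raises the useful cycles per window from 20 to 79; note that the gain vanishes if the penalty approaches half the miss latency, which is why the Thread Switch Buffer matters.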
Figure 4.1.2: Thread switch on data cache miss in IBM’s RS 64 IV
Source: Borkenhagen J.M. et al. „A multithreaded PowerPC processor for commercial servers”, IBM J.Res.Develop. Vol. 44. No. 6. Nov. 2000, pp. 885-898
4.1.1. IBM RS 64 IV (5)
(Thread switch penalty: 8 cycles; execution rate: 2 instructions/cycle.)
Aim:
Dedicated use, high-end graphics, networking with wire-speed computational demands.
Microarchitecture:
Each processor is a 4-wide VLIW and can be 4-way multithreaded.
• up to 4 processors on a die,
• each processor has 4 FUs (Functional Units); 3 of them are identical, one is enhanced,
• each FU has its private logic and register set (e.g. 32 or 64 regs.),
• the 4 FUs of a processor share a set of global regs. (e.g. 64 regs.),
• all registers are unified (not split into FX/FP files),
• any FU can process any data type.
4.1.2. SUN MAJC 5200 (1)
Figure 4.1.3: General view of SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (2)
Figure 4.1.4: The principle of private, unified register files associated with each FU
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (3)
Threading
Each processor with its 4 FUs can be operated in a 4-way multithreaded mode (called Vertical Multithreading by Sun): each T is executed by one of the 4 FUs.
Thread switch
Following a cache miss, the processor saves the T state and begins to process the next T.
Example
Comparison of program execution without and with multithreading on a 4-wide VLIW.
Considered program:
• it consists of 100 instructions,
• 2.5 instrs./cycle are executed on average,
• a cache miss occurs after every 20 instructions,
• latency of serving a cache miss: 75 cycles.
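The example’s single threaded numbers can be checked with a short sketch:

```c
/* Single threaded execution time of the considered program:
 * 100 instructions at 2.5 instr./cycle, with a 75-cycle cache miss
 * after every 20 instructions. */
int single_thread_cycles(int instrs, double ipc, int miss_interval, int miss_latency)
{
    int compute = (int)(instrs / ipc);        /* 100/2.5 = 40 compute cycles */
    int misses  = instrs / miss_interval;     /* 100/20  = 5 misses          */
    return compute + misses * miss_latency;   /* 40 + 5*75 = 415 cycles      */
}
```

So only 40 of the 415 cycles (under 10 %) do useful work; vertical multithreading aims to fill the 375 stall cycles with the other threads, as the following two figures contrast.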
4.1.2. SUN MAJC 5200 (4)
Figure 4.1.5: Execution for subsequent cache misses in a single threaded processor
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (5)
Figure 4.1.6: Execution for subsequent cache misses in SUN’s MAJC 5200
Source: “MAJC Architecture Tutorial,” Whitepaper, Sun Microsystems, Inc
4.1.2. SUN MAJC 5200 (6)
Aim:
High end servers.
Main enhancements of Montecito over Itanium 2:
• split L2 caches for data and instructions,
• larger unified L3 cache (for each core),
• duplicated architectural states maintained for the FX/FP registers, branch and predicate registers and the next address register,
• (Foxton technology for power management/frequency boost, planned but not implemented).
Additional support for dual-threading (duplicated microarchitectural states):
• the branch prediction structures provide T tagging,
• per-thread return address stacks,
• per-thread ALATs (Advanced Load Address Table).
Additional core area needed for multithreading: ~ 2 %.
4.1.3. Intel Montecito (1)
Figure 4.1.7: Microarchitecture of Intel’s Itanium 2
Source: McNairy, C., „Itanium 2”, IEEE Micro, March/April 2003, Vol. 23, No. 2, pp. 44-55
4.1.3. Intel Montecito (2)
Figure 4.1.8: Microarchitecture of Intel’s Montecito (ALAT: Advanced Load Address Table)
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.1.3. Intel Montecito (3)
Thread switches
Five event types cause thread switches, such as L3 cache misses and programmed switch hints. If the control logic detects that a thread does not make progress, a thread switch is also initiated.
Total switch penalty: 15 cycles.
Example of thread switching
4.1.3. Intel Montecito (4)
Figure 4.1.9: Thread switch in Intel’s Montecito vs single thread execution
Source: McNairy, C., „Montecito”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 10-20
4.1.3. Intel Montecito (5)
4.2 Fine grained multithreaded processors
4.2.1. SUN Ultrasparc T1
4.2.2. PPE of Cell BE
Coarse grained MT Fine grained MT Simultaneous MT (SMT)
Thread scheduling in multithreaded cores
4.2. Fine grained multithreaded processors
Aim
Commercial server applications, such as
• web serving,
• transaction processing,
• ERP (Enterprise Resource Planning),
• DSS (Decision Support Systems).
Characteristics of commercial server applications
• large working sets,
• poor locality of memory references,
resulting in high cache miss rates and low prediction accuracy for data dependent branches. Memory latency strongly limits performance, hence multithreading is used to hide it.
4.2.1. SUN UltraSPARC T1 (1)
4.2.1. SUN UltraSPARC T1 (2)
Figure 4.2.1: Block diagram of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (3)
Structure
• 8 scalar cores, 4-way multithreaded each.
• All 32 threads share an L2 cache of 3 MB, built up of 4 banks.
• 4 memory channels with on-chip DDR2 memory controllers.
It runs under Solaris.
4.2.1. SUN UltraSPARC T1 (4)
Figure 4.2.2: SUN’s UltraSPARC T1 chip
Source: www.princeton.edu/~jdonald/research/hyperthreading/romanescu_niagara.pdf
4.2.1. SUN UltraSPARC T1 (5)
Processor Elements (SPARC pipes):
• Scalar FX-units, 6-stage pipeline• all Processor Elements share a single FP-unit
4.2.1. SUN UltraSPARC T1 (6)
Figure 4.2.3: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (7)
4.2.1. SUN UltraSPARC T1 (8)
Figure 4.2.4: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (9)
Processor Elements (SPARC pipes):
• scalar FX-units, 6-stage pipeline,
• all Processor Elements share a single FP-unit.
Each thread of a processor element has its private:
• PC-logic,
• register file,
• instruction buffer,
• store buffer.
No thread switch penalty!
4.2.1. SUN UltraSPARC T1 (10)
Thread switch:
Threads are switched on a per cycle basis.
Selection of threads:
In the thread select pipeline stage
• the thread select multiplexer selects a thread from the set of available threads in each clock cycle and issues the subsequent instr. of this thread from the instruction buffer into the pipeline for execution, and
• fetches the following instr. of the same thread into the instruction buffer.
4.2.1. SUN UltraSPARC T1 (11)
Figure 4.2.5: Microarchitecture of the core of SUN’s UltraSPARC T1
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (12)
Threads become unavailable due to:
• long-latency instructions, such as loads, branches, multiplies and divides,
• pipeline stalls because of cache misses, traps or resource conflicts.
Thread selection policy: least recently used.
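The least-recently-used selection can be sketched as follows (a toy model of the thread-select stage; the real logic also handles the speculation shown in the next examples):

```c
/* Least-recently-used thread select among the available threads.
 * last_used[i] is the cycle in which thread i was last selected;
 * available[i] is nonzero if thread i is not blocked on a long-latency
 * instruction or a stall. Returns the selected thread, or -1 if none. */
int select_thread(const int available[], const int last_used[], int nthreads)
{
    int pick = -1;
    for (int i = 0; i < nthreads; i++) {
        if (!available[i]) continue;
        if (pick < 0 || last_used[i] < last_used[pick])
            pick = i;   /* prefer the least recently selected thread */
    }
    return pick;
}
```

With all threads available and equal history, this degenerates to round robin, which matches the all-threads-available example that follows.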
1. Example:
• all 4 threads are available.
4.2.1. SUN UltraSPARC T1 (13)
Figure 4.2.6: Thread switch in the SUN’s UltraSPARC T1 when all threads are available
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
4.2.1. SUN UltraSPARC T1 (14)
2. Example:
• only 2 threads are available,
• speculative execution of instructions following a load.
(Data referenced by a load instruction arrives in the third cycle after decoding, assuming a cache hit. So, after issuing a load, the thread becomes unavailable for the two subsequent cycles.)
4.2.1. SUN UltraSPARC T1 (15)
Figure 4.2.7: Thread switch in the SUN’s UltraSPARC T1 when only two threads are available
Source: Kongetira P., et al. „Niagara”, IEEE Micro, March/April 2005, Vol. 25, No. 2, pp. 21-29
(Thread t0 issues a ld instruction and becomes unavailable for two cycles. The add instruction from thread t0 is speculatively switched into the pipeline, assuming a cache hit.)
4.2.1. SUN UltraSPARC T1 (16)
4.2.2. Cell BE
Overview of the Cell BE
Processor components
Multithreading the PPE
Programming models
Implementation of the Cell BE
Cell BE
A collaborative effort from Sony, IBM and Toshiba.
Objective: speeding up game/multimedia apps.
Goal: 100 times the PS2 performance.
Used in the PlayStation 3 (PS3) and in the QS20 Blade Server.
History
Summer 2000: High level architectural discussions
End 2000: Architectural concept
March 2001: Design Center opened in Austin, TX
Spring 2004: Single Cell BE operational
Summer 2004: 2-way SMP operational
Febr. 2005: First technical disclosures
Oct. 2005: Mercury announces Cell Blade
Nov. 2005: Open Source SDK & Simulator published
Febr. 2006: IBM announced Cell Blade QS20
Cell BE at NIK
May 2007: QS20 arrives at NIK within IBM’s loan program
Overview of the Cell BE (1)
• 9 cores:
the PPE (Power Processing Element), a dual threaded, dual issue 64-bit PowerPC compliant processor, and
8 SPEs (Synergistic Processing Elements), single threaded, dual issue 128-bit SIMD processors,
• the EIB (Element Interconnect Bus), an on-chip interconnection network,
• the MIC (Memory Interface Controller), a memory controller supporting dual Rambus XDR channels, and
• the BIC (Bus Interface Controller), which interfaces the Rambus Flex IO bus.
Overview of the Cell BE (2)
Main functional units of the Cell BE
EIB: Element Interconnect Bus
Figure 4.2.8: Block diagram of the Cell BE [4.2.2.1]
SPE: Synergistic Processing Element
SPU: Synergistic Processor Unit
SXU: Synergistic Execution Unit
LS: Local Store of 256 KB
SMF: Synergistic Mem. Flow Unit
PPE: Power Processing Element
PPU: Power Processing Unit
PXU: POWER Execution Unit
MIC: Memory Interface Contr.
BIC: Bus Interface Contr.
XDR: Rambus DRAM
Overview of the Cell BE (3)
Unique features of the Cell BE
a) Heterogeneous MCP rather than a symmetrical MCP (as usual implementations)
The PPE
• is optimized to run a 32/64-bit OS,
• usually controls the SPEs,
• complies with the 64-bit PowerPC ISA.
The SPEs
• are optimized to run compute intensive SIMD apps.,
• usually operate under the control of the PPE,
• run their individual apps. (threads),
• have full access to a coherent shared memory, including the memory mapped I/O-space,
• can be programmed in C/C++.
Contrasting the PPE and the SPEs
• the PPE is more adept at control-intensive tasks and quicker at task switching,
• the SPEs are more adept at compute intensive tasks and slower at task switching.
Overview of the Cell BE (4)
b) The SPEs have an unusual storage architecture, as
• SPEs operate in connection with a local store (LS) of 256 KB, i.e.
o they fetch instructions from their private LS and
o their Load/Store instructions access their LS rather than the main store,
• SPEs access main memory (effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS; DMA commands can be batched (up to 16 commands),
• the LS has no associated cache.
Overview of the Cell BE (5)
Although the PPE and the SPEs have coherent access to main memory, the Cell BE is not a traditional shared-memory multiprocessor, as the SPEs operate on their LS rather than on the main memory.
Overview of the Cell BE (6)
PPE (Power Processing Element) [4.2.2.2]
• Fully compliant 64-bit Power processor (Architecture Specification 2.02).
• fc = 3.2 GHz (11 FO4 design, 23 pipeline stages).
• Dual-issue, in-order superscalar, two-way (fine grained) multithreaded core.
• Conventional cache architecture: 32 KB I$, 32 KB D$, 512 KB unified L2.
Processor components of the Cell BE (1)
Figure 4.2.9: Main functional units of the PPE [4.2.2.3]
Processor components of the Cell BE (2)
Main components of the PPE
• IU (Instruction Unit)
predecodes instructions while loading them from the L2 cache into the L1 cache,
fetches 4 instructions per cycle, alternating between the two threads, from the L1 instr. cache into two instruction buffers (one for each thread),
dispatches instructions from the two instruction buffers to the shared decode, dependency checking and issue pipeline according to the thread scheduling rules.
• Microcode Engine
Instructions that are either difficult to implement in hardware or rarely used are split into a few simple PowerPC instructions, which are stored in a ROM (such as Load String or several Condition Register (CR) instructions).
Most microcoded instructions are split into two or three microcoded instructions.
The Microcode Engine inserts microcoded instructions from one thread into the instruction flow with a delay of 11 clock cycles. It stalls dispatching from the instruction buffers until the last microcode of the microcoded instruction is dispatched. The next dispatch cycle belongs to the thread that did not invoke the Microcode Engine.
• Shared decode, dependency checking and issue pipeline
Receives dispatched instructions (up to two in each cycle, from the same thread), decodes them, checks for dependencies, and issues instructions for execution according to the issue rules.
Processor components of the Cell BE (3)
• XU (FX Execution Unit)
32x64-bit register file per thread.
FXU (FX Unit)
LSU (L/S Unit)
BRU (Branch Unit): per-thread branch prediction (6-bit global history, 4 K x 2-bit history table)
• VSU (Vector Scalar Unit)
FPU (FP Unit):
o 32x64-bit register file per thread,
o 10-stage double precision pipeline.
VMX (Vector-Media Execution Unit), also called the VXU (Vector Execution Unit):
o 32x128-bit vector register file per thread,
o simple, complex, permute and single-precision FP subunits,
o 128-bit SIMD instructions with varying data width (2x64-bit, 4x32-bit, 8x16-bit, 16x8-bit, 128x1-bit).
VMX/FPU issue queue, also called the VSU (Vector-Scalar Unit) issue queue (two entries).
Processor components of the Cell BE (4)
Basic operation of the PPE
Instruction fetch
• Instruction fetch operates autonomously in order to keep each thread’s instruction buffer full with useful instructions that are likely to be needed.
• 4 instr./cycle are fetched, strictly alternating between the two threads, from the L1 I$ into the private instruction buffers of the threads.
• The fetch address is taken from the Instruction Fetch Address Registers associated with each thread (IFAR0, IFAR1). The IFARs are distinct from the Program Counters (PC) associated with both threads; the PCs track the actual program flow while the IFARs track the predicted instruction execution flow.
• Accessing the taken path after a predicted-taken branch requires 8 cycles.
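The strictly alternating fetch policy can be sketched as a toy model (illustrative only; it ignores buffer limits, branches and stalls):

```c
/* Strictly alternating fetch between two threads, 4 instructions per fetch
 * cycle, starting with thread 0. Returns how many instructions thread `t`
 * (0 or 1) has accumulated in its instruction buffer after `cycles` cycles. */
int fetched_for_thread(int t, int cycles)
{
    int turns = cycles / 2 + ((cycles % 2) && t == 0 ? 1 : 0);
    return 4 * turns;   /* 4 instructions per fetch cycle */
}
```

After 5 cycles, thread 0 has had 3 fetch turns (12 instructions buffered) and thread 1 has had 2 (8 instructions): the alternation gives each thread half the fetch bandwidth regardless of its progress.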
Processor components of the Cell BE (5)
Instruction dispatch
• Moves up to two instructions either from one of the Instruction Buffers or the Microcode Engine (complex instructions) to the shared decode, dependency check and issue pipeline.
• Instruction dispatch is governed by the dispatch rules (thread scheduling rules).
• The dispatch rules take into account thread priority and stall conditions(see slide 115).
• Each pipeline stage beyond the dispath point contains instructions from one thread only.
Instruction decode and dependency checking
Decoding of up to two instructions from the same tread in each cycleand checking for dependencies
Processor components of the Cell BE (6)
Figure 4.2.10: Instruction flow in the PPE [4.2.2.4]
IU: Instruction Unit
VSU: Vector Scalar Unit
XU: FX Execution Unit
VXU: Vector Execution Unit
FPU: FP Execution Unit
BRU: Branch Unit
FXU: FX Execution Unit
LSU: L/S Execution Unit
(IFAR: Instr. Fetch Addr.)
ibuf: Instr. Buffer
IC: Instruction cache
IB: Instruction buffer
ID: Instruction decode
IS: Instruction issue
Processor components of the Cell BE (7)
Instruction issue at the pipeline stage IS2
• Up to two PowerPC or vector/SIMD multimedia extension instructions per cycle are forwarded from the IS2 pipeline stage for execution to
the VSU (VMX/FPU) issue queue (up to two instr./cycle) or
the BRU, LSU and FXU execution units (up to one instr./cycle per execution unit).
• Any issue combinations are allowed, except two instructions to the same unit, with a few restrictions (see Figure 4.2.11 for the valid issue combinations). Note that valid resp. invalid issue combinations result from the underlying microarchitecture, as shown in Figure 4.2.13.
• Instructions are issued in each cycle from the same thread.
• Instruction issue can be stalled at the IS2 pipeline stage for various reasons, like invalid issue combinations or a full VSU issue queue.
Instruction issue from the VSU (VMX/FPU) issue queue
Up to two VMX or FPU instructions are forwarded to the respective execution units. Note that instructions kept in the issue queue are already prearranged for execution, i.e. they obey the issue restrictions summarized in Figure 4.2.11.
Processor components of the Cell BE (8)
Figure 4.2.11: Valid issue combinations (designated as pink squares) [4.2.2.4]
Type 1 instructions: VXU simple, VXU complex, VXU FP and FPU arithmetic instructions,
Type 2 instructions: VXU load, VXU store, VXU permute, FPU load and FPU store instructions.
Processor components of the Cell BE (9)
Figure 4.2.12: Pipeline stages of the PPE [4.2.2.3]
Processor components of the Cell BE (10)
• Four 16 byte data rings, supporting multiple transfers
• 96B/cycle peak bandwidth
• Over 100 outstanding requests
• 300+ GByte/sec @ 3.2 GHz
EIB data ring for internal communication [4.2.2.2]
Processor components of the Cell BE (11)
SPE [4.2.2.2]
SPE
Main Components
a) SPU (Synergistic Processing Unit)
b) MFC (Memory Flow Controller)
c) LS (Local Store)
d) AUC (Atomic Unit and Cache)
• SPEs are not intended to run an OS
• SPEs optimized for data-rich operation
• are allocated by the PPE
Processor components of the Cell BE (12)
a) SPU
• Dual-issue superscalar RISC core supporting basically a 128-bit SIMD ISA.
• The SIMD ISA provides FX, FP and logical operations on 2x64-bit, 4x32-bit, 8x16-bit,
16x8-bit and 128x1-bit data.
• In connection with the MFC the SPU also supports a set of commands for
- performing DMA transfers,
- interprocessor messaging and
- synchronization.
• The SPU executes instructions from the LS (256 KB),
• Instructions reference data from the 128x128-bit unified register file,
• The Register file fetches/delivers data from/to the LS by L/S instructions,
• The SPU moves instructions and data between the main memory and the local store by requesting DMA transfers from its MFC. (Up to 16 outstanding DMA requests allowed).
Processor components of the Cell BE (13)
SPU, LS
MFC
Even pipe Odd pipe
Figure 4.2.13: Block diagram of the SPU [4.2.2.3]
Processor components of the Cell BE (14)
• Instruction issue unit – instruction line buffer
Fetches 32 instructions per LS request from the LS into the Instruction line buffer.
Main components of the SPU
• Register file Unified Register file of 128 registers each 128-bit wide.
• Result forwarding and staging
Instructions are staged in an operand staging network for up to 6 additional cycles, so that all execution units write their results into the Register file in the same pipeline stage. (See Figure ccc.)
Instruction fetching is supported by hardware prefetching. Prefetching requires 15 cycles to fill the instruction line buffer. Fetched instructions are decoded and issued (up to two instructions per cycle) according to the issue rules.
Processor components of the Cell BE (15)
The odd pipeline includes
o the Channel unit,
o the Branch unit,
o the Load/Store unit and
o the Permute unit.
• Execution units
Execution units are organised into two pipelines.
The even pipeline includes
o the Fixed-point unit and
o the Floating-point unit.
Processor components of the Cell BE (16)
Instruction issue
• The SPU issues up to two instructions per cycle from a two-instruction wide issue window, called the fetch group.
• Fetch groups are aligned to doubleword boundaries, i.e. the first instruction is at an even and the second one at an odd word address. (Words are 4 bytes long.)
• An instruction becomes issueable when no register dependencies or resource conflicts, e.g. busy execution units, exist.
• Instructions are issued in program order, that is
- if the first instruction of a fetch group can be issued to the even pipeline and the second instruction to the odd pipeline, both instructions are issued in the same cycle,
- in all other cases instruction issue needs two cycles such that instructions are issued in program order to the pertaining pipeline (see Figure 4.2.14).
• Register or resource conflicts stall instruction issue.
• A new fetch group is loaded after both instructions of the current fetch group are issued.
Basic operation of the SPU
Processor components of the Cell BE (17)
Figure 4.2.14: Instruction issue example [4.2.2.4]
(Assuming that instruction issue is not constrained by register or resource conflicts)
Processor components of the Cell BE (18)
Figure 4.2.15: The channel interface between the SPU and the MFC [4.2.2.4]
SPU channels
• An SPU communicates with its associated MFC as well as (via its MFC) with the PPE,
other SPEs and devices (such as a decrementer) through its channels.
MMIO: Memory-Mapped I/O Registers
SLC: SPU Load and Store Unit
SSC: SPU Channel and DMA Unit
Processor components of the Cell BE (19)
• SPU channels are unidirectional interfaces for
sending commands (such as DMA commands) to the MFC owned by the SPU, or
sending/receiving up to 32-bit long messages between the SPU and the PPE or other SPEs.
• Each channel has
- a corresponding capacity (maximum message entries) and
- count (remaining available message entries).
The channel count
- decrements whenever a channel instruction (rdch or wrch) is issued, and
- increments whenever an operation associated with the channel completes.
The channel count of „0” means
- empty for read only channels and
- full for write only channels.
• SPU channels are implemented in and managed by the MFC.
Processor components of the Cell BE (20)
Figure 4.2.16: Assembler instruction mnemonics and their corresponding C-language intrinsics of the channel instructions available for the SPU [4.2.2.4]
• The SPU can read or write its channels by three instructions:
the read channel (rdch),
write channel (wrch) and
read channel count (rchcnt)
instructions.
(Intrinsics represent in-line assembly code segments in the form of C-language function calls).
Processor components of the Cell BE (21)
• The channel instructions, or the DMA commands invoked by channel instructions, are enqueued for execution in the MFC for purposes like
initiating DMA transfers between the SPE’s LS and the main storage,
querying DMA and SPU status,
sending or receiving up to 32-bit long mailbox messages primarily between
the SPU and the PPE or
sending or receiving up to 32-bit long signal-notification messages
between the SPU and the PPE or other SPEs.
• The PPE and other devices in the system including other SPEs, can also access the channels through the MFC’s memory mapped I/O (MMIO) registers and queues, which are visible to software in the main storage space.
Processor components of the Cell BE (22)
Figure 4.2.17: SPE channels and associated MMIO registers (1) [4.2.2.4]
Processor components of the Cell BE (23)
Figure 4.2.18: SPE channels
and associated MMIO registers (2) [4.2.2.4]
Processor components of the Cell BE (24)
Figure 4.2.19: Pipeline stages of the SPUs [4.2.2.1]
Processor components of the Cell BE (25)
Processor components of the Cell BE (26)
b) Memory Flow Controller (MFC) [4.2.2.2]
• acts as a specialized co-processor for its associated SPU by autonomously executing its own command set and
• serves as the SPU’s interface, via the EIB to main storage and other processor elements, such as other SPEs or system devices.
Figure 4.2.20: Block diagram of the MFC [4.2.2.4]
MMIO: Memory-Mapped I/O Registers
SLC: SPU Load and Store Unit
SSC: SPU Channel and DMA Unit
Processor components of the Cell BE (27)
The MFC as a specialized co-processor
• DMA commands
• DMA List commands and
• synchronization commands.
• DMA commands (put, get)
• can be initiated by both the PPE and the SPU,
• move up to 16 KByte of data between the LS and the main storage,
• supports transfer sizes of 1, 2, 4, 8, 16 bytes and multiples of 16 bytes,
• access main store by using main storage effective addresses,
• can be tagged with a 5-bit tag (tag group ID) to allow special handling within the tag group, such as to enforce ordering of the DMA commands.
• DMA list commands (put, get commands with the command modifier l)
• can be initiated only by the SPU,
• consist of up to 2 K 8-byte long list elements,
• each list element specifies a DMA transfer
• used to move data between a contiguous area in the LS and a possibly noncontiguous area in the effective address space, implementing scatter-gather functions between main storage and the LS.
It executes three types of commands
Processor components of the Cell BE (28)
• Synchronization commands
• used basically to control the order of storage accesses,
• include atomic commands (a form of semaphores), send signal commands and barrier commands.
• The MFC maintains two separate command queues
- the 16-entry SPU command queue for commands from the SPU associated with the MFC, and
- the 8-entry proxy command queue for commands from the PPE, other SPEs and devices.
Operation of the MFC
• The MFC supports out-of-order execution of DMA commands.
Processor components of the Cell BE (29)
• supports storage protection on the main storage side while performing DMA transfers,
• maintains synchronization between main storage and the LS,
• performs intercore communication functions, such as mailbox and signal-notification messaging with the PPE, other SPEs and devices.
The MFC as the interface between the SPU and the main storage, the PPE and other devices
Processor components of the Cell BE (30)
Intercore communication tools of the MFC
• three mailboxes, primarily intended for holding up to 32-bit long messages from/to the SPE:
- one four-deep mailbox for receiving mailbox messages and
- two one-deep mailboxes for sending mailbox messages.
• two signal notification channels for receiving signals sent basically by the PPE.
Processor components of the Cell BE (31)
Figure 4.2.21: Contrasting mailboxes and signals [4.2.2.4]
Processor components of the Cell BE (32)
c) Local Store [4.2.2.2]
SPE
Processor components of the Cell BE (33)
• Single-port SRAM cell.
• Executes DMA reads/writes and instruction prefetches via 128-Byte wide read/write ports
• Executes instruction fetches and load/stores via 128-bit read/write ports.
• Asynchronous, coherent DMA commands are used to move instructions and data between the local store and system memory.
• DMA transfers between the LS and the main storage are executed by the MFC’s DMA unit
• A 128-Byte long DMA read or write requires 16 processor cycles to forward data on the EIB.
d) The Atomic Update and Cache unit [4.2.2.2]
SPE
Processor components of the Cell BE (34)
The Atomic Unit
• executes atomic operations (a form of mutual-exclusion (mutex) operations) invoked by the MFC,
• supports Page Table lookups and
• maintains cache coherency
by supporting snoop operations.
The Atomic Cache
six 128-byte cache lines of data to support atomic operations and Page Table accesses.
Broadband Interface Controller (BIC) [4.2.2.2]
• Provides a wide connection to external devices
• Two configurable interfaces (50+GB/s @ 5Gbps)
Configurable number of bytes
Coherent (BIF) and/or I/O (IOIFx) protocols
• Supports two virtual channels per interface
• Supports multiple system configurations
Memory Interface Controller (MIC)
• Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
• ECC support
• Suspend to DRAM support
Processor components of the Cell BE (35)
Thread scheduling
depends both on
• thread states
• thread priorities
• single threaded or dual threaded mode of execution
Multithreading the PPE (1)
Scheduling of PPE threads
1. Thread states
• Privilege states
• Suspended/enabled state
• Blocked/not blocked state
a) Privilege States
• Hypervisor state
• Supervisor state
• Problem state (user state)
• most privileged
• allows running a meta-OS that manages logical partitions in which multiple OS instances can run
• some system operations require the initiating thread to be in hypervisor state
• is the state in which an OS instance is intended to run
• is the state in which an application is intended to run
Multithreading the PPE (2)
Figure 4.2.22: Bits of the Machine State Register (MSR) defining the privilege state of a thread [4.2.2.4]
(HV: Hypervisor, PR: Problem)
Multithreading the PPE (3)
b) Suspended/enabled State
• a thread in the hypervisor state can change its state from enabled to suspended.
• Two bits of the Control Register (CTRL[TE0], [TE1]) define whether a thread is in the suspended or enabled state.
c) Blocked/stalled State
• Blocking
- occurs at the instruction dispatch stage if the thread selection rule favours the other thread, or due to a special „nop” instruction,
- stops only one of the two threads.
• Stalling
- occurs at the instruction issue stage due to dependencies
- stops both threads.
- for very long latency conditions, such as L1 cache misses or divide instructions, stalling both threads is avoided by
-- flushing the instructions younger than the stalled instruction,
-- refetching the instructions starting with the stalled instruction, and
-- stalling the thread at the dispatch stage until the stall condition is removed, while the other thread can continue to dispatch.
Multithreading the PPE (4)
2. Thread priorities
• determines dispatch priority
• four priority levels:
thread disabled
low priority
medium priority
high priority
• priority levels are specified by a 2-bit field (TP field) of the TSRL register (Thread Status Register Local)
• Software, in particular OS software, sets thread priorities (according to the throughput requirements of the programs running in the threads.) E.g. a foreground/background thread priority scheme can be set, to favor one thread over the other when allocating instruction dispatch slots.
• A thread must be in the hypervisor or supervisor state to set its priority to high.
Multithreading the PPE (5)
Figure 4.2.23: Usual thread priority combinations [4.2.2.4]
The combination high priority thread/low priority thread is not expected to be used, as in this case the PPE would never dispatch instructions from the low priority thread unless the high priority thread was unable to dispatch.
Usual thread priority combinations
Multithreading the PPE (6)
Figure 4.2.24: Thread scheduling when both priorities are set to medium [4.2.2.4]
• The PPE attempts to utilize all available dispatch slots.
• Thread scheduling is fair (round robin scheduling).
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Note:The same scheduling applies when both threads are set to high priority.
Basic scheduling rules
Example (1): Scheduling in case of the medium priority/medium priority setting
Multithreading the PPE (7)
• The PPE attempts to utilize most available dispatch slots for the medium priority thread (this setting is appropriate to run a low-priority program in the background).
• Assuming a duty cycle of 5 (TSRL[DISP_COUNT] = 5), instructions from thread 1 are dispatched on four out of five cycles, while instructions from thread 0 are dispatched only on one out of five cycles.
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
Basic scheduling rules
Figure 4.2.25: Thread scheduling when one thread runs at medium priority while the other at low priority [4.2.2.4]
Example (2): Scheduling in case of the low priority/medium priority setting
Multithreading the PPE (8)
Basic scheduling rules
• Assuming a duty cycle of 5, both threads are scheduled only once every 5 cycles.
• The PPE attempts to dispatch only once every duty cycle (TSCR[DISP_COUNT]) cycles. (With high values of DISP_COUNT the PPE will mostly idle, which will reduce power consumption and heat production while keeping both threads alive.)
• If a thread under consideration is unable to dispatch an instruction in a given slot, the other thread will be allowed to dispatch even if it was selected for dispatch on the previous attempt.
• Thread scheduling is fair (round robin scheduling)
Figure 4.2.26: Thread scheduling when both priorities are set to low [4.2.2.4]
Example (3): Scheduling in case of the low priority/low priority setting
Multithreading the PPE (9)
3. Single threaded/dual threaded mode of execution
• In single threaded mode all resources are allocated to a single thread, which reduces the turnaround time of the thread.
• Software can change the operating mode of the PPE between single threaded and dual threaded mode only in the hypervisor state.
Multithreading the PPE (10)
Software controlled thread behaviour
Software can use various schemes to control thread behaviour, including
• enabling and suspending a thread,
• setting thread priorities to control the instruction dispatch policy,
• executing a special nop instruction to cause temporary dispatch blocking,
• switching the state of the PPE between single threaded and multithreaded mode.
Multithreading the PPE (11)
• Duplicated architectural states for
32 GPRs
32 FPRs
32 Vector Registers (VRs)
Condition Register (CR)
Count Register (CTR)
Link Register (LR)
FX Exception Register (XER)
FP Status and Control Register (FPSCR)
Vector Status and Control Register (VSCR)
Decrementer (DEC)
Multithreading the PPE (12)
Core enhancements for multithreading
• Duplicated microarchitectural states for
Branch History Table (BHT) with global branch history
(to allow independent and simultaneous branch prediction for both threads)
Internal registers associated with exceptions and interrupt handling, such as
Machine State Register (MSR)
Machine Status Save/Restore Registers (SRR0, SRR1)
Hypervisor Machine Status Save/Restore Registers (HSRR0, HSRR1)
FP Status and Control Register (FPSCR) etc.
(to allow concurrent exception and interrupt handling)
Multithreading the PPE (13)
• Duplicated queues and arrays
Segment lookaside buffer (SLB)
Instruction buffer queue (Ibuf) (to allow each thread to dispatch regardless of any dispatch stall in the other thread)
Link stack queue
The instruction fetch control (because the I$ has only one read port and so fetching must alternate between threads every cycle).
• Shared resources
Hardware execution units
Virtual memory mapping (as both threads always execute in the same logical partitioning context)
Most large arrays and queues, such as caches, that consume a significant amount of chip area
Multithreading the PPE (14)
• Application specific SPU accelerators,
• Multi-stage pipeline SPU configuration or
• Parallel-stages SPU configuration
Basic SPU configurations
assumes the choice of an appropriate SPU configuration
Multithreading the PPE (15)
The programming model
Application specific SPU accelerators [4.2.2.5]
Multithreading the PPE (16)
Multi-stage SPU pipeline configuration [4.2.2.5]
Multithreading the PPE (17)
Parallel-stages SPU configuration [4.2.2.5]
Programming models (1)
• Programmer writes/uses SPU „libraries” either for
Application specific SPU accelerators,
Multi-stage pipeline SPU configuration or
Parallel-stages SPU configuration.
Basic approach for creating an application
• Programmer chooses the appropriate SPU configuration according to the features of an application, such as
Graphics processing
Audio processing
MPEG Encoding/Decoding
Encryption/Decryption
• Main application in PPE, invokes SPU bound services by
creating SPU threads,
RPC-like function calls,
I/O device-like interfaces (FIFO/command queue)
• One or more SPUs cooperate in the presumed SPU configuration to execute the required tasks.
Programming models (2)
• Acceleration
provided by OS or application libraries
• Application portability
maintained with platform specific libraries
Programming models (3)
Programming models (4)
Aim
• showing the cooperation between PPE and SPE
Program
• Actual goal: To calculate distance travelled in a car
• It asks for: elapsed time and speed
Program structure
• There are two program codes, one for the PPE and one for the SPE.
• The PPE handles the user input, then it calls the SPE executable, which calculates the distance and returns with the result.
• The result is then given to the user by the PPE.
Example
PPE Main Store
SPE 1
SPE n
notifying SPE of work to be done
(create_spu_thread)
Loading program
and data to main store
Local Store 1
Local Store n
copying data from MS to LS
(mfc_get)
Accessing data
Example
Programming models (5)
PPE Main Store
SPE 1
SPE n
Execution of the SPEthread
Local Store 1
Local Store n
Programming models (6)
PPE Main Store
SPE 1
SPE n
SPE notifies PPE„job is finished”
by sending a message
Loading results
from main store
Local Store 1
Local Store n
copying data from LS to MS
(mfc_put)
Updating
results
Programming models (7)
#include <stdio.h>
#include <libspe.h>

extern spe_program_handle_t calculate_distance_handle;

typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main() {
    program_data pd __attribute__((aligned(16)));  //aligned for transfer

    printf("Enter the speed in miles/hr: ");
    scanf("%f", &pd.speed);
    printf("Enter the number of hours you have been driving: ");
    scanf("%f", &pd.num_hours);

    speid_t spe_id = spe_create_thread(0, &calculate_distance_handle, &pd,
                                       NULL, -1, 0);
    spe_wait(spe_id, NULL, 0);

    printf("The distance travelled is %f miles.\n", pd.distance);
    return 0;
}
External SPE program (next slide)
Define the data structure passed to the SPE task
Data input
Create the thread and wait for it to finish
Data output
Programming models (8)
#include <spu_mfcio.h>

typedef struct {
    float speed;      //input parameter
    float num_hours;  //input parameter
    float distance;   //output parameter
    float padding;    //pad the struct to a multiple of 16 bytes
} program_data;

int main(unsigned long long spe_id, unsigned long long program_data_ea,
         unsigned long long env) {
    program_data pd __attribute__((aligned(16)));
    int tag_id = 0;

    mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();

    pd.distance = pd.speed * pd.num_hours;

    mfc_put(&pd, program_data_ea, sizeof(program_data), tag_id, 0, 0);
    mfc_write_tag_mask(1 << tag_id);
    mfc_read_tag_status_any();
    return 0;
}
Define the data structure to communicate with the SPE
Copy data from MS to LS; wait for completion
Calculate the result
Copy data from LS to MS; wait for completion
Programming models (9)
Implementation of the Cell BE (1)
Figure 4.2.27: Cell system configuration options [4.2.2.3]
Implementation alternatives
Source: Brochard L., „A Cell History,” Cell Workshop, April 2006, http://www.irisa.fr/orap/Constructeurs/Cell/Cell%20Short%20Intro%20Luigi.pdf
Figure: Cell BE Blade Roadmap
Implementation of the Cell BE (2)
Implementation of the Cell BE (3)
Figure 4.2.28: Motherboard of the Cell Blade (QS20) [4.2.2.5]
Motherboard of the Cell Blade (QS20)
Source: Hofstee H. P., „Real-time Supercomputing and Technology for Games and Entertainment,” 2006, http://www.cercs.gatech.edu/docs/SC06_Cell_111606.pdf
Figure: Roadmap of the Cell BE
Cell BE roadmap (1)
References
Cell BE
[4.2.2.1] Gschwind M., „Chip Multiprocessing and the Cell BE,” ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[4.2.2.2] Hofstee P., „Tutorial: Hardware and Software Architectures for the CELL BROADBAND ENGINE processor”, IBM Corp., September 2005http://www.crest.gatech.edu/conferences/cases2005/pdf/Cell-tutorial.pdf
[4.2.2.3] Kahle J.A., „Introduction to the Cell multiprocessor”, IBM J. Res & Dev Vol. 49, 2005, pp. 584-604 http://www.research.ibm.com/journal/rd/494/kahle.pdf
[4.2.2.4]: Cell Broadband Engine Programming Handbook Vers. 1.1, Apr. 2007, IBM Corp.
[4.2.2.5] Cell BE Overview, Course code: L1T1H1-02, May 2006, IBM Corp.
4.3 SMT multithreaded processors
4.3.1. Intel Pentium 4
4.3.2. Alpha 21464 (V8)
4.3.3. IBM Power5
Coarse grained MT cores
Fine grained MT cores SMT cores
Thread scheduling in multithreaded cores
4.3. Simultaneously multithreaded processors
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores),
followed by the Northwood core intended for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (1)
Figure 4.3.1: Intel Pentium 4 and the visible processor resources duplicated to support hyperthreading technology. Hyperthreading requires duplication of additional miscellaneous pointers and control logic, but these are too small to be pointed out.
Source: Koufaty D. and Marr D.T., „Hyperthreading Technology in the Netburst Microarchitecture,” IEEE Micro, Vol. 23, No. 2, March-April 2003, pp. 56-65.
4.3.1. Intel Pentium 4 (2)
• Further enhancements to support MT (thread microstate):
• TC-entries (Trace cache) are tagged,
• BHB (Branch History Buffer) is duplicated,
• Global History Table is tagged,
• RAS (Return Address Stack) is duplicated,
• Rename tables are duplicated,
• ROB is tagged.
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores),
followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (3)
Figure 4.3.2: SMT pipeline in Intel’s Pentium 4/HT
Source: Marr D.T. et al., „Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16.
4.3.1. Intel Pentium 4 (4)
Additional die area required for MT: less than 5 %.
Single thread/dual thread modes: To prevent single thread performance degradation,
in single thread mode partitioned resources are recombined.
• Further enhancements to support MT (thread microstate):
• TC-entries (Trace cache) are tagged,
• BHB (Branch History Buffer) is duplicated,
• Global History Table is tagged,
• RAS (Return Address Stack) is duplicated,
• Rename tables are duplicated,
• ROB is tagged.
Intel designates SMT as Hyperthreading (HT)
Introduced in the Northwood based DP- and MP-server cores in 2/2002 and 3/2002 resp.(called the Prestonia and Foster MP cores), followed by the Northwood core for desktops in 11/2002.
Additions for implementing MT:
• Duplicated architectural state, including
• instruction pointer,
• the general purpose regs.,
• the control regs.,
• the APIC (Advanced Programmable Interrupt Controller) regs.,
• some machine state regs.
4.3.1. Intel Pentium 4 (5)
Alpha 21264 Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (1)
Figure 4.3.3: SMT pipeline in the Alpha 21464 (V8)
(Pipeline stages shown: Fetch, Decode/Map, Queue, RegRead, Execute, Dcache/Store Buffer, RegWrite, Retire; with the Icache, Dcache, PC, Register Map and register files.)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com
4.3.2. Alpha 21464 (V8) (2)
Alpha 21264Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing replicated (4 x) thread microstates for:
Register Maps,
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (3)
Figure 4.3.4: SMT pipeline in the Alpha 21464 (V8)
(Pipeline stages shown: Fetch, Decode/Map, Queue, RegRead, Execute, Dcache/Store Buffer, RegWrite, Retire; with the Icache, Dcache, PC, Register Map and register files.)
Source: Mukherjee S., „The Alpha 21364 and 21464 Microprocessors,” http://www.compaq.com
4.3.2. Alpha 21464 (V8) (4)
Alpha 21264Alpha 21464
GPRs  80
FPRs  80
Core enhancements for 4-way multithreading:
• Providing replicated (4 x) thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing replicated (4 x) thread microstates for:
Register Maps,
Additional core area needed for SMT: ~ 6 %
8-way superscalar, scheduled for 2003, but canceled in June 2001 in favour of the Itanium line.
512
Source: Preston R. P. et al., „Design of an 8-wide Superscalar RISC Microprocessor with Simultaneous Multithreading,” Proc. ISSCC, 2002, pp. 334-243.
In 2001 all Alpha intellectual property rights were sold to Intel.
4.3.2. Alpha 21464 (V8) (5)
POWER5 enhancements vs the POWER4:
• on-chip memory control,
4.3.3. IBM POWER5 (1)
Figure 4.3.6: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47.
Fabric Controller
4.3.3. IBM POWER5 (4)
POWER5 enhancements vs the POWER4:
• on-chip memory control,• exclusive L3 cache,
4.3.3. IBM POWER5 (3)
Figure 4.3.5: POWER4 and POWER5 system structures
Source: R. Kalla, B. Sinharoy, J.M. Tendler: IBM Power5 chip: A Dual-core multithreaded Processor, IEEE. Micro, Vol. 24, No.2, March-April 2004, pp. 40-47.
Fabric Controller
4.3.3. IBM POWER5 (2)
Inclusive L3 cache
Exclusive L3 cache
POWER5 enhancements vs the POWER4:
• on-chip memory control,• exclusive L3 cache,• dual threaded.
4.3.3. IBM POWER5 (5)
Figure 4.3.7: Microarchitecture of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (6)
Figure 4.3.8: IBM POWER5 Chip
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (7)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
4.3.3. IBM POWER5 (8)
Figure 4.3.9: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (9)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated architectural states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
4.3.3. IBM POWER5 (10)
Figure 4.3.10: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (11)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
• Providing increased (in fact duplicated) sizes for scarce or sensitive resources, such as:
Instruction Buffer, Store Queue
4.3.3. IBM POWER5 (12)
Figure 4.3.11: SMT pipeline of IBM’s POWER5
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (13)
      POWER4  POWER5
GPRs    80     120
FPRs    72     120
Core enhancements for multithreading:
• Providing duplicated thread states for:
PC, architectural registers (by increasing the sizes of the merged GPR and FPR architectural and rename reg. files):
• Providing duplicated thread microstates for:
Return Address Stack, Group Completion (ROB)
Additional core area needed for SMT: ~ 10 %
• Providing increased (duplicated) size for scarce or sensitive resources, such as:
Instruction Buffer, Store Queue
4.3.3. IBM POWER5 (14)
Unbalanced execution of threads:
(an enhancement of the single mode/dual mode thread execution model)
• Threads have 8 priority levels (0...7) controlled by HW/SW,
• the decode rate of each thread is controlled according to the associated priority.
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
Figure 4.3.12: Unbalanced execution of threads in IBM’s POWER5
4.3.3. IBM POWER5 (15)
Difference inthread priority
Development effort:
• Concept phase: ~ 10 persons / 4 months
• High level design phase: ~ 50 persons / 6 months
• Implementation phase: ~ 200 persons / 12-18 months
Source: Kalla R., „IBM's POWER5 Micro Processor Design and Methodology”, IBM Corporation, 2003
4.3.3. IBM POWER5 (16)