UC Regents Spring 2005 © UCB. CS 152 L22: Advanced Processors III
2005-04-12, John Lazzaro (www.cs.berkeley.edu/~lazzaro)
CS 152 Computer Architecture and Engineering
Lecture 22 – Advanced Processors III
www-inst.eecs.berkeley.edu/~cs152/
TAs: Ted Hong and David Marquardt
Last Time: Dynamic Scheduling and the Reorder Buffer

[Diagram: reorder buffer entries hold Inst #, src1 #/val, src2 #/val, dest #/val; a Store Unit writes to memory, a Load Unit reads from memory, and ALU #1 and ALU #2 execute; results broadcast on the Common Data Bus as <dest #, dest val>.]

Each line holds physical <src1, src2, dest> registers for an instruction, and controls when it executes. The execution engine works on the physical registers, not the architecture registers.
Today: Throughput and multiple threads

Goal: Use multiple instruction streams to improve (1) throughput of machines that run many programs, and (2) execution time of multi-threaded programs.
Difficulties: Gaining full advantage requires rewriting applications, OS, and libraries.
Ultimate limiters: Amdahl's law (application dependent) and memory system performance.
Example: Sun Niagara (32 instruction streams on a chip).
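Amdahl's law bounds these gains. As a quick worked example (the 90%-parallel fraction below is illustrative, not from any measured workload):

```python
def amdahl_speedup(parallel_fraction, n_streams):
    """Overall speedup when only part of a program parallelizes (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_streams)

# Even with 32 streams (as on Niagara), a 90%-parallel program speeds up
# far less than 32x; the serial 10% dominates as n_streams grows.
print(round(amdahl_speedup(0.9, 32), 2))     # 7.8
print(round(amdahl_speedup(0.9, 10**9), 2))  # 10.0 (the asymptotic limit)
```

This is why the slide calls Amdahl's law the ultimate, application-dependent limiter: the serial fraction caps the benefit regardless of how many streams the hardware offers.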
Throughput Computing

Multithreading: Interleave instructions from separate threads on the same hardware. Seen by the OS as several CPUs.
Multi-core: Integrate several processors that (partially) share a memory system on the same chip.
Multi-Threading (static pipelines)
Recall: Bypass network prevents stalls

[Five-stage datapath (ID, EX, MEM, WB): RegFile with read ports rd1/rd2 and write port WE/wd, sign extender, A/B/M/Y pipeline registers, 32-bit ALU, Data Memory with Addr/Din/Dout, and a MemToReg mux in WB; bypass muxes and control logic feed the ALU inputs from later stages.]
Instead of bypass: interleave threads on the pipeline to prevent stalls ...
[Slide: Krste, 6.823 (November 10, 2004), L18-3]
Multithreading
How can we guarantee no dependencies between instructions in a pipeline?
-- One way is to interleave execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe (each instruction passes through F D X M W):

t0: T1: LW r1, 0(r2)
t1: T2: ADD r7, r1, r4
t2: T3: XORI r5, r4, #12
t3: T4: SW 0(r7), r5
t4: T1: LW r5, 12(r1)

The last instruction in a thread always completes writeback before the next instruction in the same thread reads the regfile.
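The timing argument above can be sketched as a small model (a simplified round-robin issue model, not any particular machine):

```python
# Minimal sketch (assumed model): round-robin issue of N threads into a
# 5-stage pipeline (F D X M W), one instruction per cycle, no bypassing.
# With 4 threads, an instruction's writeback finishes before the same
# thread's next instruction reads the regfile, so no stall logic is needed.

STAGES = 5  # F D X M W

def issue_schedule(n_threads, n_cycles):
    """Return, per cycle, which thread issues (simple round-robin)."""
    return [cycle % n_threads for cycle in range(n_cycles)]

def same_thread_gap(n_threads):
    """Cycles between consecutive issues from one thread."""
    return n_threads

def hazard_free(n_threads):
    # Writeback of an instruction happens STAGES - 1 cycles after issue;
    # safe when the gap to the thread's next issue covers that latency.
    return same_thread_gap(n_threads) >= STAGES - 1

print(hazard_free(4))  # True: 4-way interleave hides the regfile use latency
print(hazard_free(2))  # False: 2 threads alone would still need interlocks
```

The model shows why the slide's example uses four threads on a five-stage pipe: fewer threads would reintroduce the dependency window that bypassing or interlocks normally cover.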
Simple Multithreaded Pipeline

Have to carry the thread select down the pipeline to ensure correct state bits are read/written at each pipe stage.
[Datapath: a 2-bit thread select chooses among four PCs and four per-thread register files (GPR banks); the I$, pipeline, and D$ are shared.]

Introduced in 1964 by Seymour Cray: 4 CPUs, each run at 1/4 clock. Many variants ...
Multi-Threading (dynamic scheduling)
Power 4 (predates Power 5 shown Tuesday)
● Load hit store: A younger load that executes before an older store to the same memory location has written its data to the caches must retrieve the data from the SDQ. As loads execute, they check the SRQ to see whether there is any older store to the same memory location with data in the SDQ. If one is found, the data is forwarded from the SDQ rather than from the cache. If the data cannot be forwarded (as is the case if the load and store instructions operate on overlapping memory locations and the load data is not the same as or contained within the store data), the group containing the load instruction is flushed; that is, it and all younger groups are discarded and refetched from the instruction cache. If we can tell that there is an older store instruction that will write to the same memory location but has yet to write its result to the SDQ, the load instruction is rejected and reissued, again waiting for the store instruction to execute.
● Store hit load: If a younger load instruction executes before we have had a chance to recognize that an older store will be writing to the same memory location, the load instruction has received stale data. To guard against this, as a store instruction executes it checks the LRQ; if it finds a younger load that has executed and loaded from memory locations to which the store is writing, the group containing the load instruction and all younger groups are flushed and refetched from the instruction cache. To simplify the logic, all groups following the store are flushed. If the offending load is in the same group as the store instruction, the group is flushed, and all instructions in the group form single-instruction groups.
● Load hit load: Two loads to the same memory location must observe the memory reference order and prevent a store to the memory location from another processor between the intervening loads. If the younger load obtains old data, the older load must not obtain new data. This requirement is called sequential load consistency. To guard against this, LRQ entries for all loads include a bit which, if set, indicates that a snoop has occurred to the line containing the loaded data for that entry. When a load instruction executes, it compares its load address against all addresses in the LRQ. A match against a younger entry which has been snooped indicates that a sequential load consistency problem exists. To simplify the logic, all groups following the older load instruction are flushed. If both load instructions are in the same group, the flush request is for the group itself. In this case, each instruction in the group when refetched forms a single-instruction group in order to avoid this situation the second time around.
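The load-hit-store check above amounts to searching a store queue for the youngest older store covering the load's bytes. A minimal sketch under simplifying assumptions (byte-granularity addresses, an in-order list standing in for the SRQ/SDQ; names here are hypothetical):

```python
# Minimal sketch: a store queue of (addr, size, data) entries in program
# order. A load forwards when its bytes are fully contained in one older
# store's data; a partial overlap means the load must be flushed or
# rejected; no match means the load reads the data cache.

def lookup_store_queue(store_queue, load_addr, load_size):
    """Search older stores, youngest first. Returns ('forward', data),
    ('flush', None) on partial overlap, or ('cache', None) if no match."""
    for addr, size, data in reversed(store_queue):
        s_lo, s_hi = addr, addr + size
        l_lo, l_hi = load_addr, load_addr + load_size
        if l_hi <= s_lo or s_hi <= l_lo:
            continue  # disjoint: keep searching older stores
        if s_lo <= l_lo and l_hi <= s_hi:
            # load fully contained in store data: forward the bytes
            off = l_lo - s_lo
            return ('forward', data[off:off + load_size])
        return ('flush', None)  # partial overlap: cannot forward
    return ('cache', None)      # no older store: read the data cache

sq = [(0x100, 4, b'\xde\xad\xbe\xef')]
print(lookup_store_queue(sq, 0x102, 2))  # ('forward', b'\xbe\xef')
print(lookup_store_queue(sq, 0x103, 4))  # ('flush', None)
print(lookup_store_queue(sq, 0x200, 4))  # ('cache', None)
```

The 'flush' case corresponds to the text's group flush when load data is not the same as or contained within the store data.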
Instruction execution pipeline

Figure 4 shows the POWER4 instruction execution pipeline for the various pipelines. The IF, IC, and BP cycles correspond to the instruction-fetching and branch-prediction cycles. The D0 through GD cycles are the cycles during which instruction decode and group formation occur. The MP cycle is the mapper cycle, in which all dependencies are determined, resources assigned, and the group dispatched into the appropriate issue queues. During the ISS cycle, the IOP is issued to the appropriate execution unit, reads the appropriate

[Figure 4: POWER4 instruction execution pipeline. Stage sequence IF IC BP, D0 D1 D2 D3 Xfer GD, MP ISS RF, then per-pipeline EX (FX), EA DC (LD/ST), F6 (FP), and BR stages, followed by Fmt, WB, Xfer, CP; annotations mark branch redirects, instruction fetch, out-of-order processing, instruction crack and group formation, and interrupts and flushes.]

IBM J. RES. & DEV., VOL. 46, NO. 1, JANUARY 2002, J. M. TENDLER ET AL.
Single-threaded predecessor to Power 5. 8 execution units in the out-of-order engine; each may issue an instruction each cycle.
For most apps, most execution units lie idle
[Figure: stacked-bar chart, one bar per application (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv) plus a composite bar; y-axis is percent of total issue cycles, 0 to 100, decomposed into processor busy, itlb miss, dtlb miss, icache miss, dcache miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, and memory conflict.]

Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all others represent wasted issue slots.
such as an I tlb miss and an I cache miss, the wasted cycles are
divided up appropriately. Table 3 specifies all possible sources
of wasted cycles in our model, and some of the latency-hiding or
latency-reducing techniques that might apply to them. Previous
work [32, 5, 18], in contrast, quantified some of these same effects
by removing barriers to parallelism and measuring the resulting
increases in performance.
Our results, shown in Figure 2, demonstrate that the functional
units of our wide superscalar processor are highly underutilized.
From the composite results bar on the far right, we see a utilization
of only 19% (the “processor busy” component of the composite bar
of Figure 2), which represents an average execution of less than 1.5
instructions per cycle on our 8-issue machine.
These results also indicate that there is no dominant source of
wasted issue bandwidth. Although there are dominant items in
individual applications (e.g., mdljsp2, swm, fpppp), the dominant
cause is different in each case. In the composite results we see that
the largest cause (short FP dependences) is responsible for 37% of
the issue bandwidth, but there are six other causes that account for
at least 4.5% of wasted cycles. Even completely eliminating any
one factor will not necessarily improve performance to the degree
that this graph might imply, because many of the causes overlap.
Not only is there no dominant cause of wasted cycles — there
appears to be no dominant solution. It is thus unlikely that any single
latency-tolerating technique will produce a dramatic increase in the
performance of these programs if it only attacks specific types of
latencies. Instruction scheduling targets several important segments
of the wasted issue bandwidth, but we expect that our compiler
has already achieved most of the available gains in that regard.
Current trends have been to devote increasingly larger amounts of
on-chip area to caches, yet even if memory latencies are completely
eliminated, we cannot achieve 40% utilization of this processor. If
specific latency-hiding techniques are limited, then any dramatic
increase in parallelism needs to come from a general latency-hiding
solution, of which multithreading is an example. The different types
of multithreading have the potential to hide all sources of latency,
but to different degrees.
This becomes clearer if we classify wasted cycles as either vertical
From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.
For an 8-way superscalar. Observation: Most hardware in an out-of-order CPU concerns physical registers. Could several instruction threads share this hardware?
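As a sanity check on the excerpt's numbers: 19% utilization of an 8-issue machine works out to about 1.5 instructions per cycle.

```python
# Checking the paper's arithmetic (numbers taken from the excerpt above):
# 19% of issue slots busy on an 8-issue machine.
issue_width = 8
utilization = 0.19
ipc = issue_width * utilization
print(ipc)  # 1.52, i.e. roughly 1.5 instructions per cycle
```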
Simultaneous Multi-threading ...

[Issue-slot diagram over cycles 1-9 for 8 units (M M FX FX FP FP BR CC): with one thread, many unit slots in each cycle go unused; with two threads, slots one thread leaves empty are filled by the other.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
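The slot-filling idea in the diagram can be sketched as a greedy per-cycle issue model (an assumed simplification: fixed thread priority, no structural constraints beyond unit kind):

```python
# Minimal sketch: each cycle, an SMT core fills its issue slots greedily
# from the ready instructions of several threads, so slots one thread
# leaves empty can be used by another.

UNITS = ['M', 'M', 'FX', 'FX', 'FP', 'FP', 'BR', 'CC']

def issue_cycle(ready):
    """ready: {thread id: [unit kinds its ready instructions need]}.
    Returns the per-slot assignment (thread id or None) for one cycle."""
    pools = {tid: list(insts) for tid, insts in ready.items()}
    slots = []
    for unit in UNITS:
        owner = None
        for tid in sorted(pools):          # simple fixed thread priority
            if unit in pools[tid]:
                pools[tid].remove(unit)    # consume one matching instruction
                owner = tid
                break
        slots.append(owner)
    return slots

one = issue_cycle({0: ['M', 'FX', 'FX']})
two = issue_cycle({0: ['M', 'FX', 'FX'], 1: ['M', 'FP', 'BR', 'CC']})
print(sum(s is not None for s in one))  # 3 of 8 slots used
print(sum(s is not None for s in two))  # 7 of 8 slots used
```

With one thread, utilization is limited by that thread's instruction mix; adding a second thread fills most of the remaining slots, which is exactly the effect the two-panel diagram shows.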
The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For
MARCH–APRIL 2004
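The third BHT described above acts as a selector between the two prediction mechanisms. A minimal sketch of such a tournament-style selector, with hypothetical 2-bit saturating counters (the table sizes and update rule are assumptions, not the Power5's):

```python
# Minimal sketch: a selector table choosing between two direction
# predictors, trained toward whichever predictor was right, in the spirit
# of the Power5's third BHT. Counters are 2-bit: >= 2 means taken
# (for the predictors) or "trust predictor B" (for the selector).

def predict(selector, pred_a, pred_b, index):
    use_b = selector[index] >= 2
    return (pred_b if use_b else pred_a)[index] >= 2  # True = taken

def update_selector(selector, pred_a, pred_b, index, taken):
    """Train the selector toward whichever predictor was right."""
    a_right = (pred_a[index] >= 2) == taken
    b_right = (pred_b[index] >= 2) == taken
    if b_right and not a_right:
        selector[index] = min(3, selector[index] + 1)
    elif a_right and not b_right:
        selector[index] = max(0, selector[index] - 1)

sel, a, b = [0] * 16, [3] * 16, [0] * 16   # A says taken, B says not-taken
print(predict(sel, a, b, 5))                # True: selector starts trusting A
update_selector(sel, a, b, 5, taken=False)  # B was right; nudge toward B
update_selector(sel, a, b, 5, taken=False)
print(predict(sel, a, b, 5))                # False: selector now trusts B
```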
[Figure: Power5 pipeline stage sequence IF IC BP, D0 D1 D2 D3 Xfer GD, MP ISS RF, then per-pipeline EX (fixed-point), EA DC (load/store), F6 (floating-point), and branch stages, followed by Fmt, WB, Xfer, CP; annotations mark instruction fetch, group formation and instruction decode, branch redirects, out-of-order processing, and interrupts and flushes.]

Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).
[Figure: Power5 instruction data flow. Shared by two threads: branch prediction (branch history tables, return stack, target cache), instruction cache, instruction translation, group formation and instruction decode, dispatch, shared-register mappers, shared issue queues, shared execution units (FXU0, FXU1, LSU0, LSU1, FPU0, FPU1, BXU, CRL), shared register file read and write, store queue, L2 cache, data cache, data translation, and group completion. Thread 0 and thread 1 resources: program counters (fetch alternates between them), instruction buffers 0 and 1, thread priority, and dynamic instruction selection.]

Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).

Power 4 vs. Power 5: the Power 5 adds 2 fetch paths (PCs) with 2 initial decodes, and 2 commits (architected register sets).
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.
Power 5 thread performance ...
mode. In this mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.
The Power5 supports two types of ST operation: An inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.
When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.

Dynamic power management
In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.
In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.
In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high threshold voltage devices.
The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at
HOT CHIPS 15, IEEE MICRO
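The cycle-by-cycle clock-gating decision described above (gate a local clock buffer when none of the latches it drives will be written next cycle) can be sketched as follows; the class and signal names are hypothetical:

```python
# Minimal sketch (assumed model): a local clock buffer is gated off for
# the next cycle when no latch it drives will be written. Gated latches
# may still be read, matching the behavior described in the text.

class LocalClockBuffer:
    def __init__(self, latch_names):
        self.latches = set(latch_names)
        self.gated = False

    def plan_next_cycle(self, writes_next_cycle):
        """Gate the buffer iff none of its latches is written next cycle."""
        self.gated = self.latches.isdisjoint(writes_next_cycle)
        return self.gated

gpr_read_ports = LocalClockBuffer({'gpr_rd0', 'gpr_rd1'})
# Next cycle only the FPU result latch is written: GPR read ports sit idle.
print(gpr_read_ports.plan_next_cycle({'fpu_result'}))  # True: clock gated
print(gpr_read_ports.plan_next_cycle({'gpr_rd0'}))     # False: keep clocking
```

In the real design this decision is made by dedicated logic per clock buffer, with the timing and power of that logic itself carefully bounded, as the article notes.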
[Figure 5: Effects of thread priority on performance. X-axis: (thread 0 priority, thread 1 priority) pairs, ranging from power-save mode through balanced settings such as 7,7 / 6,6 / 5,5 / 4,4 / 3,3 / 2,2 to skewed settings such as 0,7 and 7,0 (single-thread mode); y-axis: instructions per cycle (IPC) for thread 0 and thread 1.]

Relative priority of each thread is controllable in hardware.
For balanced operation, both threads run slower than if they "owned" the machine.
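One way to think about the priority knob is as dividing dispatch opportunities between the threads. The proportional rule below is an assumed illustration, not the Power5's actual allocation logic:

```python
# Minimal sketch (assumed model): grant each thread a fraction of
# dispatch cycles proportional to its software-set priority. A balanced
# setting splits the machine; a maximally skewed setting approximates
# single-thread mode.

def dispatch_share(prio0, prio1):
    """Fraction of dispatch cycles granted to each thread."""
    total = prio0 + prio1
    if total == 0:
        return (0.0, 0.0)          # both idle: power-save-like behavior
    return (prio0 / total, prio1 / total)

print(dispatch_share(7, 7))  # (0.5, 0.5): balanced, both run slower
print(dispatch_share(7, 2))  # thread 0 gets the bulk of the machine
print(dispatch_share(7, 0))  # (1.0, 0.0): effectively single-thread mode
```

This captures the qualitative shape of Figure 5: at balanced priorities each thread gets roughly half the machine, and skewing the priorities shifts IPC from one thread to the other.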
Reminder: Friday DRAM controller checkoff
[Diagram: test vectors drive the DRAM controller over the IM and DM buses; the controller connects to the DRAM.]
Run your test vector suite on the Calinx board, display results on LEDs.
Test ongoing transactions on both buses, with randomized start times. Load, store, and verify different data word patterns.
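A test suite of the kind described (randomized start times, load/store/verify of different data patterns on both buses) might be generated along these lines; the harness below is a hypothetical sketch, not the course infrastructure:

```python
# Minimal sketch: generate load/store transactions with randomized start
# times on two buses, then verify that every load returns the most
# recently stored data at its address.

import random

def make_vectors(n, seed=0):
    rng = random.Random(seed)        # fixed seed: reproducible checkoff runs
    mem = {}
    vectors = []
    t = 0
    for _ in range(n):
        t += rng.randint(1, 5)       # randomized start time
        bus = rng.choice(['IM', 'DM'])
        addr = rng.randrange(0, 64) * 4
        if rng.random() < 0.5 or addr not in mem:
            data = rng.getrandbits(32)
            mem[addr] = data
            vectors.append((t, bus, 'store', addr, data))
        else:
            vectors.append((t, bus, 'load', addr, mem[addr]))
    return vectors

def verify(vectors):
    """Replay against a reference memory; loads must return stored data."""
    mem = {}
    for _, _, op, addr, data in vectors:
        if op == 'store':
            mem[addr] = data
        elif mem.get(addr) != data:
            return False
    return True

print(verify(make_vectors(100)))  # True
```

On hardware, the same vectors would be driven onto the buses and the loaded words compared against the expected data, with pass/fail shown on the LEDs.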
Multi-Core
Recall: Superscalar utilization by a thread
For an 8-way superscalar. Observation: In many cases, the on-chip cache and DRAM I/O bandwidth are also underutilized by one CPU. So, let 2 cores share them.
Most of Power 5 die is shared hardware
supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview
Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance.5 Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130 nm lithography, the chip uses eight metal levels and measures 389 mm2. The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core
We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).
[Die photo annotations: Core #1, Core #2; shared components: L2 Cache, L3 Cache Control, DRAM Controller.]
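The article notes that the data's real address selects one of the three identical L2 slices, each with 512 congruence classes of 128-byte lines. A minimal sketch of one possible mapping (the hash below is an assumption for illustration, not IBM's documented function):

```python
# Minimal sketch: map a real address to an (L2 slice, congruence class)
# pair using the article's stated geometry: 3 slices, 512 congruence
# classes per slice, 128-byte lines.

LINE_BYTES = 128
SLICES = 3
CLASSES = 512

def l2_location(real_addr):
    line = real_addr // LINE_BYTES          # cache-line number
    slice_id = line % SLICES                # spread lines across slices
    congruence_class = (line // SLICES) % CLASSES
    return slice_id, congruence_class

print(l2_location(0))        # (0, 0)
print(l2_location(128))      # (1, 0)
print(l2_location(3 * 128))  # (0, 1)
```

Interleaving consecutive lines across slices lets either core drive all three slice controllers concurrently, which is why shared L2 bandwidth scales better than a single monolithic cache would.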
Core-to-core interactions stay on chip
supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance. Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm². The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller. We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram. In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.
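The fetch policy described above can be sketched as a toy model. This is our own illustration (the name `fetch_schedule` is not from the Power5 design): in SMT mode the IF stage alternates between two threads, while in ST mode one thread owns every fetch cycle.

```python
def fetch_schedule(mode, n_cycles):
    """Toy model of Power5 instruction fetch. In SMT mode the IF stage
    alternates between two threads; in ST mode a single thread fetches
    every cycle. In either mode, all instructions fetched in a given
    cycle (up to eight) come from one thread."""
    assert mode in ("SMT", "ST")
    schedule = []
    for cycle in range(n_cycles):
        if mode == "SMT":
            thread = cycle % 2   # alternate: thread 0, thread 1, ...
        else:
            thread = 0           # ST mode: one program counter
        schedule.append((cycle, thread))
    return schedule
```

The model captures why ST mode exists: a lone thread gets every fetch slot instead of half of them.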
(1) Threads on two cores that use shared libraries conserve L2 memory.
(2) Threads on two cores share memory via L2 cache operations. Much faster than 2 CPUs on 2 chips.
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Sun Niagara
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
The case for Sun’s Niagara ...
[Bar chart: for each application (alvinn, doduc, eqntott, espresso, fpppp, hydro2d, li, mdljdp2, mdljsp2, nasa7, ora, su2cor, swm, tomcatv) and the composite, the Percent of Total Issue Cycles (0-100) is broken down into: processor busy, icache miss, dcache miss, itlb miss, dtlb miss, branch misprediction, control hazards, load delays, short integer, long integer, short fp, long fp, and memory conflict.]
Figure 2: Sources of all unused issue cycles in an 8-issue superscalar processor. Processor busy represents the utilized issue slots; all
others represent wasted issue slots.
such as an I tlb miss and an I cache miss, the wasted cycles are
divided up appropriately. Table 3 specifies all possible sources
of wasted cycles in our model, and some of the latency-hiding or
latency-reducing techniques that might apply to them. Previous
work [32, 5, 18], in contrast, quantified some of these same effects
by removing barriers to parallelism and measuring the resulting
increases in performance.
Our results, shown in Figure 2, demonstrate that the functional
units of our wide superscalar processor are highly underutilized.
From the composite results bar on the far right, we see a utilization
of only 19% (the “processor busy” component of the composite bar
of Figure 2), which represents an average execution of less than 1.5
instructions per cycle on our 8-issue machine.
These results also indicate that there is no dominant source of
wasted issue bandwidth. Although there are dominant items in
individual applications (e.g., mdljsp2, swm, fpppp), the dominant
cause is different in each case. In the composite results we see that
the largest cause (short FP dependences) is responsible for 37% of
the issue bandwidth, but there are six other causes that account for
at least 4.5% of wasted cycles. Even completely eliminating any
one factor will not necessarily improve performance to the degree
that this graph might imply, because many of the causes overlap.
Not only is there no dominant cause of wasted cycles — there
appears to be no dominant solution. It is thus unlikely that any single
latency-tolerating technique will produce a dramatic increase in the
performance of these programs if it only attacks specific types of
latencies. Instruction scheduling targets several important segments
of the wasted issue bandwidth, but we expect that our compiler
has already achieved most of the available gains in that regard.
Current trends have been to devote increasingly larger amounts of
on-chip area to caches, yet even if memory latencies are completely
eliminated, we cannot achieve 40% utilization of this processor. If
specific latency-hiding techniques are limited, then any dramatic
increase in parallelism needs to come from a general latency-hiding
solution, of which multithreading is an example. The different types
of multithreading have the potential to hide all sources of latency,
but to different degrees.
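The overlap argument can be made concrete with a back-of-the-envelope bound. The function below is our own illustration, using the composite figures quoted in the text (19% of issue slots busy; short FP dependences wasting 37% of issue bandwidth):

```python
def utilization_bound(busy_pct, waste_by_cause, eliminated):
    """Optimistic upper bound on issue-slot utilization after removing
    some causes of waste: assume every slot those causes wasted becomes
    a busy slot. Because causes overlap, real gains are smaller than
    this bound suggests."""
    reclaimed = sum(waste_by_cause[c] for c in eliminated)
    return min(100.0, busy_pct + reclaimed)

# Composite figures from the text: 19% busy, short FP waste 37%.
# Even the optimistic bound caps utilization at 56% after one fix.
```

Since no single cause dominates, no single fix can push this bound anywhere near full utilization, which is the case the text makes for a general latency-hiding mechanism like multithreading.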
This becomes clearer if we classify wasted cycles as either vertical
For an 8-way superscalar. Observation:
Some apps struggle to reach a CPI <= 1.
For throughput on these apps, a large number of single-issue cores is better than a few superscalars.
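The trade-off reduces to simple arithmetic. In the sketch below, the die-area budget (one 8-issue superscalar equals eight single-issue cores) and the per-core IPC figures are hypothetical, chosen only to illustrate the argument; the 1.5 IPC figure echoes the composite measurement quoted above.

```python
def chip_throughput(n_cores, ipc_per_core):
    """Aggregate instructions per cycle for a chip of identical cores,
    assuming independent threads (a throughput workload) so per-core
    IPC simply adds up."""
    return n_cores * ipc_per_core

# Hypothetical equal-area comparison:
wide = chip_throughput(1, 1.5)    # one 8-issue superscalar at ~1.5 IPC
narrow = chip_throughput(8, 0.7)  # eight single-issue cores, 0.7 IPC each
```

Even at a modest 0.7 IPC per simple core, the many-core chip delivers several times the aggregate throughput of the wide machine on this workload.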
Note: This equation is correct.
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Niagara: 32 threads on one chip
8 cores: single-issue, 6-stage pipeline, 4-way multi-threaded, fast crypto support
Shared resources: 3 MB on-chip cache, 4 DDR2 interfaces (32G DRAM, 20 Gb/s), 1 shared FP unit, GB Ethernet ports
Sources: Hot Chips, via EE Times, Infoworld. J Schwartz weblog (Sun COO)
Die size: 340 mm² in 90 nm. Power: 50-60 W
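Niagara's 4-way multithreading per core hides memory latency on a single-issue pipeline. A standard back-of-the-envelope model (our simplification, not Sun's published analysis) treats each thread as running for a while, then stalling on a miss:

```python
def core_utilization(n_threads, run_cycles, stall_cycles):
    """Fine-grained multithreading model: each thread runs for
    run_cycles, then stalls for stall_cycles (e.g., a cache miss).
    With enough ready threads to cover the stall, the single-issue
    pipeline never idles, so utilization saturates at 1.0."""
    demand = n_threads * run_cycles
    period = run_cycles + stall_cycles
    return min(1.0, demand / period)
```

With hypothetical numbers (a thread runs 25 cycles, then stalls 75), one thread keeps the pipeline only 25% busy, while four threads keep it fully busy, which is exactly the regime Niagara targets.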
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Niagara status: First motherboard runs
Source: J Schwartz weblog (Sun COO)
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Cell: The PS3 chip
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Synergistic Processing Units (SPUs)
8 cores using local memory, not traditional caches
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
One Synergistic Processing Unit (SPU)
256 KB Local Store; 128 128-bit registers.
SPU issues 2 inst/cycle (in order) to 7 execution units.
SPU fills Local Store using DMA to DRAM and network.
Programmers manage caching explicitly.
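Explicitly managed staging into a small local memory is typically written as double buffering: compute on one buffer while filling the other. The sketch below only models the idea in plain Python; `dma_get` is a stand-in for Cell's asynchronous DMA primitives, not the real API.

```python
def process_stream(data, chunk, dma_get, compute):
    """Double-buffered processing in the spirit of an SPU Local Store:
    while compute() works on one buffer, dma_get() fills the other.
    The overlap is only modeled here; a real SPU issues the DMA
    asynchronously and waits for completion before switching buffers."""
    results = []
    n_chunks = (len(data) + chunk - 1) // chunk
    if n_chunks == 0:
        return results
    buffers = [dma_get(data, 0, chunk), None]    # prefetch first chunk
    for i in range(n_chunks):
        if i + 1 < n_chunks:                     # fill the other buffer
            buffers[(i + 1) % 2] = dma_get(data, (i + 1) * chunk, chunk)
        results.append(compute(buffers[i % 2]))  # consume current buffer
    return results
```

The point of the exercise: the overlap that a hardware cache provides implicitly must here be scheduled by the programmer, which is the cost of the SPU's simpler, deterministic local memory.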
UC Regents Spring 2005 © UCBCS 152 L22: Advanced Processors III
Conclusions: Throughput processing
Simultaneous Multithreading: Instruction streams can share an out-of-order engine economically.
Multi-core: Once instruction-level parallelism runs dry, thread-level parallelism is a good use of die area.