Design of Parallel and High-Performance Computing
Fall 2017
Lecture: Cache Coherence & Memory Models
Instructor: Torsten Hoefler & Markus Püschel
TAs: Salvatore Di Girolamo
Motivational video: https://www.youtube.com/watch?v=zJybFF6PqEQ
Sorry for missing last week’s lecture!
2
To talk about some silliness in MPI-3
3
Urban change and climate change: how do we face the challenges?
Wednesday, November 8, 2016, 3:00 pm – 7:30 pm
ETH Zürich, Main Building
Information and registration:
http://www.c2sm.ethz.ch/events/eth-klimarunde-2017.html
A more complete view
6
Architecture Developments
<1999: distributed memory machines communicating through messages
’00-’05: large cache-coherent multicore machines communicating through coherent memory access and messages
’06-’12: large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access
’13-’20: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access
>2020: largely non-coherent accelerators and multicores communicating through remote direct memory access
Sources: various vendors
7
Computer Architecture vs. Physics
Physics (technological constraints)
Cost of data movement
Capacity of DRAM cells
Clock frequencies (constrained by end of Dennard scaling)
Speed of light
Melting point of silicon
Computer Architecture (design of the machine)
Power management
ISA / Multithreading
SIMD widths
“Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.” – Fred Brooks (IBM, 1962)
Computer architects have converted many former “power” problems into “cost” problems.
8
Credit: John Shalf (LBNL)
Low-Power Design Principles (2005)
Cubic power improvement with lower clock rate due to V²F (derivation sketched below)
Slower clock rates enable use of simpler cores
Simpler cores use less area (lower leakage) and reduce cost
Tailor design to application to REDUCE WASTE
9
Credit: John Shalf (LBNL)
Low-Power Design Principles (2005), continued
Power5 (server): 120 W @ 1900 MHz (baseline)
Intel Core2 sc (laptop): 15 W @ 1000 MHz (4x more FLOPs/watt than baseline)
Intel Atom (handhelds): 0.625 W @ 800 MHz (80x more)
GPU core or Tensilica XTensa (embedded): 0.09 W @ 600 MHz (400x more; 80x-120x sustained)
Even if each simple core is 1/4th as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient.
10-11
Credit: John Shalf (LBNL)
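The cubic claim above follows from the standard dynamic-power model (a sketch, assuming supply voltage can be scaled roughly linearly with clock frequency, which holds only within a limited operating range): P_dyn = C · V² · f, and with V ∝ f this gives P_dyn ∝ f³. Halving the clock thus cuts dynamic power by roughly 8x, so two half-speed cores can deliver the original throughput at about a quarter of the power.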
Heterogeneous Future (LOCs and TOCs)
Latency Optimized Core (LOC): big cores (very few); most energy efficient if you don’t have lots of parallelism
Throughput Optimized Core (TOC): tiny cores (0.23 mm × 0.2 mm), lots of them!; most energy efficient if you DO have a lot of parallelism!
12
Credit: John Shalf (LBNL)
Data movement – the wires
Energy efficiency of a copper wire: Power = Frequency × Length / cross-section area
Wire efficiency does not improve as feature size shrinks
Energy efficiency of a transistor: Power = V² × Frequency × Capacitance
Capacitance ≈ area of transistor
Transistor efficiency improves as you shrink it
Net result: moving data on wires is starting to cost more energy than computing on said data (hence the interest in silicon photonics)
Photonics could break through the bandwidth-distance limit
13
Credit: John Shalf (LBNL)
Pin Limits
Moore’s law doesn’t apply to adding pins to a package
30%+ per year nominal Moore’s law
Pins grow at ~1.5-3% per year at best
4000 pins is an aggressive pin package
Half of those would need to be for power and ground
Of the remaining 2k pins, run as differential pairs
Beyond 15 Gbps per pin, power/complexity costs hurt!
10 Gbps × 1k pins is ~1.2 TBytes/sec (arithmetic spelled out below)
2.5D integration gets a boost in pin density
But it’s a one-time boost (how much headroom?)
4 TB/sec? (maybe 8 TB/s with single-wire signaling?)
14
Credit: John Shalf (LBNL)
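The ~1.2 TBytes/sec figure follows from the pin budget above (a sketch, assuming the 2k remaining signal pins form 1k differential links): 1000 links × 10 Gbit/s = 10 Tbit/s = 1.25 TByte/s, i.e. roughly 1.2 TBytes/sec.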
Die Photos (3 classes of cores)
[Embedded paper page:]
A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators
Yunsup Lee*, Andrew Waterman*, Rimas Avizienis*, Henry Cook*, Chen Sun*†, Vladimir Stojanovic*†, Krste Asanovic*
Email: {yunsup, waterman, rimas, hcook, vlada, krste}@eecs.berkeley.edu, sunchen@mit.edu
*Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
†Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Abstract—A 64-bit dual-core RISC-V processor with vector accelerators has been fabricated in a 45nm SOI process. This is the first dual-core processor to implement the open-source RISC-V ISA designed at the University of California, Berkeley. In a standard 40nm process, the RISC-V scalar core scores 10% higher in DMIPS/MHz than the Cortex-A5, ARM’s comparable single-issue in-order scalar core, and is 49% more area-efficient. To demonstrate the extensibility of the RISC-V ISA, we integrate a custom vector accelerator alongside each single-issue in-order scalar core. The vector accelerator is 1.8× more energy-efficient than the IBM Blue Gene/Q processor, and 2.6× more than the IBM Cell processor, both fabricated in the same process. The dual-core RISC-V processor achieves a maximum clock frequency of 1.3GHz at 1.2V and a peak energy efficiency of 16.7 double-precision GFLOPS/W at 0.65V with an area of 3mm².
I. INTRODUCTION
As we approach the end of conventional transistor scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Many proposed accelerators, such as those based on GPU architectures, require a drastic reworking of application software to make use of separate ISAs operating in memory spaces disjoint from the demand-paged virtual memory of the host CPU. RISC-V [1] is a new completely open general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley, which is designed to be flexible and extensible to better integrate new efficient accelerators close to the host cores. The open-source RISC-V software toolchain includes a GCC cross-compiler, an LLVM cross-compiler, a software ISA simulator, an ISA verification suite, a Linux port, and additional documentation, and is available at www.riscv.org.
In this paper, we present a 64-bit dual-core RISC-V processor with custom vector accelerators in a 45nm SOI process. Our RISC-V scalar core achieves 1.72 DMIPS/MHz, outperforming ARM’s Cortex-A5 score of 1.57 DMIPS/MHz by 10% in a smaller footprint. Our custom vector accelerator is 1.8× more energy-efficient than the IBM Blue Gene/Q processor and 2.6× more than the IBM Cell processor for double-precision floating-point operations, demonstrating that high efficiency can be obtained without sacrificing a unified demand-paged virtual memory environment.
II. CHIP ARCHITECTURE
Figure 1 shows the block diagram of the dual-core processor. Each core incorporates a 64-bit single-issue in-order Rocket scalar core, a 64-bit Hwacha vector accelerator, and their associated instruction and data caches, as described below.
Fig. 1. Backside chip micrograph (taken with a removed silicon handle) and processor block diagram. [Dual-core RISC-V vector processor, 3mm × 6mm die, 2.8mm × 1.1mm core area: per core a Rocket scalar core and Hwacha vector accelerator with 16K L1I$, 32K L1D$, 8KB L1VI$, VRF, and an arbiter; Coherence Hub; FPGA FSB/HTIF; two 1MB SRAM arrays]
A. Rocket Scalar Core
Rocket is a 6-stage single-issue in-order pipeline that executes the 64-bit scalar RISC-V ISA (see Figure 2). The scalar datapath is fully bypassed but carefully designed to minimize the impact of long clock-to-output delays of compiler-generated SRAMs in the caches. A 64-entry branch target buffer, 256-entry two-level branch predictor, and return address stack together mitigate the performance impact of control hazards. Rocket implements an MMU that supports page-based virtual memory and is able to boot modern operating systems, including Linux. Both caches are virtually indexed, physically tagged, with parallel TLB lookups. The data cache is non-blocking, allowing the core to exploit memory-level parallelism.
Rocket has an optional IEEE 754-2008-compliant FPU, which can execute single- and double-precision floating-point operations, including fused multiply-add (FMA), with hardware support for denormals and other exceptional values.
Fig. 2. Rocket scalar plus Hwacha vector pipeline diagram. [Rocket pipeline: PC Gen. → I$ Access/ITLB → Inst. Decode/Int.RF → Int.EX → D$ Access/DTLB → Commit, with FP.RF → FP.EX1 → FP.EX2 → FP.EX3; Hwacha pipeline: PC Gen. → VI$ Access/VITLB → VInst. Decode → Sequencer → Expand → Bank1…Bank8 R/W; bypass paths omitted for simplicity]
Slide annotations (die dimensions of the three classes of cores): 8.4 mm × 20 mm, 1.2 mm × 0.5 mm, 0.23 mm × 0.2 mm
15
Credit: John Shalf (LBNL)
Strip down to the core
[Same embedded paper page as above; slide annotations: 4.5 mm × 2.7 mm, 1.2 mm × 0.5 mm, 0.23 mm × 0.2 mm]
16
Credit: John Shalf (LBNL)
Actual Size
[Same embedded paper page as above, shown at actual chip size; slide annotations: 4.5 mm × 2.7 mm, 0.23 mm × 0.2 mm, 0.5 mm × 1.2 mm]
17
Credit: John Shalf (LBNL)
Basic Stats
Core energy/area estimates (tiny core: 0.23 mm × 0.2 mm):
Area: 12.25 mm², Power: 2.5 W, Clock: 2.4 GHz, E/op: 651 pJ
Area: 0.6 mm², Power: 0.3 W (<0.2 W), Clock: 1.3 GHz, E/op: 150 (75) pJ
Area: 0.046 mm², Power: 0.025 W, Clock: 1.0 GHz, E/op: 22 pJ
Wire energy assumptions for 22 nm: 100 fJ/bit per mm, 64-bit operand
Energy: 1 mm ≈ 6 pJ, 20 mm ≈ 120 pJ
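The per-millimeter figure follows directly from the two stated assumptions: 100 fJ/bit/mm × 64 bit = 6.4 pJ/mm ≈ 6 pJ/mm, hence ≈ 120 pJ to move one operand across 20 mm of wire.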
[Background: the same embedded paper page, with annotated dimensions 4.5 mm × 2.7 mm, 0.5 mm × 1.2 mm]
18
Credit: John Shalf (LBNL)
When does data movement dominate?
Core energy/area estimates (from the previous slide) vs. data movement cost:
651 pJ/op core: compute op == data movement energy @ 108 mm; energy ratio for 20 mm: 0.2x
150 (75) pJ/op core: compute op == data movement energy @ 12 mm; energy ratio for 20 mm: 1.6x
22 pJ/op core: compute op == data movement energy @ 3.6 mm; energy ratio for 20 mm: 5.5x
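These break-even distances are each core’s energy per operation divided by the ~6 pJ/mm wire energy from the previous slide (a sketch; the 12 mm row uses the 75 pJ sustained figure): 651/6 ≈ 108 mm, 75/6 ≈ 12 mm, 22/6 ≈ 3.6 mm. The 20 mm ratios follow as 120/651 ≈ 0.2x, 120/75 = 1.6x, and 120/22 ≈ 5.5x.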
[Background: the same embedded paper page, with annotated dimensions 4.5 mm × 2.7 mm, 0.5 mm × 1.2 mm]
19
Credit: John Shalf (LBNL)
DPHPC Overview
20
Goals of this lecture
Memory Trends
Cache Coherence
Memory Consistency
21
Memory – CPU gap widens
Measure processor speed as “throughput”
FLOP/s, IOP/s, …
Moore’s law: ~60% growth per year
Today’s architectures
POWER8: 338 dp GFLOP/s – 230 GB/s memory bandwidth
Broadwell i7-5775C: 883 GFLOP/s – ~50 GB/s memory bandwidth
Trend: memory performance grows only ~10% per year
22
Source: Jack Dongarra
Source: John McCalpin
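A back-of-the-envelope consequence of these numbers (my arithmetic, not from the slide): at 8 bytes per double, POWER8 can load 230/8 ≈ 29 Gdoubles/s against 338 GFLOP/s, so a kernel must perform roughly 12 FLOPs per loaded operand to stay compute-bound; for the i7-5775C the ratio is about 883/(50/8) ≈ 141.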
Issues (AMD Interlagos as Example)
How to measure bandwidth?
Data sheet (often peak performance, may include overheads)
Frequency times bus width: 51 GiB/s
Microbenchmark performance
Stride 1 access (32 MiB): 32 GiB/s
Random access (8 B out of 32 MiB): 241 MiB/s
Why?
Application performance
As observed (performance counters)
Somewhere in between stride 1 and random access
How to measure Latency?
Data sheet (often optimistic, or not provided)
<100ns
Random pointer chase (a minimal code sketch follows after this slide)
110 ns with one core, 258 ns with 32 cores!
23
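A minimal random pointer-chase sketch of the kind such latency numbers come from (illustrative code, not the benchmark actually used above): each load depends on the previous one, so the average time per iteration approximates memory latency once the working set exceeds the caches.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M pointers * 8 B = 32 MiB, well beyond the caches */

int main(void) {
    /* Build a random cyclic permutation: chase[i] holds the next index. */
    size_t *chase = malloc(N * sizeof *chase);
    size_t *perm  = malloc(N * sizeof *perm);
    for (size_t i = 0; i < N; i++) perm[i] = i;
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++) chase[perm[i]] = perm[(i + 1) % N];

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = chase[p];  /* fully serialized loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns (sink: %zu)\n", ns / N, p);
    free(chase); free(perm);
    return 0;
}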
Conjecture: Buffering is a must!
Two most common examples:
Write Buffers
Delayed write back saves memory bandwidth
Data is often overwritten or re-read
Caching
Directory of recently used locations
Stored as blocks (cache lines)
24
Cache Coherence
Different caches may have a copy of the same memory location!
Cache coherence
Manages existence of multiple copies
Cache architectures
Multi level caches
Shared vs. private (partitioned)
Inclusive vs. exclusive
Write back vs. write through
Victim cache to reduce conflict misses
…
25
Exclusive Hierarchical Caches
26
Shared Hierarchical Caches
27
Shared Hierarchical Caches with MT
28
Caching Strategies (repeat)
Remember:
Write Back?
Write Through?
Cache coherence requirements
A memory system is coherent if it guarantees the following:
Write propagation (updates are eventually visible to all readers)
Write serialization (writes to the same location must be observed in order)
Everything else: memory model issues (later)
29
Write Through Cache
30
1. CPU0 reads X from memory: loads X=0 into its cache
2. CPU1 reads X from memory: loads X=0 into its cache
3. CPU0 writes X=1: stores X=1 in its cache and in memory
4. CPU1 reads X from its cache: loads X=0 from its cache
Incoherent value for X on CPU1; CPU1 may wait for the update!
Requires write propagation!
Write Back Cache
31
1. CPU0 reads X from memory: loads X=0 into its cache
2. CPU1 reads X from memory: loads X=0 into its cache
3. CPU0 writes X=1: stores X=1 in its cache
4. CPU1 writes X=2: stores X=2 in its cache
5. CPU1 writes back its cache line: stores X=2 in memory
6. CPU0 writes back its cache line: stores X=1 in memory
The later store X=2 from CPU1 is lost!
Requires write serialization!
A simple (?) example
Assume C99:
Two threads:
Initially: a=b=0
Thread 0: write 1 to a
Thread 1: write 1 to b
Assume non-coherent write back cache
What may end up in main memory?
32
struct twoint {
  int a;
  int b;
};
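A runnable sketch of the scenario (using C11 threads for brevity; the slide assumes C99, where pthreads would play the same role): a and b share one cache line, each thread dirties its own copy of that line, and whichever write-back happens last overwrites the other thread’s update.

#include <stdio.h>
#include <threads.h>

struct twoint { int a; int b; };   /* a and b share one cache line */
struct twoint x = {0, 0};

int writer_a(void *arg) { (void)arg; x.a = 1; return 0; }  /* thread 0 */
int writer_b(void *arg) { (void)arg; x.b = 1; return 0; }  /* thread 1 */

int main(void) {
    thrd_t t0, t1;
    thrd_create(&t0, writer_a, NULL);
    thrd_create(&t1, writer_b, NULL);
    thrd_join(t0, NULL);
    thrd_join(t1, NULL);
    /* With coherent caches the only possible result is a=1, b=1.
       With non-coherent write-back caches, each core writes back its
       whole (stale) line, so main memory keeps only the last write-back:
       either a=1,b=0 or a=0,b=1; one of the two updates is lost. */
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}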
Cache Coherence Protocol
Programmer can hardly deal with unpredictable behavior!
Cache controller maintains data integrity
All writes to different locations are visible
Snooping
Shared bus or (broadcast) network
Directory-based
Record information necessary to maintain coherence:
E.g., owner and state of a line etc.
33
Fundamental Mechanisms
Fundamental CC mechanisms
Snooping
Shared bus or (broadcast) network
Cache controller “snoops” all transactions
Monitors and changes the state of the cache’s data
Works at small scale, challenging at large-scale
E.g., Intel Broadwell
Directory-based
Record information necessary to maintain coherence
E.g., owner and state of a line etc.
Central/Distributed directory for cache line ownership
Scalable but more complex/expensive
E.g., Intel Xeon Phi KNC/KNL
34
Source: Intel
Cache Coherence Parameters
Concerns/Goals
Performance
Implementation cost (chip space, more important: dynamic energy)
Correctness
(Memory model side effects)
Issues
Detection (when does a controller need to act)
Enforcement (how does a controller guarantee coherence)
Precision of block sharing (per block, per sub-block?)
Block size (cache line size?)
35
An Engineering Approach: Empirical start
Problem 1: stale reads
Cache 1 holds value that was already modified in cache 2
Solution:
Disallow this state
Invalidate all remote copies before allowing a write to complete
Problem 2: lost update
Incorrect write back of modified line writes main memory in different order from the order of the write operations or overwrites neighboring data
Solution:
Disallow more than one modified copy
36
Invalidation vs. update (I)
Invalidation-based:
On each write of a shared line, it has to invalidate copies in remote caches
Simple implementation for bus-based systems:
Each cache snoops
Invalidate lines written by other CPUs
Signal sharing for cache lines in local cache to other caches
Update-based:
Local write updates copies in remote caches
Can update all CPUs at once
Multiple writes cause multiple updates (more traffic)
37
Invalidation vs. update (II)
Invalidation-based:
Only write misses hit the bus (works with write-back caches)
Subsequent writes to the same cache line are local
Good for multiple writes to the same line (in the same cache)
Update-based:
All sharers continue to hit cache line after one core writes
Implicit assumption: shared lines are accessed often
Supports producer-consumer pattern well
Many (local) writes may waste bandwidth!
Hybrid forms are possible!
38
MESI Cache Coherence
Most common hardware implementation of the discussed requirements
a.k.a. the “Illinois protocol”
Each line has one of the following states (in a cache):
Modified (M)
Local copy has been modified, no copies in other caches
Memory is stale
Exclusive (E)
No copies in other caches
Memory is up to date
Shared (S)
Unmodified copies may exist in other caches
Memory is up to date
Invalid (I)
Line is not in cache
39
Terminology
Clean line:
Content of cache line and main memory is identical (also: memory is up to date)
Can be evicted without write-back
Dirty line:
Content of cache line and main memory differ (also: memory is stale)
Needs to be written back eventually
Time depends on protocol details
Bus transaction:
A signal on the bus that can be observed by all caches
Usually blocking
Local read/write:
A load/store operation originating at a core connected to the cache
40
Transitions in response to local reads
State is M
No bus transaction
State is E
No bus transaction
State is S
No bus transaction
State is I
Generate bus read request (BusRd)
May force other cache operations (see later)
Other cache(s) signal “sharing” if they hold a copy
If shared was signaled, go to state S
Otherwise, go to state E
After update: return read value
41
Transitions in response to local writes
State is M
No bus transaction
State is E
No bus transaction
Go to state M
State is S
Line already local & clean
There may be other copies
Generate bus read request for upgrade to exclusive (BusRdX*)
Go to state M
State is I
Generate bus read request for exclusive ownership (BusRdX)
Go to state M
42
Transitions in response to snooped BusRd
State is M
Write cache line back to main memory
Signal “shared”
Go to state S (or E)
State is E
Signal “shared”
Go to state S
State is S
Signal “shared”
State is I
Ignore
43
Transitions in response to snooped BusRdX
State is M
Write cache line back to memory
Discard line and go to I
State is E
Discard line and go to I
State is S
Discard line and go to I
State is I
Ignore
BusRdX* is handled like BusRdX!
44
MESI State Diagram (FSM)
45
Source: Wikipedia
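The transitions above fit in a few lines of code; a minimal sketch in C (illustrative only: data movement, the “shared” signal wiring, and bus arbitration are abstracted away):

typedef enum { I, S, E, M } mesi_t;

/* Events seen by one cache's controller for a given line. */
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_BUSRD, SNOOP_BUSRDX } event_t;

/* Returns the next state; *bus_req is set to the transaction to issue
   ("" if none); shared_signaled tells a LOCAL_READ miss whether another
   cache signaled sharing. */
mesi_t mesi_next(mesi_t s, event_t e, int shared_signaled,
                 const char **bus_req) {
    *bus_req = "";
    switch (e) {
    case LOCAL_READ:
        if (s == I) { *bus_req = "BusRd"; return shared_signaled ? S : E; }
        return s;                       /* M, E, S: hit, no bus transaction */
    case LOCAL_WRITE:
        if (s == M) return M;           /* already exclusive and dirty */
        if (s == E) return M;           /* silent upgrade, no bus transaction */
        if (s == S) { *bus_req = "BusRdX*"; return M; }  /* upgrade request */
        *bus_req = "BusRdX"; return M;  /* I: read for exclusive ownership */
    case SNOOP_BUSRD:
        if (s == M) { /* write line back, signal "shared" */ return S; }
        if (s == E) { /* signal "shared" */ return S; }
        return s;                       /* S signals "shared"; I ignores */
    case SNOOP_BUSRDX:                  /* BusRdX* is handled identically */
        if (s == M) { /* write line back to memory */ }
        return I;                       /* M, E, S discard; I ignores */
    }
    return s;
}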
Small Exercise
Initially: all in I state
46
Action       P1 state  P2 state  P3 state  Bus action  Data from
P1 reads x
P2 reads x
P1 writes x
P1 reads x
P3 writes x
Small Exercise (solution)
Initially: all in I state
47
Action       P1 state  P2 state  P3 state  Bus action  Data from
P1 reads x   E         I         I         BusRd       Memory
P2 reads x   S         S         I         BusRd       Cache
P1 writes x  M         I         I         BusRdX*     Cache
P1 reads x   M         I         I         -           Cache
P3 writes x  I         I         M         BusRdX      Memory
Optimizations?
Class question: what could be optimized in the MESI protocol to make a system faster?
48
Related Protocols: MOESI (AMD)
Extended MESI protocol
Cache-to-cache transfer of modified cache lines
Cache in M or O state always transfers cache line to requesting cache
No need to contact (slow) main memory
Avoids write back when another process accesses cache line
Good when cache-to-cache performance is higher than cache-to-memory
E.g., shared last level cache!
49
MOESI State Diagram
50
Source: AMD64 Architecture Programmer’s Manual
Related Protocols: MOESI (AMD)
Modified (M): Modified Exclusive
No copies in other caches, local copy dirty
Memory is stale, cache supplies copy (reply to BusRd*)
Owner (O): Modified Shared
Exclusive right to make changes
Other S copies may exist (“dirty sharing”)
Memory is stale, cache supplies copy (reply to BusRd*)
Exclusive (E):
Same as MESI (one local copy, up to date memory)
Shared (S):
Unmodified copy may exist in other caches
Memory is up to date unless an O copy exists in another cache
Invalid (I):
Same as MESI
51
Related Protocols: MESIF (Intel)
Modified (M): Modified Exclusive
No copies in other caches, local copy dirty
Memory is stale, cache supplies copy (reply to BusRd*)
Exclusive (E):
Same as MESI (one local copy, up to date memory)
Shared (S):
Unmodified copy may exist in other caches
Memory is up to date
Invalid (I):
Same as MESI
Forward (F):
Special form of S state, other caches may have line in S
Most recent requester of line is in F state
Cache acts as responder for requests to this line
52
Multi-level caches
Most systems have multi-level caches
Problem: only “last level cache” is connected to bus or network
Yet, snoop requests are relevant for inner-levels of cache (L1)
Modifications of L1 data may not be visible at L2 (and thus the bus)
L1/L2 modifications
On BusRd check if line is in M state in L1
It may be in E or S in L2!
On BusRdX(*) send invalidations to L1
Everything else can be handled in L2
If L1 is write through, L2 could “remember” state of L1 cache line
May increase traffic though
53
Directory-based cache coherence
Snooping does not scale
Bus transactions must be globally visible
Implies broadcast
Typical solution: tree-based (hierarchical) snooping
Root becomes a bottleneck
Directory-based schemes are more scalable
Directory (entry for each CL) keeps track of all owning caches
Point-to-point update to involved processors
No broadcast
Can use specialized (high-bandwidth) network, e.g., HT, QPI …
54
Basic Scheme
System with N processors Pi
For each memory block (size: cache line) maintain a directory entry
N presence bits
Set if block in cache of Pi
1 dirty bit
First proposed by Censier and Feautrier (1978)
55
Directory-based CC: Read miss
Pi intends to read, misses
If dirty bit (in directory) is off
Read from main memory
Set presence[i]
Supply data to reader
If dirty bit is on
Recall cache line from Pj (determine by presence[])
Update memory
Unset dirty bit, block shared
Set presence[i]
Supply data to reader
56
Directory-based CC: Write miss
Pi intends to write, misses
If dirty bit (in directory) is off
Send invalidations to all processors Pj with presence[j] turned on
Unset presence bit for all processors
Set dirty bit
Set presence[i], owner Pi
If dirty bit is on
Recall cache line from owner Pj
Update memory
Unset presence[j]
Set presence[i], dirty bit remains set
Supply data to writer
57
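A sketch of the directory entry and the two miss handlers just described (illustrative C, assuming at most 64 processors so the presence bits fit one 64-bit word; the recall/invalidate network messages are stubs, and __builtin_ctzll is a GCC/Clang builtin):

#include <stdint.h>

typedef struct {
    uint64_t presence;   /* bit j set: block is cached by processor Pj */
    int dirty;           /* set: exactly one cache holds a modified copy */
} dir_entry_t;

static void invalidate(int j)        { (void)j; /* network message stub */ }
static void recall_and_update(int j) { (void)j; /* fetch dirty line from Pj
                                                   and update memory (stub) */ }

static int owner_of(const dir_entry_t *d) {
    return __builtin_ctzll(d->presence);  /* dirty => single presence bit */
}

void read_miss(dir_entry_t *d, int i) {   /* Pi intends to read, misses */
    if (d->dirty) {
        recall_and_update(owner_of(d));   /* recall dirty line, update memory */
        d->dirty = 0;                     /* block becomes shared */
    }
    d->presence |= 1ull << i;             /* supply data to reader */
}

void write_miss(dir_entry_t *d, int i) {  /* Pi intends to write, misses */
    if (!d->dirty) {
        for (int j = 0; j < 64; j++)      /* invalidate all sharers */
            if (d->presence & (1ull << j)) invalidate(j);
        d->presence = 0;
    } else {
        int j = owner_of(d);              /* recall line from current owner */
        recall_and_update(j);
        d->presence &= ~(1ull << j);
    }
    d->dirty = 1;                         /* Pi becomes the sole owner */
    d->presence |= 1ull << i;             /* supply data to writer */
}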
Discussion
Scaling of memory bandwidth
No centralized memory
Directory-based approaches scale with restrictions
Require presence bit for each cache
Number of bits determined at design time
Directory requires memory (size scales linearly)
Shared vs. distributed directory
Software-emulation
Distributed shared memory (DSM)
Emulate cache coherence in software (e.g., TreadMarks)
Often on a per-page basis, utilizes memory virtualization and paging
59
Open Problems (for projects or theses)
Tune algorithms to cache-coherence schemes
What is the optimal parallel algorithm for a given scheme?
Parameterize for an architecture
Measure and classify hardware
Read Maranget et al. “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models” and have fun!
RDMA consistency is barely understood!
GPU memories are not well understood!
Huge potential for new insights!
Can we program (easily) without cache coherence?
How to fix the problems with inconsistent values?
Compiler support (issues with arrays)?
60
Case Study: Intel Xeon Phi
61
Communication?
Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13
62
Local read: RL = 8.6 ns
Remote read: RR = 235 ns
Invalid read: RI = 278 ns
Inspired by Molka et al.: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system”
63
Single-Line Ping Pong
Prediction for both in E state: 479 ns
Measurement: 497 ns (O=18)
64
Multi-Line Ping Pong
More complex due to prefetch
Asymptotic fetch latency for each cache line (optimal prefetch!)
Model terms: number of CLs, startup overhead, amortization of startup
65
E state: o = 76 ns, q = 1,521 ns, p = 1,096 ns
I state: o = 95 ns, q = 2,750 ns, p = 2,017 ns
Multi-Line Ping Pong
66
E state: a = 0 ns, b = 320 ns, c = 56.2 ns
DTD Contention
67