

Design of Parallel and High-Performance Computing
Fall 2017
Lecture: Cache Coherence & Memory Models

Instructor: Torsten Hoefler & Markus Püschel

TAs: Salvatore Di Girolamo

Motivational video: https://www.youtube.com/watch?v=zJybFF6PqEQ


Sorry for missing last week’s lecture!

2


To talk about some silliness in MPI-3

3


City and climate change: how do we meet the challenges?

Wednesday, 8 November 2017, 15:00 – 19:30

ETH Zürich, main building (Hauptgebäude)

Information and registration:

http://www.c2sm.ethz.ch/events/eth-klimarunde-2017.html


A more complete view

6


Architecture Developments

<1999: distributed memory machines communicating through messages

’00-’05: large cache-coherent multicore machines communicating through coherent memory access and messages

’06-’12: large cache-coherent multicore machines communicating through coherent memory access and remote direct memory access

’13-’20: coherent and non-coherent manycore accelerators and multicores communicating through memory access and remote direct memory access

>2020: largely non-coherent accelerators and multicores communicating through remote direct memory access

Sources: various vendors

7


Computer Architecture vs. Physics

Physics (technological constraints)
  Cost of data movement
  Capacity of DRAM cells
  Clock frequencies (constrained by end of Dennard scaling)
  Speed of light
  Melting point of silicon

Computer Architecture (design of the machine)
  Power management
  ISA / Multithreading
  SIMD widths

“Computer architecture, like other architecture, is the art of determining the needs of the user of a structure and then designing to meet those needs as effectively as possible within economic and technological constraints.” – Fred Brooks (IBM, 1962)

Have converted many former “power” problems into “cost” problems

8
Credit: John Shalf (LBNL)


Low-Power Design Principles (2005)

Cubic power improvement with lower clock rate due to V²F

Slower clock rates enable use of simpler cores

Simpler cores use less area (lower leakage) and reduce cost

Tailor design to application to REDUCE WASTE

Intel Core2

Intel Atom

Tensilica XTensa

Power 5

9
Credit: John Shalf (LBNL)


Low-Power Design Principles (2005)

Power5 (server): 120W @ 1900MHz (baseline)
Intel Core2 sc (laptop): 15W @ 1000MHz (4x more FLOPs/watt than baseline)
Intel Atom (handhelds): 0.625W @ 800MHz (80x more)
GPU Core or XTensa/Embedded: 0.09W @ 600MHz (400x more, 80x-120x sustained)

10
Credit: John Shalf (LBNL)


Low-Power Design Principles (2005)

Power5 (server): 120W @ 1900MHz (baseline)
Intel Core2 sc (laptop): 15W @ 1000MHz (4x more FLOPs/watt than baseline)
Intel Atom (handhelds): 0.625W @ 800MHz (80x more)
GPU Core or XTensa/Embedded: 0.09W @ 600MHz (400x more, 80x-120x sustained)

Even if each simple core is 1/4th as computationally efficient as a complex core, you can fit hundreds of them on a single chip and still be 100x more power efficient.

11
Credit: John Shalf (LBNL)
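As a sanity check on the ratios above, here is a minimal sketch (not from the slides) that recomputes them from the quoted power and clock numbers alone, assuming the designs do comparable work per clock; the names and figures come from the slide, everything else is illustrative.

#include <stdio.h>

/* Recompute the slide's efficiency ratios from W and MHz only,
 * assuming comparable work per clock across the designs. */
int main(void) {
    const char  *name[]  = {"Power5 (server)", "Intel Core2 sc", "Intel Atom", "XTensa/embedded"};
    const double watts[] = {120.0, 15.0, 0.625, 0.09};
    const double mhz[]   = {1900.0, 1000.0, 800.0, 600.0};

    double baseline = watts[0] / mhz[0];              /* W per MHz of the Power5 */
    for (int i = 0; i < 4; i++) {
        double w_per_mhz = watts[i] / mhz[i];
        printf("%-16s %8.5f W/MHz  -> %4.0fx baseline efficiency\n",
               name[i], w_per_mhz, baseline / w_per_mhz);
    }
    return 0;                                          /* prints ~1x, 4x, 81x, 421x */
}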


Heterogeneous Future (LOCs and TOCs)

Latency Optimized Core (LOC): big cores (very few); most energy efficient if you don’t have lots of parallelism

Throughput Optimized Core (TOC): tiny cores (0.23mm x 0.2 mm), lots of them! Most energy efficient if you DO have a lot of parallelism!

12
Credit: John Shalf (LBNL)


Data movement – the wires

Energy efficiency of a copper wire: Power = Frequency * Length / cross-section-area
  Wire efficiency does not improve as feature size shrinks

Energy efficiency of a transistor: Power = V² * Frequency * Capacitance
  Capacitance ~= area of transistor
  Transistor efficiency improves as you shrink it

Net result is that moving data on wires is starting to cost more energy than computing on said data (interest in Silicon Photonics)

Photonics could break through the bandwidth-distance limit

13
Credit: John Shalf (LBNL)


Pin Limits

Moore’s law doesn’t apply to adding pins to a package
  30%+ per year nominal Moore’s Law
  Pins grow at ~1.5-3% per year at best

4000 pins is an aggressive pin package
  Half of those would need to be for power and ground
  Of the remaining 2k pins, run as differential pairs
  Beyond 15Gbps per pin, power/complexity costs hurt!
  10Gbps * 1k pins is ~1.2 TBytes/sec

2.5D integration gets a boost in pin density
  But it’s a one-time boost (how much headroom?)
  4TB/sec? (maybe 8TB/s with single-wire signaling?)

14
Credit: John Shalf (LBNL)

Die Photos (3 classes of cores)

[Figure: first page of Y. Lee, A. Waterman, R. Avizienis, H. Cook, C. Sun, V. Stojanovic, K. Asanovic, “A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators” (UC Berkeley / MIT), shown next to die photos of three classes of cores. Figure annotations: backside chip micrograph and processor block diagram (Fig. 1), Rocket scalar plus Hwacha vector pipeline diagram (Fig. 2); die dimensions 8.4mm x 20mm, 1.2mm x 0.5mm, 0.23mm x 0.2mm.]

15
Credit: John Shalf (LBNL)

Strip down to the core

[Figure: the same RISC-V paper page (Lee et al.), now with the die photos cropped down to the cores. Die dimensions annotated: 2.7mm x 4.5mm, 1.2mm x 0.5mm, 0.23mm x 0.2mm.]

16
Credit: John Shalf (LBNL)

Actual Size

[Figure: the same RISC-V paper page (Lee et al.), rendered at actual size. Die dimensions annotated: 4.5mm x 2.7mm, 1.2mm x 0.5mm, 0.23mm x 0.2mm.]

17
Credit: John Shalf (LBNL)

Basic Stats

Core Energy/Area est.
  Area: 12.25 mm2, Power: 2.5W, Clock: 2.4 GHz, E/op: 651 pJ
  Area: 0.6 mm2, Power: 0.3W (<0.2W), Clock: 1.3 GHz, E/op: 150 (75) pJ
  Area: 0.046 mm2, Power: 0.025W, Clock: 1.0 GHz, E/op: 22 pJ

Wire Energy Assumptions for 22nm
  100 fJ/bit per mm, 64-bit operand
  Energy: 1mm = ~6 pJ, 20mm = ~120 pJ

[Figure: the same RISC-V paper page (Lee et al.) with annotated die photos: 4.5mm x 2.7mm, 1.2mm x 0.5mm, 0.23mm x 0.2mm.]

18
Credit: John Shalf (LBNL)

When does data movement dominate?

Core Energy/Area est.
  Area: 12.25 mm2, Power: 2.5W, Clock: 2.4 GHz, E/op: 651 pJ
  Area: 0.6 mm2, Power: 0.3W (<0.2W), Clock: 1.3 GHz, E/op: 150 (75) pJ
  Area: 0.046 mm2, Power: 0.025W, Clock: 1.0 GHz, E/op: 22 pJ

Data Movement Cost
  651 pJ/op core: compute op == data movement energy @ 108mm; energy ratio for 20mm: 0.2x
  75 pJ/op core: compute op == data movement energy @ 12mm; energy ratio for 20mm: 1.6x
  22 pJ/op core: compute op == data movement energy @ 3.6mm; energy ratio for 20mm: 5.5x

[Figure: the same RISC-V paper page (Lee et al.) with annotated die photos: 4.5mm x 2.7mm, 1.2mm x 0.5mm, 0.23mm x 0.2mm.]

19
Credit: John Shalf (LBNL)
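The break-even distances above follow directly from the E/op numbers and the ~6 pJ/mm wire figure on the previous slide; the minimal sketch below (not part of the slides) recomputes them. The ~6 pJ/mm constant and the three E/op values are the slides'; the loop and output format are illustrative.

#include <stdio.h>

/* Recompute the "data movement dominates" break-evens: moving a 64-bit
 * operand costs ~6 pJ per mm (100 fJ/bit/mm), so a compute op breaks even
 * with data movement at E_op / 6 pJ/mm, and a 20mm hop costs ~120 pJ. */
int main(void) {
    const double pj_per_mm = 6.0;
    const double e_op_pj[] = {651.0, 75.0, 22.0};

    for (int i = 0; i < 3; i++) {
        double breakeven_mm = e_op_pj[i] / pj_per_mm;
        double ratio_20mm   = 20.0 * pj_per_mm / e_op_pj[i];
        printf("E/op = %5.1f pJ: break-even @ %5.1f mm, 20mm/compute ratio = %.1fx\n",
               e_op_pj[i], breakeven_mm, ratio_20mm);
    }
    return 0;   /* prints ~108.5mm/0.2x, 12.5mm/1.6x, 3.7mm/5.5x */
}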


DPHPC Overview

20


Goals of this lecture

Memory Trends

Cache Coherence

Memory Consistency

21


Memory – CPU gap widens

Measure processor speed as “throughput”

FLOPS/s, IOPS/s, …

Moore’s law - ~60% growth per year

Today’s architectures

POWER8: 338 dp GFLOP/s – 230 GB/s memory bw

Broadwell i7-5775C: 883 GFLOP/s, ~50 GB/s memory bw

Trend: memory performance grows 10% per year

22

Source: Jack Dongarra

Source: John Mc.Calpin
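To make the gap concrete, the two peak numbers above can be turned into a "FLOPs per loaded double" figure (machine balance); that framing and the tiny program below are added here, while the peak FLOP/s and bandwidth values come from the slide.

#include <stdio.h>

/* Peak FLOPs the cores can issue per 8-byte operand the memory system can
 * deliver, using the peak numbers quoted above. */
int main(void) {
    const double power8_gflops = 338.0, power8_gbs = 230.0;
    const double bdw_gflops    = 883.0, bdw_gbs    = 50.0;   /* i7-5775C */

    printf("POWER8:    %5.1f FLOPs per loaded double\n", power8_gflops / (power8_gbs / 8.0));
    printf("i7-5775C:  %5.1f FLOPs per loaded double\n", bdw_gflops / (bdw_gbs / 8.0));
    return 0;   /* ~11.8 and ~141: the CPU-memory gap in one number */
}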


Issues (AMD Interlagos as Example)

How to measure bandwidth?

Data sheet (often peak performance, may include overheads)

Frequency times bus width: 51 GiB/s

Microbenchmark performance

Stride 1 access (32 MiB): 32 GiB/s

Random access (8 B out of 32 MiB): 241 MiB/s

Why?

Application performance

As observed (performance counters)

Somewhere in between stride 1 and random access

How to measure Latency?

Data sheet (often optimistic, or not provided)

<100ns

Random pointer chase

110 ns with one core, 258 ns with 32 cores!

23
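The "random pointer chase" behind the latency numbers above can be sketched in a few lines; buffer size, step count and the permutation construction below are illustrative, and a careful measurement would additionally pin threads, use huge pages, and repeat runs.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Random pointer chase: each element stores the index of the next element,
 * arranged as one random cycle, so every load depends on the previous one. */
#define N      (4 * 1024 * 1024)      /* 4M elements * 8 B = 32 MiB */
#define STEPS  (50 * 1000 * 1000L)

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm). */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (long s = 0; s < STEPS; s++) p = next[p];   /* serialized, cache-missing loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (p=%zu)\n", ns / STEPS, p); /* p defeats DCE */
    free(next);
    return 0;
}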


Conjecture: Buffering is a must!

Two most common examples:

Write Buffers

Delayed write back saves memory bandwidth

Data is often overwritten or re-read

Caching

Directory of recently used locations

Stored as blocks (cache lines)

24


Cache Coherence

Different caches may have a copy of the same memory location!

Cache coherence

Manages existence of multiple copies

Cache architectures

Multi level caches

Shared vs. private (partitioned)

Inclusive vs. exclusive

Write back vs. write through

Victim cache to reduce conflict misses

25


Exclusive Hierarchical Caches

26


Shared Hierarchical Caches

27


Shared Hierarchical Caches with MT

28


Caching Strategies (repeat)

Remember:

Write Back?

Write Through?

Cache coherence requirements

A memory system is coherent if it guarantees the following:

Write propagation (updates are eventually visible to all readers)

Write serialization (writes to the same location must be observed in order)

Everything else: memory model issues (later)

29


Write Through Cache

30

1. CPU0 reads X from memory
   loads X=0 into its cache
2. CPU1 reads X from memory
   loads X=0 into its cache
3. CPU0 writes X=1
   stores X=1 in its cache
   stores X=1 in memory
4. CPU1 reads X from its cache
   loads X=0 from its cache
   Incoherent value for X on CPU1

CPU1 may wait for update!

Requires write propagation!


Write Back Cache

31

1. CPU0 reads X from memory
   loads X=0 into its cache
2. CPU1 reads X from memory
   loads X=0 into its cache
3. CPU0 writes X=1
   stores X=1 in its cache
4. CPU1 writes X=2
   stores X=2 in its cache
5. CPU1 writes back cache line
   stores X=2 in memory
6. CPU0 writes back cache line
   stores X=1 in memory
   Later store X=2 from CPU1 lost

Requires write serialization!


A simple (?) example

Assume C99:

Two threads:

Initially: a=b=0

Thread 0: write 1 to a

Thread 1: write 1 to b

Assume non-coherent write back cache

What may end up in main memory?

32

struct twoint {
  int a;
  int b;
};
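A runnable version of this thought experiment, assuming POSIX threads (the threading mechanism is not specified on the slide): on coherent hardware the joins force both a=1 and b=1 to be visible, while the slide's question is what a non-coherent write-back cache could leave in main memory, since each cache writes back the whole line containing both fields.

#include <pthread.h>
#include <stdio.h>

/* Two threads write neighboring fields of the same struct, which normally
 * sit in one cache line. With cache coherence this prints a=1 b=1; without
 * it, whole-line write-backs could leave a=1,b=0 or a=0,b=1 in memory. */
struct twoint {
    int a;
    int b;
};

static struct twoint x = {0, 0};

static void *thread0(void *arg) { (void)arg; x.a = 1; return NULL; } /* Thread 0: write 1 to a */
static void *thread1(void *arg) { (void)arg; x.b = 1; return NULL; } /* Thread 1: write 1 to b */

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}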


Cache Coherence Protocol

Programmer can hardly deal with unpredictable behavior!

Cache controller maintains data integrity

All writes to different locations are visible

Snooping

Shared bus or (broadcast) network

Directory-based

Record information necessary to maintain coherence:

E.g., owner and state of a line etc.

33

Fundamental Mechanisms


Fundamental CC mechanisms

Snooping

Shared bus or (broadcast) network

Cache controller “snoops” all transactions

Monitors and changes the state of the cache’s data

Works at small scale, challenging at large-scale

E.g., Intel Broadwell

Directory-based

Record information necessary to maintain coherence

E.g., owner and state of a line etc.

Central/Distributed directory for cache line ownership

Scalable but more complex/expensive

E.g., Intel Xeon Phi KNC/KNL

34

Source: Intel


Cache Coherence Parameters

Concerns/Goals

Performance

Implementation cost (chip space, more important: dynamic energy)

Correctness

(Memory model side effects)

Issues

Detection (when does a controller need to act)

Enforcement (how does a controller guarantee coherence)

Precision of block sharing (per block, per sub-block?)

Block size (cache line size?)

35


An Engineering Approach: Empirical start

Problem 1: stale reads

Cache 1 holds value that was already modified in cache 2

Solution:

Disallow this state

Invalidate all remote copies before allowing a write to complete

Problem 2: lost update

Incorrect write back of modified line writes main memory in different order from the order of the write operations or overwrites neighboring data

Solution:

Disallow more than one modified copy

36


Invalidation vs. update (I)

Invalidation-based:

On each write of a shared line, it has to invalidate copies in remote caches

Simple implementation for bus-based systems:

Each cache snoops

Invalidate lines written by other CPUs

Signal sharing for cache lines in local cache to other caches

Update-based:

Local write updates copies in remote caches

Can update all CPUs at once

Multiple writes cause multiple updates (more traffic)

37


Invalidation vs. update (II)

Invalidation-based:

Only write misses hit the bus (works with write-back caches)

Subsequent writes to the same cache line are local

Good for multiple writes to the same line (in the same cache)

Update-based:

All sharers continue to hit cache line after one core writes

Implicit assumption: shared lines are accessed often

Supports producer-consumer pattern well

Many (local) writes may waste bandwidth!

Hybrid forms are possible!

38


MESI Cache Coherence

Most common hardware implementation of the discussed requirements

aka. “Illinois protocol”

Each line has one of the following states (in a cache):

Modified (M)

Local copy has been modified, no copies in other caches

Memory is stale

Exclusive (E)

No copies in other caches

Memory is up to date

Shared (S)

Unmodified copies may exist in other caches

Memory is up to date

Invalid (I)

Line is not in cache


39


Terminology

Clean line:

Content of cache line and main memory is identical (also: memory is up to date)

Can be evicted without write-back

Dirty line:

Content of cache line and main memory differ (also: memory is stale)

Needs to be written back eventually

Time depends on protocol details

Bus transaction:

A signal on the bus that can be observed by all caches

Usually blocking

Local read/write:

A load/store operation originating at a core connected to the cache

40


Transitions in response to local reads

State is M

No bus transaction

State is E

No bus transaction

State is S

No bus transaction

State is I

Generate bus read request (BusRd)

May force other cache operations (see later)

Other cache(s) signal “sharing” if they hold a copy

If shared was signaled, go to state S

Otherwise, go to state E

After update: return read value

41


Transitions in response to local writes

State is M

No bus transaction

State is E

No bus transaction

Go to state M

State is S

Line already local & clean

There may be other copies

Generate bus read request for upgrade to exclusive (BusRdX*)

Go to state M

State is I

Generate bus read request for exclusive ownership (BusRdX)

Go to state M

42


Transitions in response to snooped BusRd

State is M

Write cache line back to main memory

Signal “shared”

Go to state S (or E)

State is E

Signal “shared”

Go to state S and signal “shared”

State is S

Signal “shared”

State is I

Ignore

43


Transitions in response to snooped BusRdX

State is M

Write cache line back to memory

Discard line and go to I

State is E

Discard line and go to I

State is S

Discard line and go to I

State is I

Ignore

BusRdX* is handled like BusRdX!

44
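The four transition slides above can be condensed into a small state machine; the sketch below models only the state changes described (no data transfer, write-back buffering, or bus arbitration), and the function names are illustrative.

#include <stdio.h>

/* Condensed MESI state machine following the transition slides above.
 * Only the state changes are modeled; data transfer, write-backs and the
 * wiring of the "shared" signal are reduced to comments. */
typedef enum { I, S, E, M } mesi_t;

/* Local read: M/E/S stay put; I issues BusRd and goes to S if another
 * cache signalled "shared", otherwise to E. */
static mesi_t local_read(mesi_t s, int shared_signalled) {
    return (s == I) ? (shared_signalled ? S : E) : s;
}

/* Local write: M stays M; E goes to M silently; S upgrades via BusRdX*;
 * I fetches for exclusive ownership via BusRdX. All paths end in M. */
static mesi_t local_write(mesi_t s) {
    (void)s;
    return M;
}

/* Snooped BusRd: M writes back and goes to S; E goes to S; S stays; I ignores. */
static mesi_t snoop_busrd(mesi_t s) {
    return (s == M || s == E) ? S : s;
}

/* Snooped BusRdX or BusRdX*: M writes back first; every state goes to I. */
static mesi_t snoop_busrdx(mesi_t s) {
    (void)s;
    return I;
}

int main(void) {
    /* Replay the first rows of the small exercise a few slides ahead:
     * P1 reads x, P2 reads x, P1 writes x. */
    mesi_t p1 = I, p2 = I;
    p1 = local_read(p1, 0);                        /* P1: I -> E (BusRd, nobody shares) */
    p2 = local_read(p2, 1); p1 = snoop_busrd(p1);  /* P2: I -> S, P1 snoops: E -> S     */
    p1 = local_write(p1);   p2 = snoop_busrdx(p2); /* P1: S -> M (BusRdX*), P2: S -> I  */
    printf("P1=%d P2=%d (I=0, S=1, E=2, M=3)\n", p1, p2);  /* P1=3 (M), P2=0 (I) */
    return 0;
}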


MESI State Diagram (FSM)

45
Source: Wikipedia


Small Exercise

Initially: all in I state

46

Action      | P1 state | P2 state | P3 state | Bus action | Data from
P1 reads x  |          |          |          |            |
P2 reads x  |          |          |          |            |
P1 writes x |          |          |          |            |
P1 reads x  |          |          |          |            |
P3 writes x |          |          |          |            |


Small Exercise

Initially: all in I state

47

Action      | P1 state | P2 state | P3 state | Bus action | Data from
P1 reads x  | E        | I        | I        | BusRd      | Memory
P2 reads x  | S        | S        | I        | BusRd      | Cache
P1 writes x | M        | I        | I        | BusRdX*    | Cache
P1 reads x  | M        | I        | I        | -          | Cache
P3 writes x | I        | I        | M        | BusRdX     | Memory


Optimizations?

Class question: what could be optimized in the MESI protocol to make a system faster?

48


Related Protocols: MOESI (AMD)

Extended MESI protocol

Cache-to-cache transfer of modified cache lines

Cache in M or O state always transfers cache line to requesting cache

No need to contact (slow) main memory

Avoids write back when another process accesses cache line

Good when cache-to-cache performance is higher than cache-to-memory

E.g., shared last level cache!

49


MOESI State Diagram

50
Source: AMD64 Architecture Programmer’s Manual


Related Protocols: MOESI (AMD)

Modified (M): Modified Exclusive

No copies in other caches, local copy dirty

Memory is stale, cache supplies copy (reply to BusRd*)

Owner (O): Modified Shared

Exclusive right to make changes

Other S copies may exist (“dirty sharing”)

Memory is stale, cache supplies copy (reply to BusRd*)

Exclusive (E):

Same as MESI (one local copy, up to date memory)

Shared (S):

Unmodified copy may exist in other caches

Memory is up to date unless an O copy exists in another cache

Invalid (I):

Same as MESI

51


Related Protocols: MESIF (Intel)

Modified (M): Modified Exclusive

No copies in other caches, local copy dirty

Memory is stale, cache supplies copy (reply to BusRd*)

Exclusive (E):

Same as MESI (one local copy, up to date memory)

Shared (S):

Unmodified copy may exist in other caches

Memory is up to date

Invalid (I):

Same as MESI

Forward (F):

Special form of S state, other caches may have line in S

Most recent requester of line is in F state

Cache acts as responder for requests to this line

52


Multi-level caches

Most systems have multi-level caches

Problem: only “last level cache” is connected to bus or network

Yet, snoop requests are relevant for inner-levels of cache (L1)

Modifications of L1 data may not be visible at L2 (and thus the bus)

L1/L2 modifications

On BusRd check if line is in M state in L1

It may be in E or S in L2!

On BusRdX(*) send invalidations to L1

Everything else can be handled in L2

If L1 is write through, L2 could “remember” state of L1 cache line

May increase traffic though

53


Directory-based cache coherence

Snooping does not scale

Bus transactions must be globally visible

Implies broadcast

Typical solution: tree-based (hierarchical) snooping

Root becomes a bottleneck

Directory-based schemes are more scalable

Directory (entry for each CL) keeps track of all owning caches

Point-to-point update to involved processors

No broadcast

Can use specialized (high-bandwidth) network, e.g., HT, QPI …

54


Basic Scheme

System with N processors Pi

For each memory block (size: cache line) maintain a directory entry

N presence bits

Set if block in cache of Pi

1 dirty bit

First proposed by Censier and Feautrier (1978)

55


Directory-based CC: Read miss

Pi intends to read, misses

If dirty bit (in directory) is off

Read from main memory

Set presence[i]

Supply data to reader

If dirty bit is on

Recall cache line from Pj (determine by presence[])

Update memory

Unset dirty bit, block shared

Set presence[i]

Supply data to reader

56


Directory-based CC: Write miss

Pi intends to write, misses

If dirty bit (in directory) is off

Send invalidations to all processors Pj with presence[j] turned on

Unset presence bit for all processors

Set dirty bit

Set presence[i], owner Pi

If dirty bit is on

Recall cache line from owner Pj

Update memory

Unset presence[j]

Set presence[i], dirty bit remains set

Supply data to writer

57
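A compact sketch of the directory entry and the two miss handlers described on the last three slides; invalidations, recalls, and the actual data transfers are reduced to comments, and the 64-processor limit exists only so the presence bits fit in one integer.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* One directory entry per memory block: N presence bits plus one dirty bit
 * (Censier/Feautrier-style, as on the previous slides). */
#define NPROC 64                      /* design-time constant, <= 64 here */

typedef struct {
    uint64_t presence;                /* bit i set: block cached by Pi */
    bool     dirty;
} dir_entry_t;

/* Pi read miss */
static void read_miss(dir_entry_t *d, int i) {
    if (d->dirty) {
        /* recall the line from the owner Pj, update memory, block becomes shared */
        d->dirty = false;
    }
    d->presence |= 1ULL << i;         /* set presence[i], supply data to the reader */
}

/* Pi write miss */
static void write_miss(dir_entry_t *d, int i) {
    if (!d->dirty) {
        /* send invalidations to every Pj with presence[j] set */
        d->presence = 0;
        d->dirty = true;
    } else {
        /* recall the line from the owner Pj, update memory; dirty bit stays set */
        d->presence = 0;
    }
    d->presence |= 1ULL << i;         /* Pi is now the owner, supply data to the writer */
}

int main(void) {
    dir_entry_t d = {0, false};
    read_miss(&d, 2);                 /* P2 reads: shared, clean */
    read_miss(&d, 5);                 /* P5 reads: shared, clean */
    write_miss(&d, 5);                /* P5 writes: P2 invalidated, P5 owns dirty line */
    printf("presence=0x%llx dirty=%d\n", (unsigned long long)d.presence, (int)d.dirty);
    return 0;                         /* presence=0x20 dirty=1 */
}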


Discussion

Scaling of memory bandwidth

No centralized memory

Directory-based approaches scale with restrictions

Require presence bit for each cache

Number of bits determined at design time

Directory requires memory (size scales linearly)

Shared vs. distributed directory

Software-emulation

Distributed shared memory (DSM)

Emulate cache coherence in software (e.g., TreadMarks)

Often on a per-page basis, utilizes memory virtualization and paging

59


Open Problems (for projects or theses)

Tune algorithms to cache-coherence schemes

What is the optimal parallel algorithm for a given scheme?

Parameterize for an architecture

Measure and classify hardware

Read Maranget et al. “A Tutorial Introduction to the ARM and POWER Relaxed Memory Models” and have fun!

RDMA consistency is barely understood!

GPU memories are not well understood!

Huge potential for new insights!

Can we program (easily) without cache coherence?

How to fix the problems with inconsistent values?

Compiler support (issues with arrays)?

60


Case Study: Intel Xeon Phi

61


Communication?

Ramos, Hoefler: “Modeling Communication in Cache-Coherent SMP Systems - A Case-Study with Xeon Phi”, HPDC’13

62


Local read: RL = 8.6 ns
Remote read: RR = 235 ns
Invalid read: RI = 278 ns

Inspired by Molka et al.: “Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system”

63


Single-Line Ping Pong

Prediction for both in E state: 479 ns

Measurement: 497 ns (O=18)

64


Multi-Line Ping Pong

More complex due to prefetch

Model terms (labels from the figure): asymptotic fetch latency for each cache line (optimal prefetch!); number of CLs; startup overhead; amortization of startup

65


Multi-Line Ping Pong

E state: o = 76 ns, q = 1,521 ns, p = 1,096 ns
I state: o = 95 ns, q = 2,750 ns, p = 2,017 ns

66


DTD Contention

E state: a = 0 ns, b = 320 ns, c = 56.2 ns

67