
Processor specification without a memory model

Arvind

Computer Science & Artificial Intelligence Lab.

Massachusetts Institute of Technology

joint work with Murali Vijayaraghavan, Adam Chlipala

IFIP WG2.8, Estes Park, Colorado, August 13, 2014

August 14, 2014 1

Good news

Research on memory models has penetrated the POPL community

Memory models used to be exclusively the domain of computer architects, until people realized that the Java memory model (Chapter 16 of Guy Steele’s reference manual) did not make much sense

The fixes to the Java memory model were proposed by computer architects and, to my chagrin, published at POPL

+ This drew the attention of serious PL researchers

- Not surprisingly, many papers followed showing that the new Java model was also broken


My concern

The POPL community is too focused on modeling the warts and moles of existing processor memory models, as opposed to defining what a good model ought to be

For me, the Java memory model was broken because it did not have an operational (computational) interpretation. This danger is present in the current C11 effort.

now about processors...


Instruction set specifications

IBM was way ahead of its time when, in 1964, it published an instruction set definition and 6 implementations of it

The state of the art at that time, from the reference manual of the IBM 650, a drum machine with 44 instructions:

Instruction: 60 1234 1009

• “Load the contents of location 1234 into the distribution; put it also into the upper accumulator; set lower accumulator to zero; and then go to location 1009 for the next instruction.”

No serious problems until multiprocessors appeared (early seventies)


Sequential Consistency (SC): A Memory Model

“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program”

Leslie Lamport

[Figure: six processors (P) sharing a single memory (M)]

Dijkstra/Dekker had already used this model in 1966 in their solution to the readers and writer problem (No multiprocessor – concurrency in OS)
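The definition can be made concrete by enumerating every interleaving of a small program and collecting the results. The following is only a minimal sketch: the two-processor program (the classic "store buffering" test), the register names, and the helper functions are all illustrative assumptions, not material from the talk.

```python
from itertools import permutations

# Hypothetical two-processor program (the "store buffering" litmus test).
# Each operation is ("St", addr, value) or ("Ld", addr, register).
PROGRAM = {
    0: [("St", "x", 1), ("Ld", "y", "r0")],
    1: [("St", "y", 1), ("Ld", "x", "r1")],
}

def interleavings(program):
    """Yield every global order that keeps each processor's program order."""
    slots = [p for p, ops in program.items() for _ in ops]
    for order in set(permutations(slots)):
        next_op = {p: 0 for p in program}
        sequence = []
        for p in order:
            sequence.append((p, program[p][next_op[p]]))
            next_op[p] += 1
        yield sequence

def run(sequence):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for _, (kind, addr, arg) in sequence:
        if kind == "St":
            memory[addr] = arg
        else:
            registers[arg] = memory[addr]
    return (registers["r0"], registers["r1"])

def sc_outcomes(program):
    """All (r0, r1) results that some sequential order can produce."""
    return {run(seq) for seq in interleavings(program)}

if __name__ == "__main__":
    print(sorted(sc_outcomes(PROGRAM)))
```

Under SC the outcome r0 = r1 = 0 never appears, because some store must come first in any sequential order. A machine that lets a Ld overtake an earlier St to a different address can produce it.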


Memory Model Issue

Architectural optimizations that are correct for uniprocessors often violate sequential consistency and result in a new memory model for multiprocessors

Memory models have rarely been designed deliberately by architects; rather, they have emerged as a way of characterizing the legal behaviors of the processors they have built


A Real Multiprocessor

Processors are pipelined and generate many Ld/St requests, which are processed in a pipelined and concurrent manner, while in Lamport’s SC system only one instruction and at most one memory request is processed at a time!

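[Figure: six processors (P), each with its own cache ($), above a second level of caches ($) and a memory (M). Processors send Ld or St requests and receive Ld responses and St acks; the caches exchange cache-coherence messages.]

How to make the real multiprocessor behave like SC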


Implementing SC

Make the real system process one memory request at a time

This is like throwing the baby out with the bath water

Let processors generate many requests but make caches “behave” in the SC manner

for performance reasons we need to process Ld requests out of order (cache-hit vs cache-miss) and in a pipelined manner

solution: make adjustments to both the memory caches and the pipelined processors


Two surprising facts

Cache Coherence (CC) protocols are not affected by memory models

puzzle: what is the correctness criterion for a CC protocol?

Processors also pay minimal attention to memory models but make sure it is possible to mimic SC behavior

Processors provide memory fences and other ad hoc techniques to enforce SC if desired

Perhaps we should think of a processor model independent of a memory model


A simple processor sans a memory model

issues a mem request and waits for an answer

a flag to indicate if the processor is waiting for a memory response
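The behavior described above (issue one request, then wait) can be sketched as a tiny state machine. This is a minimal sketch under stated assumptions: the class name, the instruction encoding, and the method names are hypothetical, and the queues are named p2m and m2p after the processor-to-memory and memory-to-processor channels that appear later in the deck.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class SimpleProcessor:
    # Assumed encoding: ("Ld", addr, register) or ("St", addr, value).
    program: list
    pc: int = 0
    regs: dict = field(default_factory=dict)
    wait: bool = False                          # True while a response is pending
    p2m: deque = field(default_factory=deque)   # requests to memory
    m2p: deque = field(default_factory=deque)   # responses from memory

    def issue(self):
        """Send the current instruction's memory request, then wait."""
        if self.wait or self.pc >= len(self.program):
            return False
        kind, addr, arg = self.program[self.pc]
        self.p2m.append((kind, addr, arg if kind == "St" else None))
        self.wait = True
        return True

    def receive(self):
        """Consume the response, update state, and advance the pc."""
        if not self.wait or not self.m2p:
            return False
        response = self.m2p.popleft()
        kind, _, arg = self.program[self.pc]
        if kind == "Ld":
            self.regs[arg] = response
        self.pc += 1
        self.wait = False
        return True
```

Nothing in this processor says how memory orders requests from different processors; that is left entirely to whatever sits on the other end of the two queues.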


A simple multiported memory system


Cache-coherent memory systems

Use cache-coherence protocols to make MC behave like MRO (aka atomic store)

Transition systems for CC protocols are complex

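[Figure: the three memory systems MSC, MRO, and MC. MC is drawn as a hierarchy of caches ($) over a memory (M); MSC and MRO are each drawn as a single memory (M) with request queues. Colors represent addresses.]

MSC selects the first request from some queue

MRO selects the first request for any address from some queue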
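The difference between the two selection rules can be sketched in a few lines. This is a minimal sketch, assuming one request queue per processor held as a Python list; the `Request` tuple and the function names are hypothetical and not taken from the talk.

```python
from collections import namedtuple

# Assumed request format; kind is "Ld" or "St".
Request = namedtuple("Request", ["kind", "addr", "value"])

def msc_candidates(queues):
    """MSC: only the head of each non-empty queue may be processed next."""
    return [(q, 0) for q, reqs in enumerate(queues) if reqs]

def mro_candidates(queues):
    """MRO: in each queue, the oldest request for each address may go next."""
    candidates = []
    for q, reqs in enumerate(queues):
        seen = set()
        for i, req in enumerate(reqs):
            if req.addr not in seen:
                seen.add(req.addr)
                candidates.append((q, i))
    return candidates

def process(memory, queues, choice):
    """Remove the chosen request, apply it to memory, return a Ld's value."""
    q, i = choice
    req = queues[q].pop(i)
    if req.kind == "St":
        memory[req.addr] = req.value
        return None
    return memory[req.addr]
```

For a queue holding `St a`, `Ld b`, `Ld a`, MSC offers only `St a`, while MRO also offers `Ld b`. Requests to the same address still leave each queue in order.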


Implementing SC using MRO

Since MRO can process requests out of order, make the processor verify whether processing the requests out of order makes any difference

How? Speculative Loads!

[Figure: six processors (P) connected to a memory system (M)]


Speculative processors (PS)

When a processor speculates correctly, pc must match npc; if it does not, the rob is emptied out and ppc is set to npc

The instruction set specifies the next value of (state,npc) given the current value of (state,npc)

Therefore, a correct PS must guarantee that the commit slot has the right values for (npc’,state’) when pc and npc match

[Figure: speculative processor with state, npc, and ppc (driven by the next pc predictor), a rob (reorder buffer) whose commit slot holds (pc, npc’, state’), and a connection to/from memory]

What about Ld? St?

Speculative Ld/St

NO speculative Sts; a St is issued only when it reaches the commit slot

Speculative Lds are harmless because they do not affect the state of the external memory

However, a speculative Ld can get a wrong value because PS may have read it too soon

Solution: issue the Ld again when it reaches the commit slot and compare the values;

mismatch means speculation failure (empty out rob)

To guarantee SC no more than one St or verification Ld should be outstanding!
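The verification-load idea can be sketched as a single commit-time check. This is a minimal sketch, assuming memory is a plain dictionary that can be read directly and the rob is a Python list; `LoadEntry` and `commit_load` are hypothetical names, and a real design would send the verification Ld through the memory system rather than index a dictionary.

```python
from dataclasses import dataclass

@dataclass
class LoadEntry:
    addr: str
    speculative_value: int  # value obtained when the Ld was issued early

def commit_load(entry, memory, rob):
    """Re-issue the Ld at the commit slot and compare with the early value.

    Returns the committed value, or None if speculation failed and the
    rob was emptied.
    """
    verified_value = memory[entry.addr]  # the verification load
    if verified_value != entry.speculative_value:
        rob.clear()                      # speculation failure: empty out rob
        return None
    return verified_value
```

If another processor's St to the same address lands between the speculative Ld and the commit, the comparison fails and everything in the rob is discarded, so no instruction that depended on the stale value is committed.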


Speculative Processor


PS Load Instruction

verification load


PS Store Instruction


Vector of systems

System P can be lifted to a vector of Ps:

$$\frac{(s[p], \ldots) \rightarrow_{P} (s', \ldots)}{(s, \ldots) \rightarrow_{P^*} (s[p := s'], \ldots)}$$
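The lifting can be sketched as a higher-order function. This is a minimal sketch, assuming a single-processor step is a function from (local state, shared state) to a new pair, or None when no move is possible; `lift`, `step`, and the argument names are illustrative assumptions.

```python
def lift(step):
    """Lift a single-processor step function to a vector of processors.

    `step(local, shared)` returns `(local', shared')`, or None if the
    processor cannot move. The lifted function applies it to component p
    only and leaves every other component untouched.
    """
    def lifted(states, shared, p):
        result = step(states[p], shared)
        if result is None:
            return None
        new_local, new_shared = result
        new_states = list(states)
        new_states[p] = new_local      # s[p := s']
        return tuple(new_states), new_shared
    return lifted
```

For example, with a hypothetical step `tick = lambda local, shared: (local + 1, shared + 1)`, calling `lift(tick)((0, 0, 0), 0, 1)` moves only processor 1 and returns `((0, 1, 0), 1)`.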


System A simulating System B

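Definition ($A \sqsubseteq B$): If system A can make a move, then system B can also make a similar move

$A \sqsubseteq B$ is also known as "A implements B" or "A is sound wrt B"

For a function $f$ mapping states of A to states of B:

$$s_1 \rightarrow_A s_2 \;\Longrightarrow\; f(s_1) \rightarrow_B^{+} f(s_2)$$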
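For finite systems the definition can be checked by brute force. This is a minimal sketch, assuming each transition system is given as a dictionary from a state to a list of successor states and that f is an ordinary Python function; the function names are hypothetical, and this has nothing to do with how the Coq proofs are structured.

```python
def reachable_in_one_or_more(transitions, start):
    """States reachable from `start` using at least one transition."""
    frontier = list(transitions.get(start, ()))
    seen = set()
    while frontier:
        state = frontier.pop()
        if state not in seen:
            seen.add(state)
            frontier.extend(transitions.get(state, ()))
    return seen

def simulates(a_transitions, b_transitions, f):
    """Check that every move s1 -> s2 of A maps to f(s1) ->+ f(s2) in B."""
    for s1, successors in a_transitions.items():
        for s2 in successors:
            if f(s2) not in reachable_in_one_or_more(b_transitions, f(s1)):
                return False
    return True
```

As a usage example, a counter modulo 4 that steps by two, `{0: [2], 2: [0]}`, is simulated by one that steps by one, `{0: [1], 1: [2], 2: [3], 3: [0]}`, under the identity mapping, but not the other way round.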


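Main theorems

Theorem (cache-coherent memory): $M_C \sqsubseteq M_{RO}$

Theorem (correctly speculating processor): $P_S \sqsubseteq P_{ref}$ (one instruction at a time)

Theorem: $P_S^{*} + M_C \sqsubseteq SC$

This follows easily from the lemma $P_{ref}^{*} + M_{RO} \sqsubseteq P_{ref}^{*} + M_{SC}$, where the right-hand side is the definition of SC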

All proofs are done in Coq by Murali under Adam’s guidance


Conclusion

The architectural idea of verification loads can be shown to work formally

Proof checkers (e.g., Coq) have come of age. This was not the case in 2000, when we tried using PVS for a cache-coherence proof

Modular proofs are essential to make progress in such complex systems

e.g., is the bug in the cache-coherence protocol or in the speculative processor implementation?

We can express these transition systems directly in Bluespec which can be synthesized into hardware

Thanks


Extras


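Lamport’s SC system

Ld rule:

$$\frac{\mathit{decode}(pcs[p], s[p]) = \mathrm{Ld}(a,x) \qquad \mathit{execute}(\mathrm{Ld}(a,x), pcs[p], s[p], m[a]) = (pc', \delta)}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s[p := s[p] + \delta],\; m)}$$

St rule:

$$\frac{\mathit{decode}(pcs[p], s[p]) = \mathrm{St}(a,x) \qquad \mathit{execute}(\mathrm{St}(a,x), pcs[p], s[p], \_) = (pc', \mathrm{St}(a,v))}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s,\; m[a := v])}$$

Non-mem instruction:

$$\frac{\mathit{decode}(pcs[p], s[p]) = \mathrm{Nm}(a,x) \qquad \mathit{execute}(\mathrm{Nm}(a,x), pcs[p], s[p], \_) = (pc', \delta)}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s[p := s[p] + \delta],\; m)}$$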


Speculative Out-of-order Processor



Decoupled Systems

Processor transition system P: $(pc, s, p2m, m2p, \ldots) \rightarrow_P (pc', s', p2m', m2p', \ldots)$

Memory transition system M: $(m, p2m, m2p) \rightarrow_M (m', p2m', m2p')$

Each system MC, MRO, Pref, MSC can be described as a transition system using a small number of rules


Coupling two Systems

Let the two systems have a shared variable s and unshared variables x and y

System A: $(s,x) \rightarrow_A (s',x')$

System B: $(s,y) \rightarrow_B (s',y')$

System A+B:

$$\frac{(s,x) \rightarrow_A (s',x')}{(s,x,y) \rightarrow_{A+B} (s',x',y)}$$

$$\frac{(s,y) \rightarrow_B (s',y')}{(s,x,y) \rightarrow_{A+B} (s',x,y')}$$

We can thus couple M to Ps
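The coupling rules translate directly into a combinator over step functions. This is a minimal sketch, assuming each system is a function from (shared, private) state to a list of successor pairs; `couple` and the producer/consumer example in the usage note are hypothetical illustrations, not the processor and memory systems of the talk.

```python
def couple(step_a, step_b):
    """Build A+B from step functions over (shared, private) state.

    Each step function maps (s, private) to a list of (s', private')
    successors. A+B lets either system move; the other system's private
    variables are left unchanged.
    """
    def step_ab(state):
        s, x, y = state
        moves = [(s2, x2, y) for s2, x2 in step_a(s, x)]   # A moves, y kept
        moves += [(s2, x, y2) for s2, y2 in step_b(s, y)]  # B moves, x kept
        return moves
    return step_ab
```

For instance, couple a producer `lambda s, x: [(s + (x,), x - 1)] if x > 0 else []` with a consumer `lambda s, y: [(s[1:], y + (s[0],))] if s else []`, where the shared variable is a tuple used as a queue. From state `((2,), 1, ())` the coupled system can either produce again or consume, which is the same shape as a processor and a memory sharing the p2m and m2p queues.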

