Processor specification without a memory model
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
joint work with Murali Vijayaraghavan, Adam Chlipala
IFIP WG2.8, Estes Park, Colorado, August 13, 2014
August 14, 2014 1
Good news
Research on memory models has penetrated the POPL community
Memory models used to be exclusively in the domain of computer architects, until people realized that the Java memory model (Chapter 16 of Guy Steele’s reference manual) did not make much sense
The fixes to the Java memory model were proposed by computer architects and, to my chagrin, published at POPL
+ drew the attention of serious PL researchers
- not surprisingly, many papers followed showing that the new Java model was also broken
My concern
The POPL community is too focused on trying to model the warts and moles of existing processor memory models, as opposed to defining what a good model ought to be
For me, the Java memory model was broken because it did not have an operational (computational) interpretation. This danger is present in the current C11 effort.
now about processors...
Instruction set specifications
IBM was way ahead of its time when, in 1964, it published an instruction set definition and 6 implementations of it
The state of the art at that time:
From the reference manual of the IBM 650, a drum machine with 44 instructions,
Instruction: 60 1234 1009
• “Load the contents of location 1234 into the distribution; put it also into the upper accumulator; set lower accumulator to zero; and then go to location 1009 for the next instruction.”
No serious problems until multiprocessors appeared (early seventies)
Sequential Consistency (SC): A Memory Model
“A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program”
Leslie Lamport
[Figure: six processors (P) connected to a single memory (M)]
Dijkstra/Dekker had already used this model in 1966 in their solution to the readers and writer problem (No multiprocessor – concurrency in OS)
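A minimal sketch of what the SC definition permits, under stated assumptions: the two-processor program is the classic store-buffering test (the access pattern at the heart of Dekker-style mutual exclusion), chosen here for illustration and not taken from the slides, and the names `interleavings` and `run` are hypothetical. It enumerates every sequential order that respects each processor's program order and collects the resulting register values.

```python
from itertools import combinations

# Each processor's program is a list of operations in program order.
# ("St", addr, value) writes memory; ("Ld", addr, reg) reads into a register.
P0 = [("St", "x", 1), ("Ld", "y", "r0")]
P1 = [("St", "y", 1), ("Ld", "x", "r1")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        slots = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one sequential order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "St":
            mem[addr] = arg          # arg is the value to store
        else:
            regs[arg] = mem[addr]    # arg is the destination register
    return regs["r0"], regs["r1"]

# All (r0, r1) results that some sequentially consistent execution can produce.
sc_outcomes = {run(s) for s in interleavings(P0, P1)}
print(sorted(sc_outcomes))
```

In this toy program the result r0 = r1 = 0 never appears in `sc_outcomes`: no single sequential order lets both loads run before both stores while keeping program order.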
Memory Model Issue
Architectural optimizations that are correct for uniprocessors often violate sequential consistency and result in a new memory model for multiprocessors
Memory models have rarely been designed deliberately by architects; rather, they have emerged as a way of characterizing the legal behaviors of the processors they have built
A Real Multiprocessor
Processors are pipelined and generate many Ld/St requests, which are processed in a pipelined and concurrent manner, while in Lamport’s SC system only one instruction and at most one memory request is processed at a time!
[Figure: six processors (P), two rows of caches ($), and a memory (M). Arrow labels: Ld or St request; Ld response / St Ack; cache coherence msgs]
How to make the real multiprocessor behave like SC?
Implementing SC
Make the real system process one memory request at a time
This is like throwing the baby out with the bath water
Let processors generate many requests but make caches “behave” in the SC manner
for performance reasons we need to process Ld requests out of order (cache-hit vs cache-miss) and in a pipelined manner
solution: make adjustments to both the memory caches and the pipelined processors
Two surprising facts
Cache Coherence (CC) protocols are not affected by memory models
puzzle: what is the correctness criterion for a CC protocol?
Processors also pay minimal attention to memory models but make sure it is possible to mimic SC behavior
Processors provide memory fences and other ad hoc techniques to enforce SC if desired
Perhaps we should think of a processor model independent of a memory model
A simple processor sans a memory model
issues a mem request and waits for an answer
a flag to indicate if the processor is waiting for a memory response
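A minimal sketch of such a processor, under stated assumptions: the three-instruction mini-ISA (`Li`, `St`, `Ld`), the class name `SimpleProcessor`, the queue names `p2m`/`m2p` used as Python deques, and the stand-in `memory_step` are all hypothetical and chosen only to make the waiting flag concrete.

```python
from collections import deque

class SimpleProcessor:
    """One-instruction-at-a-time processor; all names are illustrative."""
    def __init__(self, program):
        self.program = program   # list of instruction tuples, see step()
        self.pc = 0
        self.regs = {}
        self.waiting = False     # True while a memory request is outstanding
        self.p2m = deque()       # requests sent to memory
        self.m2p = deque()       # responses received from memory

    def done(self):
        return self.pc >= len(self.program) and not self.waiting

    def step(self):
        op = self.program[self.pc]
        if self.waiting:
            if not self.m2p:
                return                       # stall: no answer yet
            resp = self.m2p.popleft()
            if op[0] == "Ld":
                self.regs[op[2]] = resp      # Ld response carries the value
            self.waiting = False
            self.pc += 1
        elif op[0] == "Ld":                  # ("Ld", addr, dst_reg)
            self.p2m.append(("Ld", op[1]))
            self.waiting = True
        elif op[0] == "St":                  # ("St", addr, src_reg)
            self.p2m.append(("St", op[1], self.regs[op[2]]))
            self.waiting = True
        else:                                # ("Li", dst_reg, constant)
            self.regs[op[1]] = op[2]
            self.pc += 1

def memory_step(mem, proc):
    """Stand-in memory: answer one pending request from proc, if any."""
    if proc.p2m:
        req = proc.p2m.popleft()
        if req[0] == "Ld":
            proc.m2p.append(mem.get(req[1], 0))
        else:
            mem[req[1]] = req[2]
            proc.m2p.append("ack")           # St Ack

mem = {}
cpu = SimpleProcessor([("Li", "r1", 7), ("St", "a", "r1"), ("Ld", "a", "r2")])
while not cpu.done():
    cpu.step()
    memory_step(mem, cpu)
```

Nothing in the processor says how memory orders requests from different processors; it only issues a request, sets the flag, and waits, which is why the same processor description can be paired with different memory systems.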
Cache-coherent memory systems
Use cache-coherence protocols to make MC behave like MRO (aka atomic store)
Transition systems for CC protocols are complex
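[Figure: three memory systems built around a memory (M): MSC, MRO, and MC, where MC places rows of caches ($) in front of M. Colors represent addresses.]
MSC selects the first request from some queue
MRO selects the first request for any address from some queue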
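The difference between the two selection rules can be sketched as follows. This is an illustration under assumptions, not the transition systems from the talk: requests are modeled as hypothetical `(kind, address)` pairs, each queue is a Python list with the oldest request first, and the function names are invented.

```python
def msc_choices(queues):
    """(queue index, position) pairs MSC may pick: the head of each queue."""
    return [(q, 0) for q, reqs in enumerate(queues) if reqs]

def mro_choices(queues):
    """Pairs MRO may pick: in each queue, the oldest request to each address."""
    choices = []
    for q, reqs in enumerate(queues):
        seen = set()
        for pos, (_kind, addr) in enumerate(reqs):
            if addr not in seen:      # first request for this address here
                seen.add(addr)
                choices.append((q, pos))
    return choices

queues = [
    [("Ld", "a"), ("St", "b"), ("Ld", "a")],   # queue 0, oldest first
    [("St", "b"), ("St", "b")],                # queue 1
]
print(msc_choices(queues))   # only the two queue heads
print(mro_choices(queues))   # also the St to b sitting behind the Ld in queue 0
```

MRO may reorder requests to different addresses from the same queue, but it never reorders two requests to the same address; every choice available to MSC is also available to MRO.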
Implementing SC using MRO
Since MRO can process requests out of order, make the processor verify whether processing the requests out of order makes any difference
How? Speculative Loads!
[Figure: six processors (P) connected to a memory (M)]
Speculative processors (PS)
When a processor speculates correctly, pc must match npc; if it does not, rob is emptied out and ppc is set to npc
The instruction set specifies the next value of (state,npc) given the current value of (state,npc)
Therefore, a correct PS must guarantee that the commit slot has the right values for (npc’,state’) when pc and npc match
[Figure labels: ppc, state, npc, rob (reorder buffer), next pc predictor, commit slot (pc, npc’, state’), to/from memory]
What about Ld? St?
Speculative Ld/St
NO speculative Sts; a St is issued only when it reaches the commit slot
Speculative Lds are harmless because they do not affect the state of the external memory
However, a speculative Ld can get a wrong value because PS may have read it too soon
Solution: issue the Ld again when it reaches the commit slot and compare the values;
mismatch means speculation failure (empty out rob)
To guarantee SC no more than one St or verification Ld should be outstanding!
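The verification-load idea can be sketched in a few lines. This is a toy model under assumptions: `SpeculativeLoadUnit`, its method names, and the dict standing in for external memory are hypothetical, only Lds are tracked in the rob, and a real processor would also restart fetch after a failure.

```python
class SpeculativeLoadUnit:
    """Toy model of speculative Lds checked by a verification Ld at commit."""
    def __init__(self, memory):
        self.memory = memory   # dict standing in for the external memory
        self.rob = []          # oldest entry first

    def issue_speculative_ld(self, addr):
        """Read early and remember the value that was observed."""
        entry = {"addr": addr, "spec_value": self.memory.get(addr, 0)}
        self.rob.append(entry)
        return entry

    def commit_oldest(self):
        """Issue the Ld again at the commit slot and compare the values.

        Returns (ok, value). On a mismatch the whole rob is emptied out.
        """
        entry = self.rob[0]
        verified = self.memory.get(entry["addr"], 0)
        if verified == entry["spec_value"]:
            self.rob.pop(0)
            return True, verified
        self.rob.clear()       # speculation failure
        return False, verified

memory = {"a": 0, "b": 5}
unit = SpeculativeLoadUnit(memory)
unit.issue_speculative_ld("a")       # reads 0, possibly too soon
unit.issue_speculative_ld("b")
memory["a"] = 1                      # a St from another processor lands
first = unit.commit_oldest()         # verification Ld sees 1, not 0
rob_after_failure = list(unit.rob)

unit.issue_speculative_ld("a")       # re-executed after the squash
second = unit.commit_oldest()        # values now agree
```

The sketch verifies one Ld at a time, in line with the requirement above that no more than one St or verification Ld be outstanding.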
Vector of systems
System P can be lifted to a vector of Ps:
\[
\frac{(s[p], \ldots) \rightarrow_{P} (\sigma', \ldots)}{(s, \ldots) \rightarrow_{P^*} (s[p := \sigma'], \ldots)}
\]
System A simulating System B
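Definition ($A \sqsubseteq B$): If system A can make a move then system B can also make a similar move
$A \sqsubseteq B$ is aka A implements B or A is sound wrt B
\[
s_1 \rightarrow_A s_2 \;\Longrightarrow\; f(s_1) \rightarrow_B^{+} f(s_2)
\]
where f maps states of A to states of B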
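For small finite systems the definition can be checked by brute force. The sketch below is only an illustration under assumptions: transition systems are given as explicit sets of (from, to) edges, f is a Python dict, the "+" is read as "one or more steps of B", and the function names are hypothetical. It says nothing about how the Coq proofs are structured.

```python
def reachable_in_one_or_more(edges, start):
    """States reachable from start using at least one step of edges."""
    frontier = [t for (s, t) in edges if s == start]
    seen = set(frontier)
    while frontier:
        u = frontier.pop()
        for (s, t) in edges:
            if s == u and t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def simulates(a_edges, b_edges, f):
    """True if every step s1 -> s2 of A is matched by f(s1) ->+ f(s2) in B."""
    return all(f[s2] in reachable_in_one_or_more(b_edges, f[s1])
               for (s1, s2) in a_edges)

# A makes one big step where B needs two small ones.
a_edges = {("a0", "a1")}
b_edges = {("b0", "b1"), ("b1", "b2")}

good_f = {"a0": "b0", "a1": "b2"}
bad_f = {"a0": "b2", "a1": "b0"}     # B cannot move from b2 at all

print(simulates(a_edges, b_edges, good_f))
print(simulates(a_edges, b_edges, bad_f))
```

With `good_f` the single step of A is matched by the two-step path b0, b1, b2 in B; with `bad_f` no matching path exists, so the check fails.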
Main theorems
Theorem (cache-coherent memory): $M_C \sqsubseteq M_{RO}$
Theorem (correctly speculating processor): $P_S \sqsubseteq P_{ref}$ (one instruction-at-a-time)
Theorem: $P_S + M_C \sqsubseteq SC$
This follows easily from the lemma $P_{ref} + M_{RO} \sqsubseteq P_{ref} + M_{SC}$ (Definition of SC)
All proofs are done in Coq by Murali under Adam’s guidance
Conclusion
The architectural idea of verification loads can be shown to work formally
Proof checkers (e.g., Coq) have come of age – this was not the case in 2000; we had tried using PVS for a cache-coherence proof
Modular proofs are essential to make progress in such complex systems
e.g., is the bug in the cache-coherence protocol or the speculative processor implementation?
We can express these transition systems directly in Bluespec which can be synthesized into hardware
Thanks
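Lamport’s SC system
Ld rule:
\[
\frac{\mathit{decode}(pcs[p], s[p]) = Ld(a,x) \qquad \mathit{execute}(Ld(a,x), pcs[p], s[p], m[a]) = (pc', \Delta)}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s[p := s[p] + \Delta],\; m)}
\]
St rule:
\[
\frac{\mathit{decode}(pcs[p], s[p]) = St(a,x) \qquad \mathit{execute}(St(a,x), pcs[p], s[p], \_) = (pc', St(a,v))}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s,\; m[a := v])}
\]
Non-mem instruction:
\[
\frac{\mathit{decode}(pcs[p], s[p]) = Nm(a,x) \qquad \mathit{execute}(Nm(a,x), pcs[p], s[p], \_) = (pc', \Delta)}{(pcs, s, m) \rightarrow (pcs[p := pc'],\; s[p := s[p] + \Delta],\; m)}
\]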
Speculative Out-of-order Processor
decode(pcs[p],s[p]) = Ld(a,x) execute(Ld(a,x),pcs[p], s[p],m[a]) = (pc’,)
(s,pc,) (pcs[p:=pc’], s[p:=+],m)Ld rule
decode(pcs[p],s[p]) = St(a,x) execute(St(a,x),pcs[p], s[p],_) = (pc’,St(a,v))
(pcs,s,m) (pcs[p:=pc’], s,m[a:=v])St rule
decode(pcs[p],s[p]) = Nm(a,x) execute(Nm(a,x),pcs[p], s[p],_) = (pc’, )
(pcs,s,m) (pcs[p:=pc’], s[p:=+],m)
Non-meminstruction
Decoupled Systems
Processor transition system P: $(pc, \sigma, p2m, m2p, \ldots) \rightarrow (pc', \sigma', p2m', m2p', \ldots)$
Memory transition system M: $(m, p2m, m2p) \rightarrow (m', p2m', m2p')$
Each system MC, MRO, Pref, MSC can be described as a transition system using a small number of rules