multiprocessor architectures for speculative multithreading josep torrellas, university of illinois...
TRANSCRIPT
Multiprocessor Architectures for Speculative Multithreading Josep Torrellas, University of Illinois
The Bulk Multicore Architecture for Programmability
Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Josep Torrellas, The BULK Multicore Architecture
Acknowledgments
Key contributors:
• Luis Ceze
• Calin Cascaval
• James Tuck
• Pablo Montesinos
• Wonsun Ahn
• Milos Prvulovic
• Pin Zhou
• YY Zhou
• Jose Martinez
Challenges for Multicore Designers
• 100 cores/chip coming & there is little parallel SW
– System architecture should support programmable environment
• User-friendly concurrency and consistency models
• Always-on production-run debugging
Challenges for Multicore Designers (II)
• Decreasing transistor size will make designing cores hard
– Design rules too hard to satisfy manually: just shrink the core
– Cores will become commodity
• Big cores, small cores, specialized cores…
– Innovation will be in cache hierarchy & network
Challenges for Multicore Designers (III)
• We will be adding accelerators:
– Accelerators need to have the same (simple) interface to the cache coherent fabric as processors
– Need simple memory consistency models
A Vision of Year 2015-2018 Multicore
• 128+ cores per chip
• Simple shared-memory programming model(s):
– Support for shared memory (perhaps in groups of procs)
– Enforce interleaving restrictions imposed by the language & concurrency model
– Sequential memory consistency model for simplicity
• Sophisticated always-on debugging environment
– Deterministic replay of parallel programs with no log
– Data race detection at production-run speed
– Pervasive program monitoring
Proposal: The Bulk Multicore
• Idea: Eliminate the commit of instructions one at a time
• Mechanism:
– By default, processors commit a chunk of instructions at a time (e.g., 2,000 dynamic instructions)
– Chunks execute atomically and in isolation (using buffering and undo)
– Memory effects of chunks summarized in HW signatures
• Advantages over current:
– Higher programmability
– Higher performance
– Simpler hardware
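To make the chunk mechanism concrete, here is a minimal software sketch of buffered, atomic chunk commit. It assumes a dictionary as shared memory and a per-core store buffer; the class `ChunkedCore` and its method names are hypothetical and only illustrate the "buffering and undo" idea, not the actual hardware design.

```python
class ChunkedCore:
    """Toy core that buffers stores and commits them one chunk at a time."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value
        self.buffer = {}          # speculative stores of the current chunk

    def load(self, addr):
        # A chunk sees its own buffered stores first, then committed memory.
        return self.buffer.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        self.buffer[addr] = value  # invisible to other cores until commit

    def commit(self):
        self.memory.update(self.buffer)  # all stores become visible together
        self.buffer.clear()

    def squash(self):
        self.buffer.clear()              # undo: discard the whole chunk


memory = {"A": 1}
core = ChunkedCore(memory)
core.store("A", 2)
core.store("B", 3)
before_commit = dict(memory)   # other cores still see the old state
core.commit()
after_commit = dict(memory)
core.store("A", 99)
core.squash()                  # a squashed chunk leaves no trace
after_squash = dict(memory)
```

Because nothing leaves the buffer before `commit`, a chunk either takes effect entirely or not at all, which is the property the rest of the talk builds on.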
Rest of the Talk
• The Bulk Multicore
• How it improves programmability
• What’s next?
Hardware Mechanism: Signatures
• Hardware accumulates the addresses read/written in signatures
• Read and Write signatures
• Summarize the footprint of a Chunk of code
[ISCA06]
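A signature can be pictured as a fixed-size, lossy encoding of a set of addresses. The sketch below assumes a Bloom-filter-style bit-vector; the class `Signature`, the 1024-bit size, and the hash choice are illustrative assumptions, not the encoding used in the hardware.

```python
import hashlib


class Signature:
    """Fixed-size, lossy summary of a set of addresses (Bloom-filter style)."""

    def __init__(self, bits=1024, hashes=2):
        self.bits = bits
        self.hashes = hashes
        self.value = 0            # the signature register, as a bit-vector

    def _positions(self, addr):
        # Derive `hashes` bit positions from the address.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def insert(self, addr):
        for pos in self._positions(addr):
            self.value |= 1 << pos

    def may_contain(self, addr):
        # No false negatives; false positives are possible by design.
        return all((self.value >> pos) & 1 for pos in self._positions(addr))


# Accumulate the footprint of one chunk into a read and a write signature.
read_sig, write_sig = Signature(), Signature()
for op, addr in [("ld", "X"), ("st", "B"), ("st", "C"), ("ld", "Y")]:
    (read_sig if op == "ld" else write_sig).insert(addr)
```

The register size stays constant no matter how many addresses the chunk touches; the price is that a membership test can only answer "maybe" or "definitely not".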
Signature Operations In Hardware
Inexpensive operations on groups of addresses
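As a rough illustration of why such operations are cheap, the sketch below models a signature as a 64-bit integer and implements intersection, union, emptiness, and membership as single bitwise operations. The encoding (one bit per cache line, chosen by a modulo) and all function names are assumptions made for this example.

```python
LINE_SHIFT = 6   # assumption: 64-byte cache lines
SIG_BITS = 64    # assumption: a 64-bit signature register


def encode(addresses):
    """Hash each address's cache line to one bit of the signature."""
    sig = 0
    for addr in addresses:
        sig |= 1 << ((addr >> LINE_SHIFT) % SIG_BITS)
    return sig


def intersect(s1, s2):
    return s1 & s2          # one bitwise AND for whole groups of addresses


def union(s1, s2):
    return s1 | s2          # one bitwise OR


def is_empty(sig):
    return sig == 0


def member(addr, sig):
    return not is_empty(intersect(encode([addr]), sig))


w0 = encode([0x1000, 0x1040])   # sets bits 0 and 1
r1 = encode([0x1040, 0x2080])   # sets bits 1 and 2
overlap = intersect(w0, r1)     # bit 1: both touched line 0x1040
aliased = member(0x3000, w0)    # 0x3000 also maps to bit 0: a false positive
```

The last line shows the conservative side of the encoding: an address that was never inserted can still appear to be a member, so a conflict check may fire spuriously but never misses a real overlap.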
Executing Chunks Atomically & In Isolation: Simple!
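Thread 0 chunk: ld X, st B, st C, ld Y
W0 = sig(B,C), R0 = sig(X,Y)
Thread 1 chunk: ld B, st T, ld C
W1 = sig(T), R1 = sig(B,C)
Thread 0 commits W0: Thread 1 checks (W0 ∩ R1) ∪ (W0 ∩ W1) and squashes its chunk if the result is non-empty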
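The two-thread example on this slide can be replayed in a few lines. The sketch uses exact Python sets as stand-ins for hardware signatures (an assumption that removes false positives); `footprint` and `must_squash` are hypothetical helper names. It evaluates the two intersections W0 ∩ R1 and W0 ∩ W1 when Thread 0 commits.

```python
def footprint(chunk):
    """Return the (read set, write set) of a chunk of (op, address) pairs."""
    reads = {addr for op, addr in chunk if op == "ld"}
    writes = {addr for op, addr in chunk if op == "st"}
    return reads, writes


def must_squash(committing_writes, reads, writes):
    """A running chunk is squashed if the committing chunk wrote anything
    that the running chunk has read or written."""
    return bool((committing_writes & reads) | (committing_writes & writes))


chunk0 = [("ld", "X"), ("st", "B"), ("st", "C"), ("ld", "Y")]   # Thread 0
chunk1 = [("ld", "B"), ("st", "T"), ("ld", "C")]                # Thread 1

r0, w0 = footprint(chunk0)
r1, w1 = footprint(chunk1)

# Thread 0 commits first: Thread 1 has read B and C, which Thread 0 wrote.
squash_thread1 = must_squash(w0, r1, w1)
# Had Thread 1 committed first, Thread 0 would have been unaffected.
squash_thread0 = must_squash(w1, r0, w0)
```

With real signatures the same check is two bitwise ANDs and a zero test, at the cost of occasional unnecessary squashes.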
Chunk Operation + Signatures: Bulk
[Figure: P1 and P2 each execute chunks of loads and stores (st A, st C, ld D, st X, st D, ld C)]
• Execute each chunk atomically and in isolation
• (Distributed) arbiter ensures a total order of chunk commits [ISCA07]
• Supports Sequential Consistency [Lamport79]:
– Low hardware complexity: need not snoop ld buffer for consistency
– High performance: instructions are fully reordered by HW (loads and stores make it in any order to the sig)
[Figure: logical picture. P1, P2, P3, ..., PN share Mem; the chunks (st A, st C), (ld D, st X), (st A, ld C), and (st D, st C) commit one after another]
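One way to picture the arbiter is as a single point that serializes chunk commits and rejects a chunk whose footprint was overwritten after the chunk started. The sketch below is a centralized toy version that uses per-address version numbers instead of signatures; the `Arbiter` class, `try_commit`, and the `start` parameter are assumptions for illustration only.

```python
class Arbiter:
    """Toy centralized arbiter that totally orders chunk commits."""

    def __init__(self):
        self.memory = {}
        self.commit_order = []   # total order of committed chunks
        self.version = {}        # address -> index of the commit that last wrote it

    def try_commit(self, proc, start, reads, writes):
        """`start` is the number of commits already done when the chunk began.
        `reads` is a set of addresses, `writes` maps address -> value."""
        touched = set(reads) | set(writes)
        if any(self.version.get(addr, -1) >= start for addr in touched):
            return False         # footprint overwritten meanwhile: squash and retry
        for addr, value in writes.items():
            self.memory[addr] = value
            self.version[addr] = len(self.commit_order)
        self.commit_order.append(proc)
        return True


arbiter = Arbiter()
# P1 and P2 both start before any chunk has committed (start = 0).
ok_p1 = arbiter.try_commit("P1", 0, reads={"C"}, writes={"A": 1, "D": 1})
ok_p2 = arbiter.try_commit("P2", 0, reads={"D"}, writes={"X": 2})     # read stale D
retry_p2 = arbiter.try_commit("P2", 1, reads={"D"}, writes={"X": 2})  # re-executed
```

Since every committed chunk appears to execute at one point in `commit_order`, the resulting execution is sequentially consistent at chunk granularity, regardless of how the accesses inside a chunk were reordered.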
Summary: Benefits of Bulk Multicore
• Gains in HW simplicity, performance, and programmability
• Hardware simplicity:
– Memory consistency support moved away from core
– Toward commodity cores
– Easy to plug in accelerators
• High performance:
– HW reorders accesses heavily (intra- and inter-chunk)
Benefits of Bulk Multicore (II)
• High programmability:
– Invisible to the programming model/language
– Supports Sequential Consistency (SC)
* Software correctness tools assume SC
– Enables novel always-on debugging techniques
* Only keep per-chunk state, not per-load/store state
* Deterministic replay of parallel programs with no log
* Data race detection at production-run speed
Benefits of Bulk Multicore (III)
• Extension: Signatures visible to SW through ISA
– Enables pervasive monitoring
– Enables novel compiler opts
Many novel programming/compiler/tool opportunities
Rest of the Talk
• The Bulk Multicore
• How it improves programmability
• What’s next?
Supports Sequential Consistency (SC)
• Correctness tools assume SC:
– Verification tools that prove software correctness
• Under SC, semantics for data races are clear:
– Easy specifications for safe languages
• Much easier to debug parallel codes (and design debuggers)
• Works with “hand-crafted” synchronization
Deterministic Replay of MP Execution
• During Execution: HW records into a log the order of dependences between threads
• The log has captured the “interleaving” of threads
• During Replay: Re-run the program
– Enforcing the dependence orders in the log
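The record and replay phases can be sketched in software as follows. This is a simplified model, assuming each instruction is an `(op, address, value)` tuple and the log stores one `(source, destination)` pair per cross-thread dependence; `record` and `replay` are hypothetical names, and real hardware schemes compress the log considerably.

```python
def record(programs, schedule):
    """Run one instruction of thread t per schedule entry and log every
    cross-thread dependence as ((src thread, index), (dst thread, index))."""
    memory, writer, readers, log = {}, {}, {}, []
    loaded = {t: [] for t in programs}
    pc = {t: 0 for t in programs}
    for t in schedule:
        op, addr, value = programs[t][pc[t]]
        if op == "ld":
            sources = [writer.get(addr)]                  # read-after-write
            loaded[t].append(memory.get(addr, 0))
            readers.setdefault(addr, []).append((t, pc[t]))
        else:
            # write-after-write and write-after-read
            sources = [writer.get(addr)] + readers.pop(addr, [])
            memory[addr] = value
            writer[addr] = (t, pc[t])
        log += [(src, (t, pc[t])) for src in sources if src and src[0] != t]
        pc[t] += 1
    return loaded, memory, log


def replay(programs, log):
    """Re-run; an instruction may go only after its logged sources are done."""
    waits = {}
    for src, dst in log:
        waits.setdefault(dst, []).append(src)
    memory, done = {}, set()
    loaded = {t: [] for t in programs}
    pc = {t: 0 for t in programs}
    while any(pc[t] < len(programs[t]) for t in programs):
        for t in programs:       # any scheduler works; the log constrains it
            here = (t, pc[t])
            ready = all(src in done for src in waits.get(here, []))
            if pc[t] < len(programs[t]) and ready:
                op, addr, value = programs[t][pc[t]]
                if op == "st":
                    memory[addr] = value
                else:
                    loaded[t].append(memory.get(addr, 0))
                done.add(here)
                pc[t] += 1
    return loaded, memory


programs = {
    "P1": [("st", "a", 1), ("st", "b", 1)],
    "P2": [("ld", "a", None), ("st", "b", 2)],
}
loaded, memory, log = record(programs, ["P1", "P2", "P2", "P1"])
replayed_loaded, replayed_memory = replay(programs, log)
```

Without the log, the simple scheduler in `replay` would let P1's `st b` run before P2's, leaving a different final value in `b`; with the log, the original outcome is reproduced.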
Conventional Schemes
[Figure: P1 executes Wa and Wb (instructions n1, n2); P2 executes Ra and Wb (instructions m1, m2)]
P2's Log: (P1, n1, m1), (P1, n2, m2)
• Potentially large logs
Bulk: Log Necessary is Minuscule [ISCA08]
[Figure: P1 and P2 execute chunk1 and chunk2, which contain the accesses Wa, Wb, Rc, Rb, Wc, and Wa]
• During Execution:
– Commit the instructions in chunks, not individually
Combined Log of all Procs: P1, P2, Pi
If we fix the chunk commit interleaving: Combined Log = NIL
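With chunked commit, the only nondeterminism left to record is which processor commits next. The sketch below assumes chunks are lists of `(op, address, value)` tuples; `run_chunks` and `round_robin_order` are hypothetical helpers. It shows a log of processor IDs being enough for replay, and a predefined commit order making the log unnecessary.

```python
def run_chunks(chunks, commit_order):
    """Commit each processor's next chunk atomically, following commit_order."""
    memory, loaded = {}, []
    next_chunk = {p: 0 for p in chunks}
    for p in commit_order:
        for op, addr, value in chunks[p][next_chunk[p]]:
            if op == "st":
                memory[addr] = value
            else:
                loaded.append((p, addr, memory.get(addr, 0)))
        next_chunk[p] += 1
    return memory, loaded


def round_robin_order(chunks):
    """A predefined interleaving: P1, P2, ..., repeated until chunks run out."""
    order = []
    for i in range(max(len(c) for c in chunks.values())):
        order += [p for p in sorted(chunks) if i < len(chunks[p])]
    return order


chunks = {
    "P1": [[("st", "a", 1), ("st", "b", 1)]],
    "P2": [[("ld", "c", None), ("ld", "b", None), ("st", "c", 2), ("st", "a", 2)]],
}
log = ["P2", "P1"]                    # recorded: one processor ID per commit
original = run_chunks(chunks, log)
replayed = run_chunks(chunks, log)    # replay follows the same tiny log
fixed = run_chunks(chunks, round_robin_order(chunks))   # nothing to log at all
```

Compared with a per-dependence log, one entry here covers thousands of instructions, and a fixed interleaving removes even that entry, presumably at some cost in commit flexibility.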
Data Race Detection at Production-Run Speed
[Figure: two threads, each executing a Lock L ... Unlock L critical section]
• If we detect communication between…
– Ordered chunks: not a data race
– Unordered chunks: data race
[ISCA03]
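The ordered/unordered rule can be sketched with per-chunk footprints and a happens-before graph. In this toy model, chunks are named strings, footprints are exact sets, and `edges` lists program-order and synchronization-order pairs; all names are assumptions for illustration.

```python
from itertools import combinations


def communicates(a, b):
    """Two chunks communicate if one wrote something the other read or wrote."""
    return bool(a["w"] & (b["r"] | b["w"]) or b["w"] & (a["r"] | a["w"]))


def ordered(a, b, edges):
    """True if a happens-before b, or b happens-before a, through the edges."""
    def reaches(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(d for s, d in edges if s == node)
        return False
    return reaches(a, b) or reaches(b, a)


def find_races(chunks, edges):
    return [(a, b) for a, b in combinations(sorted(chunks), 2)
            if communicates(chunks[a], chunks[b]) and not ordered(a, b, edges)]


chunks = {
    "T0.c0": {"r": set(), "w": {"x"}},   # writes x, then Unlock L
    "T0.c1": {"r": {"y"}, "w": set()},   # reads y with no synchronization
    "T1.c0": {"r": {"x"}, "w": set()},   # Lock L, then reads x
    "T1.c1": {"r": set(), "w": {"y"}},   # writes y with no synchronization
}
edges = [
    ("T0.c0", "T0.c1"), ("T1.c0", "T1.c1"),   # program order
    ("T0.c0", "T1.c0"),                       # Unlock L -> Lock L
]
races = find_races(chunks, edges)
```

The access to `x` is ordered by the lock and is not reported; the access to `y` crosses two unordered chunks and is flagged. Only per-chunk state is consulted, never per-load or per-store state.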
Different Synchronization Ops
• Lock: Lock L ... Unlock L in each thread
• Flag: Set F in one thread, Wait F in another
• Barrier: Barrier in each thread
Benefits of Bulk Multicore (III)
• Extension: Signatures visible to SW through ISA
– Enables pervasive monitoring [ISCA04]
* Support numerous watchpoints for free
– Enables novel compiler opts [ASPLOS08]
* Function memoization
* Loop-invariant code motion
Pervasive Monitoring: Attaching a Monitor Function to Address
• Watch memory location
• Trigger monitoring function when it is accessed
Watch(addr, usr_monitor)
usr_monitor(Addr){
…..
}
[Figure: the main program thread executes instr, instr, *p = ..., instr; the access *p= triggers the monitoring function, which runs alongside the rest of the program]
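A software analogue of the Watch call might look like the sketch below. It assumes a toy memory object whose loads and stores go through methods; `WatchedMemory` and its methods are hypothetical, and the real mechanism would check addresses in hardware rather than with a dictionary lookup.

```python
class WatchedMemory:
    """Toy memory that triggers a monitor function on watched addresses."""

    def __init__(self):
        self.cells = {}
        self.monitors = {}            # address -> monitor function

    def watch(self, addr, monitor):
        self.monitors[addr] = monitor

    def store(self, addr, value):
        self.cells[addr] = value
        if addr in self.monitors:
            self.monitors[addr](addr)   # triggered by the access

    def load(self, addr):
        if addr in self.monitors:
            self.monitors[addr](addr)
        return self.cells.get(addr, 0)


events = []


def usr_monitor(addr):
    events.append(addr)               # a real monitor would run checks here


mem = WatchedMemory()
mem.watch(0x10, usr_monitor)
mem.store(0x20, 1)                    # not watched: no trigger
mem.store(0x10, 5)                    # the "*p = ..." case: monitor fires
value = mem.load(0x10)                # loads trigger it as well
```

The main program contains no explicit checks; the monitor runs only when a watched location is touched, which is what makes a large number of watchpoints affordable.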
Enabling Novel Compiler Optimizations
New instruction: Begin/End collecting addresses into sig
Instruction: Begin/End Disambiguation Against Sig
Instruction: Begin/End Remote Disambiguation
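The three instruction pairs named on these slides (collecting addresses, disambiguation against a signature, and remote disambiguation) can be modeled roughly as mode flags on a core. The sketch below is speculative: the slides give only the instruction names, so the `SigCore` class, its flags, and the conflict semantics are assumptions, and an exact set stands in for the signature.

```python
class SigCore:
    """Toy model of a core with one software-visible signature."""

    def __init__(self):
        self.sig = set()            # exact set standing in for a HW signature
        self.collecting = False     # between Begin/End collecting
        self.local_check = False    # between Begin/End disambiguation
        self.remote_check = False   # between Begin/End remote disambiguation
        self.conflict = False

    def access(self, addr):
        """A load or store issued by this core."""
        if self.collecting:
            self.sig.add(addr)
        if self.local_check and addr in self.sig:
            self.conflict = True

    def remote_access(self, addr):
        """A coherence event caused by another core's access."""
        if self.remote_check and addr in self.sig:
            self.conflict = True


core = SigCore()
core.collecting = True              # Begin collecting addresses into sig
core.access("p")
core.access("q")
core.collecting = False             # End collecting

core.local_check = True             # Begin disambiguation against sig
core.access("z")
clean_so_far = not core.conflict    # z is not in the signature
core.access("q")                    # q is: the conflict is flagged
core.local_check = False            # End disambiguation
local_conflict = core.conflict

other = SigCore()
other.sig = {"p"}
other.remote_check = True           # Begin remote disambiguation
other.remote_access("p")            # another core touched p
remote_conflict = other.conflict
```

In this model the collecting and disambiguation windows are not meant to overlap, since an access would then conflict with its own insertion.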
Example Opt: Function Memoization
• Goal: skip the execution of functions whose outputs are known
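One plausible reading of signature-based memoization is: collect the addresses a function reads, save its result, and reuse that result until some store hits the collected footprint. The sketch below follows that reading with exact sets in place of signatures; the `Memoizer` class is hypothetical, and it assumes the memoized function has no side effects.

```python
class Memoizer:
    """Toy memoization guarded by the read footprint of each call."""

    def __init__(self, memory):
        self.memory = memory
        self.table = {}   # (function name, args) -> (read footprint, result)

    def store(self, addr, value):
        self.memory[addr] = value
        # Disambiguate the store against every saved read footprint.
        self.table = {key: (reads, result)
                      for key, (reads, result) in self.table.items()
                      if addr not in reads}

    def call(self, func, *args):
        key = (func.__name__, args)
        if key in self.table:
            return self.table[key][1]    # skip execution: inputs untouched
        reads = set()

        def load(addr):
            reads.add(addr)              # collect the addresses the function reads
            return self.memory[addr]

        result = func(load, *args)
        self.table[key] = (reads, result)
        return result


calls = []


def scaled(load, k):
    calls.append(k)                      # records each real execution
    return load("base") * k


m = Memoizer({"base": 10, "other": 0})
first = m.call(scaled, 3)                # executes
m.store("other", 7)                      # unrelated store: entry stays valid
second = m.call(scaled, 3)               # skipped
m.store("base", 20)                      # hits the footprint: entry invalidated
third = m.call(scaled, 3)                # executes again
```

With a hardware signature the footprint check is conservative, so a false positive would only cause an unnecessary re-execution, never a stale result.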
Example Opt: Loop-Invariant Code Motion
Rest of the Talk
• The Bulk Multicore
• How it improves programmability
• What’s next?
What is Going On?
• Bulk Multicore hardware architecture
• Compiler (Matt Frank et al.)
• Libraries and run-time systems
• Language support (Marc Snir, Vikram Adve et al.)
• Debugging tools (Sam King, Darko Marinov et al.)
• FPGA prototype (at Intel)
Summary: The Bulk Multicore for Year 2015-2018
• 128+ cores/chip, shared-memory (perhaps in groups)
• Simple HW with commodity cores
– Memory consistency checks moved away from the core
• High performance shared-memory programming model
– Execution in programmer-transparent chunks
– Signatures for disambiguation, cache coherence, and compiler opts
– High-performance sequential consistency with simple HW
• High programmability: Sophisticated always-on debugging support
– Deterministic replay of parallel programs with no log (DeLorean)
– Data race detection for production runs (ReEnact)
– Pervasive program monitoring (iWatcher)