A Heterogeneous Lightweight Multithreaded
Architecture
Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge,
Paul Springer, and Gary Block
University of Notre Dame
MTAAP 2007, Long Beach, CA
Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work
Architecture Highlights
Processing-In-Memory (PIM) based
- Effectively attacks the memory wall problem
Highly multithreaded
- Successfully hides large latencies and contention
Heterogeneous; supports Extended Memory Semantics (EMS)
- Extremely low overhead for context switching and synchronization
Multithreaded Processors
Multithreading reduces the processor idle time
Thread context is part of the processor
Multithreading Machines
1960s CDC 6600
1970s I/O Processor for the Space Shuttle
1980s Denelcor HEP
1990s Cray/Tera MTA
2000+ Cray Eldorado
2000+ Intel Xeon
2000+ Sun Niagara
[Figure: Single Threaded vs. Multithreaded]
Lightweight Threads
Thread context (frame) is 32 double words (256 bytes)
- Two double words are reserved for the thread status; the other 30 are general-purpose registers.
- No other per-thread state, which makes multithreading easy.
Frames are stored in memory (no register file)
- Registers are aliases for memory locations
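The frame layout described above can be sketched in a few lines. This is only an illustration: the slide gives the frame size and the status/register split, but the placement of the two status double words at the start of the frame, and the names `register_address` and `Memory`, are assumptions made for this example.

```python
DWORD_BYTES = 8                              # one double word carries 64 bits of data
FRAME_DWORDS = 32                            # frame size stated on the slide
STATUS_DWORDS = 2                            # double words reserved for thread status
GPR_COUNT = FRAME_DWORDS - STATUS_DWORDS     # 30 general-purpose registers
FRAME_BYTES = FRAME_DWORDS * DWORD_BYTES     # 256 bytes


def register_address(frame_base, reg):
    """Return the memory address that register `reg` (0..29) aliases.

    Assumption: the status double words occupy the first two slots of the frame.
    """
    if not 0 <= reg < GPR_COUNT:
        raise ValueError("register index out of range")
    return frame_base + (STATUS_DWORDS + reg) * DWORD_BYTES


class Memory:
    """Toy memory: frames live here, there is no separate register file."""

    def __init__(self):
        self.cells = {}

    def load(self, addr):
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.cells[addr] = value & ((1 << 64) - 1)


# Writing "register 3" of the thread whose frame starts at 0x1000
# is just a store to the aliased memory address.
mem = Memory()
mem.store(register_address(0x1000, 3), 7)
```

Because a register is only an address inside the frame, switching threads in this model amounts to changing the frame base; nothing has to be saved or restored.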
Lightweight Multithreading
Thread creation is fast and inexpensive: a single instruction
- Contrast with pthread creation: kernel intervention and as many as tens of thousands of instructions
Unbounded Multithreading
- Threads are part of the memory system rather than the processor state.
- “Unlimited number” of threads per processor.
- Many opportunities for issuing an instruction.
Ultra-lightweight Processing
- Unbounded multithreading requires low-overhead thread management and synchronization
- At the memory bank: greater data bandwidth, low overhead
Heterogeneous Architecture
Lightweight Processor Chip (LPC)
Issues an instruction from the ready threads on each clock cycle
Architectural support for low-overhead thread management
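To see why issuing from ready threads on every cycle hides latency, a toy model may help. The sketch below is not the SALT simulator and does not model the LPC; it assumes a single issue slot, threads that each issue one instruction and then wait a fixed `latency` in cycles, and first-come-first-served selection among ready threads. The function name and parameters are hypothetical.

```python
from collections import deque


def utilization(num_threads, latency, cycles):
    """Fraction of cycles in which some thread issued an instruction."""
    ready = deque(range(num_threads))   # threads able to issue right now
    waiting = []                        # (wake_cycle, thread_id) pairs
    issued = 0
    for now in range(cycles):
        # Move threads whose outstanding request has completed back to ready.
        still_waiting = []
        for wake, tid in waiting:
            if wake <= now:
                ready.append(tid)
            else:
                still_waiting.append((wake, tid))
        waiting = still_waiting
        # Issue from one ready thread, if there is one; otherwise the slot idles.
        if ready:
            tid = ready.popleft()
            issued += 1
            waiting.append((now + 1 + latency, tid))
    return issued / cycles


for n in (1, 5, 10, 20):
    print(n, utilization(n, latency=9, cycles=100))
```

In this model the issue slot stays busy once the number of threads reaches `latency + 1`; below that, utilization grows in proportion to the thread count.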
Extended Memory Semantics (EMS)
Memory subsystem is constructed of 65-bit dwords
- 64 bits of data
- 1 extension bit; 1: dword is full, 0: dword is empty
Extends Cray MTA E/F bits
- Full/Empty: contains data or not
- Extra states: metadata can contain a frame pointer
Same semantics apply to thread registers
[Figure: dword layout with 64 bits of data/metadata plus the extension bit]
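As a rough illustration of the 65-bit dword, the sketch below packs a 64-bit payload together with the extension bit, using the encoding given on the slide (1 = full, 0 = empty). Placing the extension bit above the payload, and the `DWord` class itself, are assumptions made for this example; the slides do not specify a physical bit order.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass
class DWord:
    payload: int = 0   # 64 bits of data or metadata
    ext: int = 0       # extension bit: 1 = full, 0 = empty

    @property
    def is_full(self):
        return self.ext == 1

    def pack(self):
        """Pack into a 65-bit integer (assumed order: extension bit on top)."""
        return (self.ext << 64) | (self.payload & MASK64)

    @classmethod
    def unpack(cls, raw):
        return cls(payload=raw & MASK64, ext=(raw >> 64) & 1)


word = DWord(payload=42, ext=1)
assert DWord.unpack(word.pack()) == word
```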
Single Producer/Consumer on EMS
LWP behavior for load_fe with A empty.
Location A changes state to “FVE: forward value, leave empty”
Content of A is the target address of the forward operation (all registers also have a memory address).
Completing the Load
How does the LWP complete the load_fe?
store_ef arrives at A
Data associated with store is returned to T2:R2 – this completes the load_fe
Location A changes to the empty state.
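The load_fe/store_ef exchange from the last two slides can be walked through in software. This is a minimal sketch under stated assumptions: the class `EMSMemory`, its dictionaries, and the choice to mark the destination register full on delivery are illustrative, not the hardware design. Cases outside the single producer/consumer pattern raise an error here, since the slides say those require an EMS handler.

```python
EMPTY, FULL, FVE = "empty", "full", "forward value, leave empty"


class EMSMemory:
    def __init__(self):
        self.state = {}   # address -> EMPTY / FULL / FVE (default EMPTY)
        self.data = {}    # address -> value, or forward target when in FVE

    def _deliver(self, target, value):
        # Registers also have memory addresses, so delivery is a plain write.
        # Assumption: the destination becomes full once the value arrives.
        self.data[target] = value
        self.state[target] = FULL

    def load_fe(self, addr, target):
        """Load from a full location and leave it empty."""
        st = self.state.get(addr, EMPTY)
        if st == FULL:
            self._deliver(target, self.data[addr])
            self.state[addr] = EMPTY
        elif st == EMPTY:
            # No data yet: remember where the value must be forwarded.
            self.state[addr] = FVE
            self.data[addr] = target
        else:
            raise NotImplementedError("second waiter: needs an EMS handler")

    def store_ef(self, addr, value):
        """Store to an empty location and leave it full."""
        st = self.state.get(addr, EMPTY)
        if st == EMPTY:
            self.data[addr] = value
            self.state[addr] = FULL
        elif st == FVE:
            # A load is pending: forward the value and leave A empty.
            self._deliver(self.data[addr], value)
            self.state[addr] = EMPTY
        else:
            raise NotImplementedError("store to full: needs an EMS handler")


A, T2_R2 = 0x2000, 0x1020   # hypothetical addresses for A and register T2:R2
mem = EMSMemory()
mem.load_fe(A, T2_R2)       # A is empty, so A enters the FVE state
mem.store_ef(A, 99)         # the store completes the pending load_fe
```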
A More Complex Situation
Consider a multiple producer/consumer problem such as a lock.
- Multiple threads (more than 3) all attempt to acquire the lock.
- Memory requests will be queued up at the target location
- An EMS handler thread is needed to handle the bookkeeping
EMS Handler Overhead
Invoking an EMS handler
- Synchronized memory operations beyond the hardware-supported single producer/consumer scenario
Overhead
- Creating the handler threads
- To queue up memory requests, handlers need to spin on the target memory address to get exclusive access
- Significant overhead in LWP CPU time, NoC traffic, and memory bandwidth
How to alleviate the overhead?
Ultra-Lightweight Processor
Alleviates the burden on the LWP
- Handles thread synchronization and management, and complex atomic memory operations
Simple design, minimal circuitry
At the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory
Multithreaded
Large-scale system
Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work
Simulation Environment
DimC – Diminished C
- An extension of ANSI C
- Exposes low-level architectural features
- Supports lightweight multithreading
SALT – Simulator for the Analysis of LWP Timings
- Contains LWPs, ULWPs, NoC, and memory subsystems.
Benchmark Suite
Two categories of irregular problems.
Complicated control structures such as recursion
- Such programs can achieve decent performance on conventional architectures, but only with great effort.
- Do not necessarily invoke the EMS handler or ULWP
- N-Queens, Fibonacci
Complicated control structures and dynamic data structures
- Very hard to parallelize effectively on conventional SMPs.
- EMS handler or ULWP support is necessary
- Competing agents, SAT solver kernel
N-Queens
Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another.
Irregular problem with dynamic parallel recursion
Thread behavior is hard to predict.
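The recursive structure of the problem can be shown with a short sequential solver. This is an ordinary backtracking sketch, not the DimC benchmark code; the function name and bitmask representation are choices made for this example. The comment marks where a parallel version could, as an assumption, spawn one lightweight thread per subtree.

```python
def count_queens(n, row=0, cols=0, diag1=0, diag2=0):
    """Count placements of n non-attacking queens, filling one row at a time."""
    if row == n:
        return 1
    total = 0
    for col in range(n):
        c = 1 << col                      # column occupied
        d1 = 1 << (row + col)             # anti-diagonal occupied
        d2 = 1 << (row - col + n - 1)     # main diagonal occupied
        if cols & c or diag1 & d1 or diag2 & d2:
            continue
        # Each recursive call explores an independent subtree; in a parallel
        # version this is where a new thread could be created (assumption).
        total += count_queens(n, row + 1, cols | c, diag1 | d1, diag2 | d2)
    return total


print(count_queens(8))   # 92 solutions on the standard 8x8 board
```

How many subtrees survive at each row depends on the placements made so far, which is why the amount of work per thread is hard to predict ahead of time.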
Competing Agents
Multiple agents attempt to update a shared memory location simultaneously
Each agent is implemented by a single thread.
All threads are evenly distributed over four LWPs inside a single LPC.
Complicated control structures and dynamic data structures
Uses separate synchronized loads/stores
Characterizes the effectiveness of the ULWP in reducing the cost of synchronization.
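The competing-agents pattern can be imitated with ordinary OS threads to show what the separate synchronized load and store accomplish. This is a software analogy only: `FullEmptyCell` and `run_agents` are hypothetical names, and Python's `threading.Condition` stands in for the full/empty bit. It says nothing about the cost of the operation on the LWP or ULWP.

```python
import threading


class FullEmptyCell:
    """A shared location with full/empty semantics."""

    def __init__(self, value):
        self._value = value
        self._full = True
        self._cond = threading.Condition()

    def load_fe(self):
        """Wait until full, read the value, leave the cell empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def store_ef(self, value):
        """Wait until empty, write the value, leave the cell full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def run_agents(num_agents, updates_per_agent):
    cell = FullEmptyCell(0)

    def agent():
        for _ in range(updates_per_agent):
            value = cell.load_fe()      # take exclusive ownership of the value
            cell.store_ef(value + 1)    # publish the update, release ownership

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load_fe()


print(run_agents(num_agents=8, updates_per_agent=100))   # 800: no lost updates
```

Between the load and the store the cell is empty, so every other agent blocks; under high contention all agents serialize on this one location, which is the situation the benchmark is designed to stress.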
SAT Solver/zChaff
SAT: the Boolean satisfiability problem (from propositional logic)
- Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design.
Given a Boolean formula (usually in CNF), check whether an assignment of Boolean truth values to the variables in the formula exists such that the formula evaluates to true.
For example, in the CNF formula shown, if x1 is true and x3 is false, then all three clauses are satisfied, regardless of the value of x2.
[Formula: three clauses in CNF over X1, X2, X3]
zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver.
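A small checker makes the definition of satisfiability concrete. The formula used below is hypothetical: it was chosen only because it has three clauses and shares the property described above (x1 true and x3 false satisfies it for either value of x2), and it is not claimed to be the formula from the slide. The search is plain enumeration, not DPLL or zChaff.

```python
from itertools import product

# Hypothetical CNF formula: (x1 or x2) and (x1 or not x3) and (x2 or not x3).
# A positive integer k stands for xk, a negative integer -k for "not xk".
CLAUSES = [[1, 2], [1, -3], [2, -3]]


def satisfies(clauses, assignment):
    """True if every clause contains at least one literal that is true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def all_models(clauses, num_vars):
    """Enumerate every assignment and keep the satisfying ones."""
    models = []
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        if satisfies(clauses, assignment):
            models.append(assignment)
    return models


for x2 in (False, True):
    assert satisfies(CLAUSES, {1: True, 2: x2, 3: False})
```

Enumeration takes time exponential in the number of variables; DPLL-style solvers prune this search with unit propagation and backtracking, which is also what makes their control flow and data structures irregular.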
N-Queens
Successfully exploits all the available parallelism
Completely dynamic, ideal speedup
Saturation is due only to the small data set
Good performance can be achieved on conventional SMPs, but only with great extra effort
Competing Agents
The EMS handler is the bottleneck in high-contention situations
The heterogeneous architecture can achieve unbounded scalability
High contention is no longer a problem in the heterogeneous architecture
SAT Solver/zChaff on Conventional SMPs
The parallel implementation leads to performance degradation
The more processors, the worse the performance
Very hard to achieve good performance on conventional SMPs
Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al., Intel
SAT Solver/zChaff on Heterogeneous Architecture
Ideal speedup
Saturation is due only to the small data set
Successfully exploits all the available parallelism
[Figure: speedup over the serial version]
Outline
Heterogeneous Lightweight Multithreaded Architecture
Simulation environments, benchmarks and results
Conclusions and future work
Conclusions
The Heterogeneous Lightweight Multithreaded Architecture is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs
- Has very low overhead for context switching and synchronization
- Can successfully hide latencies and contention
- Can provide unbounded multithreading and scalability
- Can exploit all the available parallelism in an irregular problem
Future Work
Provide standard language support
Benchmark suites
Large-scale system performance
Comparison with conventional large-scale systems
Acknowledgments
DARPA
- This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.
University of Notre Dame
Caltech/JPL
Cray
Thank you!