a heterogeneous lightweight multithreaded architecture sheng li, amit kashyap, shannon kuntz, jay...

29
A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block University of Notre Dame MTAAP 2007,CA

Upload: clifford-baldwin

Post on 29-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

A Heterogeneous Lightweight Multithreaded

Architecture

A Heterogeneous Lightweight Multithreaded

Architecture

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge,

Paul Springer, and Gary Block

University of Notre Dame

MTAAP 2007,CA

Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge,

Paul Springer, and Gary Block

University of Notre Dame

MTAAP 2007,CA

Page 2: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Outline

Heterogeneous Lightweight Multithreaded Architecture

Simulation environments, benchmarks and results

Conclusions and future work

Page 3: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Architecture Highlights

Processing-In-Memory(PIM) Based

Effectively attack memory wall problem

Highly multithreaded

Successfully hide large latencies and contentions

Heterogeneous, Supports Extended Memory Semantics (EMS)

Extremely low overhead on context switch and synchronization

Page 4: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Multithreaded Processors

Multithreading reduces the processor idle time

Thread context is part of the processor

Multithreading Machines

1960s CDC 6600

1970s I/O Processor for the Space Shuttle

1980s Denelcor HEP

1990s Cray/Tera MTA

2000+ Cray Eldorado

2000+ Intel Xeon

2000+ Sun NiagaraSingle Threaded Multithreaded

Page 5: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Lightweight Threads

Thread context (frame) is 32 double words (256 bytes) Two double words are reserved for the thread status; 30 general

purpose registers.

No other per thread state, easy for multithreading .

Frames are stored in memory (No Register File) Registers are aliases for memory locations

Page 6: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Lightweight Multithreading

Thread creation is fast and inexpensive - single instruction Contrast with pthread creation - kernel intervention and as

many as 10,000’s of instructions

Unbounded Multithreading

Threads are part of the memory system rather than the processor state.

“Unlimited number” of threads per processor. Many opportunities for issuing an instruction.

Ultra-lightweight Processing Unbounded Multithreading requires low overhead thread

management and synchronization At the memory bank, Greater data bandwidth, Low overhead

Page 7: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Heterogeneous Architecture

Heterogeneous Architecture Lightweight Processor Chip (LPC)

Issue instruction from ready threads on each clock cycle

Architectural support for low overhead thread management

Page 8: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Extended Memory Semantics (EMS)

Memory subsystem is constructed of 65 bit dwords 64 bits of data 1 extension bit; 1: dword is Full, 0: dword is empty

Extends Cray MTA E/F bits Full/Empty: Contains data or not Extra states: Metadata can contain frame pointer

Same semantics apply to thread registers

64 bits of data/metadata

Extension bit

Page 9: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Single Producer/ Consumer on EMS

LWP behavior for load_fe with A empty.

Location A changes state to “FVE: forward value, leave empty”

Content of A is the target address of the forward operation (all registers also have a memory address).

Page 10: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Completing the Load

How does the LWP complete the load_fe?

store_ef arrives at A

Data associated with store is returned to T2:R2 – this completes the load_fe

Location A changes to the empty state.

Page 11: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

A More Complex Situation

Consider a multiple producer/consumer problem such as locks. Multiple threads (more than 3) all attempt to acquire the lock.

Memory requests will be queued up at the target location

EMS handler thread needed to handle the bookkeeping

Page 12: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

EMS Handler Overhead

Invoking a EMS handler Synchronized memory operations beyond the hardware

supported single producer/consumer scenario

Overhead Creating the handler threads

To queue up memory requests, handlers need to spin on the target memory address to get exclusive access

Significant overhead on LWP CPU time, NoC traffic and memory bandwidth

How to alleviate the overhead?

Page 13: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Ultra-Lightweight Processor

Alleviate burden from LWP

For thread synchronization and management, Complex atomic memory operations

Simple design, Minimal circuitry

At the memory bank, Greatest data bandwidth (wide-word), no NoC traffic when accessing memory.

Multithreaded

Page 14: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Large-scale system

Large-scale system

Page 15: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Outline

Heterogeneous Lightweight Multithreaded Architecture

Simulation environments, benchmarks and results

Conclusion and future work

Page 16: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Simulation Environment

DimC – Diminished C

- An extension of the ANSI C

- Expose low level architectural features

- Support lightweight multithreading

SALT -Simulator for the Analysis of LWP Timings

-Contains LWPs, ULWPs, NoC and memory subsystems.

Page 17: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Benchmark Suite

Two categories of irregular problems. Complicated control structures such as recursion.

Such programs can achieve decent performance on conventional architectures but need great effort.

Not necessarily Invoking EMS handler or ULWP

N-Queens, Fibonacci

Complicated control structures and dynamic data structures

Very hard to parallelize effectively on conventional SMPs.

EMS handler or ULWP support is necessary

Competing agents, SAT solver kernel

Page 18: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

N-Queens

Find all solutions to the problem of placing N queens on an N*N chessboard such that no queen can attack another.

Irregular problems with dynamic parallel recursion ,

Thread behavior is hard to predict.

Page 19: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Competing Agents

Multiple agents attempt to update a shared memory location simultaneously

Each agent is implemented by a single thread. All threads are evenly distributed over four LWPs inside a single LPC

Complicated control structures and dynamic data structures

Using separate synchronized load/stores

To characterize the effectiveness of the ULWP in reducing the cost of synchronization.

Page 20: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

SAT Solver/zChaff

SAT-Boolean satisfiability problem (from propositional

logic) fundamental to many problems in automated reasoning, CAD,

CAM, machine vision, database, robotics, IC design, computer architecture, and network design.

Given a boolean formula (usually in CNF) , check whether an assignment of boolean truth values to the variables in the formula exists, such that the formula evaluates to true.

For example, the CNF formula, x1 is true and x3 is false, then all three clauses are satisfied,regardless of the value of x2.

zChaff , the modern variants of the DPLL algorithm, is used to implement SAT solver.

21 3 1 3( ) ( ) ( )X X X X X

Page 21: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

N-Queens

Successfully deploy all the parallelism

Completely dynamic, Ideal speedup

Saturation is only due to small data set

Good performance can be achieved on conventional SMPs but need great extra effort

Page 22: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Competing Agents

EMS handler is the bottleneck in high contention situation

Heterogeneous architecture can achieve unbounded scalability

High contention is not a problem any more in the heterogeneous architecture

Page 23: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

SAT Solver/zChaff on Conventional SMPs

Parallel implementation lead to performance degeneration

The more processors, the worse performance

Very hard to achieve good performance on conventional SMPs

Data from Parallel Multithreaded Satisfiability Solver: Design and Implementation By Yulik Feldman, etc. @ Intel

Page 24: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

SAT Solver/zChaff on Heterogeneous architecture

Ideal speedup

saturation is only due to small data set

Successfully deployed all the parallelism

Speedup Speedup Over serial version

Page 25: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Outline

Heterogeneous Lightweight Multithreaded Architecture

Simulation environments, benchmarks and results

Conclusions and future work

Page 26: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Conclusions

The Heterogeneous Lightweight Multithreaded Architecture is a good solution for irregular problem that are

hard/impossible to parallelize over conventional SMPs

Has very low overhead on context switching and synchronization

Can successfully hide latencies and contentions

Can provide unbounded multithreading and scalability Can deploy all possible parallelism inside an irregular problem

Page 27: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Future Work

Provide standard language support

Benchmark suites

Large-scale system performance

Comparison with conventional large-scale systems

Page 28: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Acknowledgments

DARPA This material is based upon work supported by the Defense

Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003.

University of Notre Dame Caltech/JPL Cray

Page 29: A Heterogeneous Lightweight Multithreaded Architecture Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block

MTAAP 2007, Long Beach, CA

Thank you!