64-bit Scalable Chip Multiprocessor
(SCMP)
Tongji University
Why SCMP?
Memory access latency is the bottleneck
TLP (thread-level parallelism) is the trend
Flexible: scalable from 1 to 4, up to 16 cores
The CPU core is small and simple, easier to verify
Higher throughput
Improved wafer utilization
How SCMP?
Full custom 64-bit CPU core
On-chip switch
L2 cache and controller
Hardware thread scheduler
SCMP Block Diagram
[Diagram: four CPU cores, each with an integer unit, FPU, four register files (RF), an L1 instruction cache (I$), and an L1 data cache (D$), connected through a non-blocking crossbar switch to a multi-bank L2 cache; a thread scheduler, a crypto coprocessor, and IO also attach to the switch.]
4-Core Architecture Features
Target application: server
4 multithreaded processor cores
4 MB multi-bank L2 cache
Non-blocking crossbar switch between cores and L2 cache banks
Directory-based cache coherency
Thread scheduler
Reconfigurable crypto-coprocessor
FB-DIMM memory controller (possibly)
Multi-thread Core Architecture
64-bit MIPS Instruction Set Architecture
4 threads, coarse-grained multithreading, only one thread active at a time
16 KB L1 instruction cache, 8 KB (or 16 KB) data cache
5-8 stage pipeline
Includes integer unit, floating-point unit, and L1 cache
64-bit CPU Core Features
Full custom:
ST 90 nm technology
High speed: 1 GHz
Low power consumption
Small die size
Robust
Used as a hard core
64-bit CPU Core Features
Multithreading:
Coarse-grained multithreading makes the core design easier: a small, simple core
Only one thread runs at a time; the bottleneck is memory access, so when a thread waits on memory, another is switched in
4 threads per core in total; memory latency is more severe in a conventional multiprocessor, and switching threads masks that latency
[Chart: the performance gap between processor and memory, and how multithreading and a multithreaded multiprocessor help mask it.]
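The switch-on-miss policy above can be sketched as follows; this is an illustrative model, and the `Core` class and its method names are assumptions, not part of the actual design:

```python
# Illustrative sketch of coarse-grained multithreading: one thread runs
# at a time, and the core switches threads only when the running thread
# stalls on a long-latency memory access (e.g. an L1 cache miss).

class Core:
    def __init__(self, num_threads=4):
        # One hardware context (register file) per thread; names are illustrative.
        self.threads = list(range(num_threads))
        self.current = 0  # index of the active thread

    def on_l1_miss(self):
        """Switch to the next ready thread to mask memory latency."""
        self.current = (self.current + 1) % len(self.threads)
        return self.current

core = Core()
assert core.on_l1_miss() == 1  # thread 0 missed, thread 1 now runs
assert core.on_l1_miss() == 2  # thread 1 missed, thread 2 now runs
```

Because the switch happens only on a miss, the pipeline stays simple: no per-cycle thread selection logic is needed, which matches the "small, simple core" goal above.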
On-chip interconnection
Crossbar:
Increases memory bandwidth: more than one core can access the L2 cache at a time
Enables higher L2 cache associativity
Easier switch design
Optimized for low latency
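A non-blocking crossbar lets every core reach a distinct L2 bank in the same cycle; only requests for the same bank conflict. A minimal arbitration sketch, with all names assumed for illustration:

```python
# Illustrative model of a non-blocking crossbar: in each cycle, every
# requested (core, bank) pair is granted as long as no two cores ask
# for the same bank; conflicting requests lose arbitration and retry.

def arbitrate(requests):
    """requests: dict mapping core id -> requested L2 bank id.
    Returns the set of granted core ids (first requester per bank wins)."""
    granted, taken_banks = set(), set()
    for core, bank in sorted(requests.items()):
        if bank not in taken_banks:
            granted.add(core)
            taken_banks.add(bank)
    return granted

# Cores 0 and 1 contend for bank 2; cores 2 and 3 hit distinct banks,
# so three of the four requests proceed in parallel.
assert arbitrate({0: 2, 1: 2, 2: 0, 3: 3}) == {0, 2, 3}
```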
L2 Cache
Multi-banked for higher bandwidth
Multiple memory interfaces to main memory
[Diagram: four L1 caches connected through the crossbar to four L2 banks (Bank 0-3), each bank with its own interface to main memory.]
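Banking raises bandwidth by spreading consecutive cache lines across banks. A hypothetical address-to-bank mapping (the 64-byte line size and bit positions are assumptions, not stated on the slide):

```python
# Hypothetical address-to-bank mapping for a 4-bank L2 cache with 64-byte
# lines: the two address bits just above the line offset select the bank,
# so consecutive cache lines spread across banks and can be served in parallel.

LINE_BYTES = 64  # assumed line size
NUM_BANKS = 4

def l2_bank(addr):
    return (addr // LINE_BYTES) % NUM_BANKS

# Four consecutive lines land in four different banks.
assert [l2_bank(a) for a in (0x000, 0x040, 0x080, 0x0C0)] == [0, 1, 2, 3]
assert l2_bank(0x100) == 0  # fifth line wraps back to bank 0
```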
Cache Coherency
Directory-based cache coherency:
Tracks which processors have copies of a block
Tracks the state of each data block in the L2 cache: Shared, Uncached, or Exclusive
[Diagram: each L2 bank (Bank 0-3) has an associated directory; the banks connect to the cores through the crossbar.]
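A directory entry per block records the state and the set of sharers. A minimal sketch using the three states named on the slide; the transition details (and the omitted write-back/invalidate messages) are illustrative, not the actual protocol:

```python
# Minimal sketch of a directory entry tracking the three states from the
# slide (Uncached, Shared, Exclusive) plus the set of sharing cores.

class DirectoryEntry:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()  # ids of cores holding a copy

    def read(self, core):
        # A read puts/leaves the block in Shared and records the reader.
        # (A real protocol would first ask an Exclusive owner to write back.)
        self.state = "Shared"
        self.sharers.add(core)

    def write(self, core):
        # A write invalidates all other copies and makes the writer exclusive.
        self.state = "Exclusive"
        self.sharers = {core}

entry = DirectoryEntry()
entry.read(0); entry.read(2)
assert entry.state == "Shared" and entry.sharers == {0, 2}
entry.write(1)
assert entry.state == "Exclusive" and entry.sharers == {1}
```

Keeping one directory per L2 bank, as in the diagram, means coherency traffic for a block goes only to the bank that owns it, so the directories scale with the banks.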
Thread Scheduler
Dispatches threads: hardware logic coupled with the OS
Thread switch on L1 cache miss
Load balancing via hardware counters: L1 cache hits/misses, core pipeline idle cycles
Configures the crypto-coprocessor
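One way the counters above could drive load balancing is to dispatch a new thread to the core that has been idlest; this heuristic and all names in it are assumptions for illustration:

```python
# Illustrative load-balancing heuristic over the counters listed on the
# slide: per-core L1 miss counts and pipeline-idle cycles. A new thread
# is dispatched to the least-loaded core.

def pick_core(counters):
    """counters: list of (l1_misses, idle_cycles) per core.
    Prefer the core with the most idle cycles (least loaded),
    breaking ties by fewer L1 misses."""
    return max(range(len(counters)),
               key=lambda c: (counters[c][1], -counters[c][0]))

# Core 2 has the most idle cycles, so the next thread goes there.
assert pick_core([(10, 5), (3, 20), (7, 90), (1, 40)]) == 2
```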
Reconfigurable Crypto-coprocessor
Supports encryption and decryption for symmetric algorithms:
AES
DES, 3DES, GDES
RCx
Reconfigurable Crypto-coprocessor
Reconfiguration is controlled by the thread scheduler
[Diagram: the thread scheduler connects the crypto coprocessor to the four 4-thread full-custom cores.]
OS / Software
Uses a commercial operating system, such as Linux
Minimizes OS/compiler modification: almost no changes to the OS or compiler
Optimizing compiler improves machine-code efficiency
FB-DIMM (possibly)
Serial data path
Latency is managed with new channel features
Cost-effective
The server memory of the future
[Diagram: four FB-DIMM interfaces connecting the chip to main memory.]
Thank you!