64-bit Scalable Chip Multiprocessor
(SCMP)
Tongji University
Why SCMP?
Memory access latency is the bottleneck
TLP (thread-level parallelism) is the trend
Flexible: scalable from 1 to 4, up to 16 cores
The CPU core is small and simple, easier to verify
Higher throughput
Improved wafer utilization
How SCMP?
Full custom 64-bit CPU core
On-chip switch
L2 cache and controller
Hardware thread scheduler
SCMP Block Diagram
[Diagram: four CPU cores, each with an integer unit, FPU, four register files (RF), an L1 instruction cache (I$), and an L1 data cache (D$), connected through a non-blocking crossbar switch to a multi-bank L2 cache; a thread scheduler, a crypto coprocessor, and IO also attach to the switch.]
4-Core Architecture Features
Target application: server
4 multithreaded processor cores
4 MB multi-bank L2 cache
Non-blocking crossbar switch between cores and L2 cache banks
Directory-based cache coherency
Thread scheduler
Reconfigurable crypto-coprocessor
FB-DIMM memory controller (possibly)
Multi-thread Core Architecture
64-bit MIPS Instruction Set Architecture
4 threads, coarse-grained multithreading, only one thread active at a time
16 KB L1 instruction cache, 8 KB (or 16 KB) data cache
5-8 stage pipeline
Includes integer unit, floating-point unit, and L1 cache
64-bit CPU Core Features
Full custom:
ST 90 nm technology
High speed: 1 GHz
Low power consumption
Small die size
Robust
Used as a hard core
64-bit CPU Core Features
Multithreading:
Coarse-grained multithreading makes the core design easier: a small, simple core
Only one thread runs at a time; the bottleneck is memory access, so when a thread waits on memory, another is switched in
4 threads per core in total; memory latency is more severe in a conventional multiprocessor, and switching threads masks that latency
[Chart: the performance gap between processor and memory, and how multithreading and a multithreaded multiprocessor help mask it.]
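The switch-on-miss policy above can be sketched as follows; this is an illustrative model, and the `Core` class and its method names are assumptions, not part of the actual design:

```python
# Illustrative sketch of coarse-grained multithreading: one thread runs
# at a time, and the core switches threads only when the running thread
# stalls on a long-latency memory access (e.g. an L1 cache miss).

class Core:
    def __init__(self, num_threads=4):
        # One hardware context (register file) per thread; names are illustrative.
        self.threads = list(range(num_threads))
        self.current = 0  # index of the active thread

    def on_l1_miss(self):
        """Switch to the next ready thread to mask memory latency."""
        self.current = (self.current + 1) % len(self.threads)
        return self.current

core = Core()
assert core.on_l1_miss() == 1  # thread 0 missed, thread 1 now runs
assert core.on_l1_miss() == 2  # thread 1 missed, thread 2 now runs
```

Because the switch happens only on a miss, the pipeline stays simple: no per-cycle thread selection logic is needed, which matches the "small, simple core" goal above.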
On-chip interconnection
Crossbar:
Increases memory bandwidth: more than one core can access the L2 cache at a time
Enables higher L2 cache associativity
Easier switch design
Optimized for low latency
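A non-blocking crossbar lets every core reach a distinct L2 bank in the same cycle; only requests for the same bank conflict. A minimal arbitration sketch, with all names assumed for illustration:

```python
# Illustrative model of a non-blocking crossbar: in each cycle, every
# requested (core, bank) pair is granted as long as no two cores ask
# for the same bank; conflicting requests lose arbitration and retry.

def arbitrate(requests):
    """requests: dict mapping core id -> requested L2 bank id.
    Returns the set of granted core ids (first requester per bank wins)."""
    granted, taken_banks = set(), set()
    for core, bank in sorted(requests.items()):
        if bank not in taken_banks:
            granted.add(core)
            taken_banks.add(bank)
    return granted

# Cores 0 and 1 contend for bank 2; cores 2 and 3 hit distinct banks,
# so three of the four requests proceed in parallel.
assert arbitrate({0: 2, 1: 2, 2: 0, 3: 3}) == {0, 2, 3}
```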
L2 Cache
Multi-banked for higher bandwidth
Multiple memory interfaces to main memory
[Diagram: four L1 caches connected through the crossbar to four L2 banks (Bank 0-3), each bank with its own interface to main memory.]
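Banking raises bandwidth by spreading consecutive cache lines across banks. A hypothetical address-to-bank mapping (the 64-byte line size and bit positions are assumptions, not stated on the slide):

```python
# Hypothetical address-to-bank mapping for a 4-bank L2 cache with 64-byte
# lines: the two address bits just above the line offset select the bank,
# so consecutive cache lines spread across banks and can be served in parallel.

LINE_BYTES = 64  # assumed line size
NUM_BANKS = 4

def l2_bank(addr):
    return (addr // LINE_BYTES) % NUM_BANKS

# Four consecutive lines land in four different banks.
assert [l2_bank(a) for a in (0x000, 0x040, 0x080, 0x0C0)] == [0, 1, 2, 3]
assert l2_bank(0x100) == 0  # fifth line wraps back to bank 0
```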
Cache Coherency
Directory-based cache coherency:
Tracks which processors have copies of a block
Tracks the state of each data block in the L2 cache: Shared, Uncached, or Exclusive
[Diagram: each L2 bank (Bank 0-3) has an associated directory; the banks connect to the cores through the crossbar.]
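A directory entry per block records the state and the set of sharers. A minimal sketch using the three states named on the slide; the transition details (and the omitted write-back/invalidate messages) are illustrative, not the actual protocol:

```python
# Minimal sketch of a directory entry tracking the three states from the
# slide (Uncached, Shared, Exclusive) plus the set of sharing cores.

class DirectoryEntry:
    def __init__(self):
        self.state = "Uncached"
        self.sharers = set()  # ids of cores holding a copy

    def read(self, core):
        # A read puts/leaves the block in Shared and records the reader.
        # (A real protocol would first ask an Exclusive owner to write back.)
        self.state = "Shared"
        self.sharers.add(core)

    def write(self, core):
        # A write invalidates all other copies and makes the writer exclusive.
        self.state = "Exclusive"
        self.sharers = {core}

entry = DirectoryEntry()
entry.read(0); entry.read(2)
assert entry.state == "Shared" and entry.sharers == {0, 2}
entry.write(1)
assert entry.state == "Exclusive" and entry.sharers == {1}
```

Keeping one directory per L2 bank, as in the diagram, means coherency traffic for a block goes only to the bank that owns it, so the directories scale with the banks.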
Thread Scheduler
Dispatches threads: hardware logic coupled with the OS
Thread switch on L1 cache miss
Load balancing via hardware counters: L1 cache hits/misses, core pipeline idle cycles
Configures the crypto-coprocessor
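One way the counters above could drive load balancing is to dispatch a new thread to the core that has been idlest; this heuristic and all names in it are assumptions for illustration:

```python
# Illustrative load-balancing heuristic over the counters listed on the
# slide: per-core L1 miss counts and pipeline-idle cycles. A new thread
# is dispatched to the least-loaded core.

def pick_core(counters):
    """counters: list of (l1_misses, idle_cycles) per core.
    Prefer the core with the most idle cycles (least loaded),
    breaking ties by fewer L1 misses."""
    return max(range(len(counters)),
               key=lambda c: (counters[c][1], -counters[c][0]))

# Core 2 has the most idle cycles, so the next thread goes there.
assert pick_core([(10, 5), (3, 20), (7, 90), (1, 40)]) == 2
```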
Reconfigurable Crypto-coprocessor
Supports encryption and decryption for symmetric algorithms:
AES
DES, 3DES, GDES
RCx
Reconfigurable Crypto-coprocessor
Reconfiguration is controlled by the thread scheduler
[Diagram: the thread scheduler connects the crypto coprocessor to the four 4-thread full-custom cores.]
OS / Software
Uses a commercial operating system, such as Linux
Minimizes OS/compiler modification: almost no changes to the OS or compiler
Optimizing compiler improves machine-code efficiency
FB-DIMM (possibly)
Serial data path
Latency is managed with new channel features
Cost-effective
The server memory of the future
[Diagram: four FB-DIMM interfaces connecting the chip to main memory.]
Thank you!