TRANSCRIPT
Computer Science and Engineering
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Wael Kdouh, Spring 2006
Professor: Dr. Hisham El Rewini
Motivation
Economic: high demand for OLTP (on-line transaction processing) machines
Disconnect between the ILP focus of processor design and this demand
OLTP workloads:
-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP
OLTP goes unserved by aggressive ILP machines
Use "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and short time to market
Short wires, as opposed to costly and slow long wires that can affect cycle time
Amdahl's Law (see the note below)
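The Amdahl's Law point can be read as the standard formulation: if a fraction f of the work is parallelizable across N cores,

    \mathrm{Speedup}(N) = \frac{1}{(1 - f) + f / N}

so workloads with large TLP (f close to 1) reward many modest cores more than a single aggressive ILP core.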
Other Innovations
The design of the shared second-level cache uses a sophisticated protocol that does not enforce inclusion in the first-level instruction and data caches, in order to maximize the utilization of the on-chip caches.
The cache coherence protocol among nodes incorporates a number of unique features that result in fewer protocol messages and lower protocol engine occupancies compared to other designs.
It has a unique I/O architecture, with an I/O node that is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.
The Piranha Processing Node
Separate I/D L1 caches (64 KB, 2-way set-associative) for each CPU. Logically shared, interleaved L2 cache (1 MB).
Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips. Aggregate maximum bandwidth of 12.8 GB/sec (sanity check below).
180 nm process (2000); almost entirely ASIC design: roughly 50% of the clock speed and 200% of the area of a full-custom methodology.
CPU: a single-issue, in-order Alpha core with an 8-stage pipeline at 500 MHz.
Intra-Chip Switch (ICS): a unidirectional crossbar.
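As a sanity check on the bandwidth figure (assuming the roughly 1.6 GB/sec per Rambus channel typical of the era, which the slide does not state): 8 controllers × 1.6 GB/sec per channel = 12.8 GB/sec aggregate.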
Communication Assist
+ Home Engine (Exporting) and Remote Engine (Importing) support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, and Switch form the standard interconnect which links multiple Piranha chips
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and modules
THERE IS NO INHERENT I/O CAPABILITY.
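A minimal sketch of the Home Engine / Remote Engine split described above, assuming a hypothetical address-interleaved home-node mapping (the actual Piranha mapping is not given here): requests for memory this node exports go to the Home Engine, while requests for memory imported from other nodes go to the Remote Engine.

    # Illustrative Python sketch only; home_node() below is a hypothetical interleaving.
    NUM_NODES = 4
    LINE_BYTES = 64

    def home_node(addr: int) -> int:
        return (addr // LINE_BYTES) % NUM_NODES

    def steer(addr: int, my_node: int) -> str:
        if home_node(addr) == my_node:
            return "home engine"      # export: this node is the line's home
        return "remote engine"        # import: request the line from its home node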
I/O Organization
Smaller than the processing node
Router has only 2 links, which alleviates the need for a routing table
Memory is globally visible and part of the coherence scheme
CPU placement is optimized for drivers, translations, etc., which need low-latency access to I/O
Re-used data-L1 cache design provides the interface to PCI/X
Supports an arbitrary I/O-to-processing-node ratio and network topology
Glueless scaling up to 1024 nodes of any type supports application-specific customization
Piranha System
Coherence: Local
The L2 bank and its associated controller contain the directory data for intra-chip requests (a centralized directory)
The ICS is responsible for all on-chip communication
L2 is "non-inclusive"
The L2 behaves as a "large victim buffer" for the L1s, and keeps duplicate copies of the L1 tags and state
The L2 controller can therefore determine whether data is cached remotely and, if so, whether exclusively; the majority of L1 requests then require no CA assist
On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch it from memory (see the sketch below)
On forwards, the L2 blocks conflicting requests
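A rough sketch of that L2 decision, using assumed names and structures (not the actual Piranha controller logic): the bank consults its own tags, the duplicated L1 tags, and the directory state before choosing a response.

    # Hypothetical Python sketch of dispatching an incoming L1 request at an L2 bank.
    from enum import Enum, auto

    class Action(Enum):
        SERVICE_FROM_L2 = auto()             # data present in this L2 bank
        FORWARD_TO_OWNER_L1 = auto()         # another on-chip L1 holds the line (L2 is non-inclusive)
        FORWARD_TO_PROTOCOL_ENGINE = auto()  # line is cached off-chip; involve the CA
        FETCH_FROM_MEMORY = auto()           # no cached copy; go to the memory controller

    def handle_l1_request(addr, l2_tags, dup_l1_tags, cached_remotely):
        if addr in l2_tags:
            return Action.SERVICE_FROM_L2
        if addr in dup_l1_tags:
            return Action.FORWARD_TO_OWNER_L1
        if cached_remotely(addr):
            return Action.FORWARD_TO_PROTOCOL_ENGINE
        return Action.FETCH_FROM_MEMORY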
Coherence: Global
Trades ECC granularity for "free" directory storage: computing ECC at 4x the usual granularity leaves 44 bits per 64-byte line (see the arithmetic check below)
Invalidation-based distributed directory protocol
Some optimizations
No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes. L (low priority): requests to the home node. H (high priority): forwarded requests and replies.
The protocol also guarantees that forwards are always serviced by their targets: e.g., an owner writing back to the home node holds the data until the home acknowledges.
This removes NACK/retry traffic, as well as "ownership change" messages (DASH), retry counts (Origin), and "no, seriously" requests (Token).
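A back-of-the-envelope check of the 44-bit figure, assuming standard SECDED codes computed at 256-bit granularity (4x the usual 64-bit word); the exact code Piranha uses is not spelled out here.

    # Python: SECDED needs r check bits with 2^r >= data_bits + r + 1, plus one
    # extra parity bit for double-error detection.
    def secded_bits(data_bits: int) -> int:
        r = 0
        while (1 << r) < data_bits + r + 1:
            r += 1
        return r + 1

    line_bits = 64 * 8                                 # 64-byte cache line
    fine = (line_bits // 64) * secded_bits(64)         # 8 x 8  = 64 ECC bits at 64-bit granularity
    coarse = (line_bits // 256) * secded_bits(256)     # 2 x 10 = 20 ECC bits at 256-bit granularity
    print(fine - coarse)                               # 44 bits per line left over for directory state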
Evaluation Methodology
Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
Simulated and compared to the performance of an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
“Fudged” for full-custom effect
Four evaluations: P1 (single-CPU Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order core), and P8 (the full eight-CPU Piranha chip)
Parameters for different processor designs.
Results
Performance Evaluation
OLTP and DSS workloads: TPC-B/D on an Oracle database, run in the SimOS-Alpha environment
Compared: Piranha (P8) @ 500 MHz, full-custom Piranha (P8F) @ 1.25 GHz, and a next-generation microprocessor (OOO) @ 1 GHz
Single-chip evaluation: OOO outperforms P1 (an individual Piranha CPU) by 2.3x; P8 outperforms OOO by 3x; speedup of P8 over P1 = 7x (consistency check below)
Multi-chip configurations: four chips (with only 4 CPUs per chip ?!); results show that Piranha scales better than OOO
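Note that the single-chip numbers are mutually consistent: 2.3 × 3 ≈ 6.9, i.e., roughly the quoted 7x advantage of P8 over P1.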
Questions/Discussion
Evaluation methodology?
Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
Is reliability better or worse with multiple processors per chip?
Power consumption?
The authors maintain that:
1) The use of chip multiprocessing is inevitable in future microprocessor designs.
2) As more transistors become available, further increasing on-chip cache sizes or building more complex cores will only lead to diminishing performance gains and possibly longer design cycles.
Given the enormous emphasis that Intel engineers are placing on massive L2 caches, they appear to disagree.
Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that the Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.
Conclusion
No more penguins to eat…
The Future
Harvey G. Cragon, in his paper "Forty Five Years of Computer Architecture—All That's Old is New Again," finds that most of the performance-improvement advances in computer microarchitecture have been based on the exploitation of only two ideas: locality and pipelining.
In my personal opinion, the upcoming years are going to exploit two ideas: SMT and CMP.
Questions