TRANSCRIPT
Computer Science and Engineering
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Wael Kdouh, Spring 2006
Professor: Dr. Hisham El Rewini
Motivation
Economic: high demand for OLTP (on-line transaction processing) machines
Disconnect between the ILP focus of processor design and this demand
OLTP workloads:
-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP
OLTP goes unserved by aggressive ILP machines
Use "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and short time to market
Short wires, as opposed to costly and slow long wires that can affect cycle time
Amdahl's Law (see the note below)
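The Amdahl's Law point can be read as the standard formulation: if a fraction f of the work is parallelizable across N cores,

    \mathrm{Speedup}(N) = \frac{1}{(1 - f) + f / N}

so workloads with large TLP (f close to 1) reward many modest cores more than a single aggressive ILP core.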
Other Innovations
The design of the shared second-level cache uses a sophisticated protocol that does not enforce inclusion in the first-level instruction and data caches, in order to maximize the utilization of the on-chip caches.
The cache coherence protocol among nodes incorporates a number of unique features that result in fewer protocol messages and lower protocol engine occupancies compared to other designs.
It has a unique I/O architecture, with an I/O node that is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.
The Piranha Processing Node
Separate I/D L1 caches (64 KB, 2-way set-associative) for each CPU. Logically shared, interleaved L2 cache (1 MB).
Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips. Aggregate maximum bandwidth of 12.8 GB/sec (sanity check below).
180 nm process (2000); almost entirely ASIC design: roughly 50% of the clock speed and 200% of the area of a full-custom methodology.
CPU: a single-issue, in-order Alpha core with an 8-stage pipeline at 500 MHz.
Intra-Chip Switch (ICS): a unidirectional crossbar.
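As a sanity check on the bandwidth figure (assuming the roughly 1.6 GB/sec per Rambus channel typical of the era, which the slide does not state): 8 controllers × 1.6 GB/sec per channel = 12.8 GB/sec aggregate.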
Communication Assist
+ Home Engine (Exporting) and Remote Engine (Importing) support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, and Switch form the standard interconnect which links multiple Piranha chips
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and modules
THERE IS NO INHERENT I/O CAPABILITY.
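A minimal sketch of the Home Engine / Remote Engine split described above, assuming a hypothetical address-interleaved home-node mapping (the actual Piranha mapping is not given here): requests for memory this node exports go to the Home Engine, while requests for memory imported from other nodes go to the Remote Engine.

    # Illustrative Python sketch only; home_node() below is a hypothetical interleaving.
    NUM_NODES = 4
    LINE_BYTES = 64

    def home_node(addr: int) -> int:
        return (addr // LINE_BYTES) % NUM_NODES

    def steer(addr: int, my_node: int) -> str:
        if home_node(addr) == my_node:
            return "home engine"      # export: this node is the line's home
        return "remote engine"        # import: request the line from its home node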
I/O Organization
Smaller than the processing node
Router has only 2 links, which alleviates the need for a routing table
Memory is globally visible and part of the coherence scheme
CPU placement is optimized for drivers, translations, etc., which need low-latency access to I/O
Re-used data-L1 cache design provides the interface to PCI/X
Supports an arbitrary I/O-to-processing-node ratio and network topology
Glueless scaling up to 1024 nodes of any type supports application-specific customization
Piranha System
Coherence: Local
The L2 bank and its associated controller contain the directory data for intra-chip requests (a centralized directory)
The ICS is responsible for all on-chip communication
L2 is "non-inclusive"
The L2 behaves as a "large victim buffer" for the L1s, and keeps duplicate copies of the L1 tags and state
The L2 controller can therefore determine whether data is cached remotely and, if so, whether exclusively; the majority of L1 requests then require no CA assist
On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch it from memory (see the sketch below)
On forwards, the L2 blocks conflicting requests
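A rough sketch of that L2 decision, using assumed names and structures (not the actual Piranha controller logic): the bank consults its own tags, the duplicated L1 tags, and the directory state before choosing a response.

    # Hypothetical Python sketch of dispatching an incoming L1 request at an L2 bank.
    from enum import Enum, auto

    class Action(Enum):
        SERVICE_FROM_L2 = auto()             # data present in this L2 bank
        FORWARD_TO_OWNER_L1 = auto()         # another on-chip L1 holds the line (L2 is non-inclusive)
        FORWARD_TO_PROTOCOL_ENGINE = auto()  # line is cached off-chip; involve the CA
        FETCH_FROM_MEMORY = auto()           # no cached copy; go to the memory controller

    def handle_l1_request(addr, l2_tags, dup_l1_tags, cached_remotely):
        if addr in l2_tags:
            return Action.SERVICE_FROM_L2
        if addr in dup_l1_tags:
            return Action.FORWARD_TO_OWNER_L1
        if cached_remotely(addr):
            return Action.FORWARD_TO_PROTOCOL_ENGINE
        return Action.FETCH_FROM_MEMORY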
Coherence: Global
Trades ECC granularity for "free" directory storage: computing ECC at 4x the usual granularity leaves 44 bits per 64-byte line (see the arithmetic check below)
Invalidation-based distributed directory protocol
Some optimizations
No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes. L (low priority): requests to the home node. H (high priority): forwarded requests and replies.
The protocol also guarantees that forwards are always serviced by their targets: e.g., an owner writing back to the home node holds the data until the home acknowledges.
This removes NACK/retry traffic, as well as "ownership change" messages (DASH), retry counts (Origin), and "no, seriously" requests (Token).
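A back-of-the-envelope check of the 44-bit figure, assuming standard SECDED codes computed at 256-bit granularity (4x the usual 64-bit word); the exact code Piranha uses is not spelled out here.

    # Python: SECDED needs r check bits with 2^r >= data_bits + r + 1, plus one
    # extra parity bit for double-error detection.
    def secded_bits(data_bits: int) -> int:
        r = 0
        while (1 << r) < data_bits + r + 1:
            r += 1
        return r + 1

    line_bits = 64 * 8                                 # 64-byte cache line
    fine = (line_bits // 64) * secded_bits(64)         # 8 x 8  = 64 ECC bits at 64-bit granularity
    coarse = (line_bits // 256) * secded_bits(256)     # 2 x 10 = 20 ECC bits at 256-bit granularity
    print(fine - coarse)                               # 44 bits per line left over for directory state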
Evaluation Methodology
Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
Simulated and compared to the performance of an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
“Fudged” for full-custom effect
Four evaluations: P1 (single-CPU Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order core), and P8 (the full eight-CPU Piranha chip)
Parameters for different processor designs.
Results
Performance Evaluation
OLTP and DSS workloads: TPC-B/D on an Oracle database, run in the SimOS-Alpha environment
Compared: Piranha (P8) @ 500 MHz, full-custom Piranha (P8F) @ 1.25 GHz, and a next-generation microprocessor (OOO) @ 1 GHz
Single-chip evaluation: OOO outperforms P1 (an individual Piranha CPU) by 2.3x; P8 outperforms OOO by 3x; speedup of P8 over P1 = 7x (consistency check below)
Multi-chip configurations: four chips (with only 4 CPUs per chip ?!); results show that Piranha scales better than OOO
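Note that the single-chip numbers are mutually consistent: 2.3 × 3 ≈ 6.9, i.e., roughly the quoted 7x advantage of P8 over P1.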
Questions/Discussion
Evaluation methodology?
Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
Is reliability better or worse with multiple processors per chip?
Power consumption?
The authors maintain that:
1) The use of chip multiprocessing is inevitable in future microprocessor designs.
2) As more transistors become available, further increasing on-chip cache sizes or building more complex cores will only lead to diminishing performance gains and possibly longer design cycles.
Given the enormous emphasis that Intel engineers are placing on massive L2 caches, they appear to disagree.
Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that the Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.
Conclusion
No more penguins to eat…
The Future
Harvey G. Cragon, in his paper "Forty Five Years of Computer Architecture—All That's Old is New Again," finds that most of the performance-improvement advances in computer microarchitecture have been based on the exploitation of only two ideas: locality and pipelining.
In my personal opinion, the upcoming years are going to exploit two ideas: SMT and CMP.
Questions