TRANSCRIPT

Slide 1: POWER6 Processor and Systems
Jim McInnes, [email protected]
Compiler Optimization, IBM Canada Toronto Software Lab
© 2007 IBM Corporation. All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only.

Slide 2: Role
- Technical leader in the Compiler Optimization Team
- Focal point to the hardware development team
- Member of the Power ISA Architecture Board
- For each new microarchitecture, I help direct the design toward helpful features
- Design and deliver specific compiler optimizations to enable hardware exploitation

Slide 3: POWER5 Chip Overview
- High-frequency dual-core chip
- 8 execution units: 2 LS, 2 FP, 2 FX, 1 BR, 1 CR
- 1.9 MB on-chip shared L2: point of coherency, 3 slices
- On-chip L3 directory and controller
- On-chip memory controller
Technology and chip stats: 130 nm lithography, Cu, SOI; 276M transistors; 389 mm²; I/Os: 2313 signal, 3057 power/ground.

Slide 4: POWER6 Chip Overview
- Ultra-high-frequency dual-core chip
- 8 execution units: 2 LS, 2 FP, 2 FX, 1 BR, 1 VMX
- 2 x 4 MB on-chip L2: point of coherency, 4 quads
- On-chip L3 directory and controller
- Two on-chip memory controllers
- Full error checking and recovery (RU)
Technology and chip stats: CMOS 65 nm lithography, SOI Cu; 790M transistors; 341 mm² die; I/Os: 1953 signal, 5399 power/ground.
(Die floorplan: each core comprises IFU, LSU, SDU, FXU, RU, VMX, and FPU; the L2 quads, L2 and L3 controllers, memory controllers, GX, and fabric bus controller surround the cores.)

Slide 5: POWER4 / POWER5 / POWER6
(Block diagrams comparing the three generations: POWER4 with two cores, shared L2, L3 controller, distributed switch, and GX bus; POWER5 adding an on-chip memory controller and enhanced distributed switch; POWER6 with two cores, AltiVec units, two 4 MB L2s, two memory controllers, L3 controllers, GX bus controller, and fabric bus controller.)

Slide 6: POWER6 I/O: Speeds and Feeds
- I/O interface: 4B read, 4B write, at 2-8:1 of the processor clock
- On-node fabric buses (3 pairs): 2B/bc or 8B/bc per unidirectional pair
- Off-node fabric buses (2 pairs): 4B/bc or 8B/bc per unidirectional pair
- Buses scale at 2:1 with core frequency
- L2 reload: 32B/pc; total L2 on chip: 8 MB; 64 KB L1 D-cache
- Off-chip L3: 32 MB (2 x 16 MB); L3 buses: 16B/bc read and 16B/bc write, each split into two 8B buses (may instead be a single 32 MB L3 chip with 8B buses)
- Memory port (1 of 2): 8B read (4 x 2B), 4B write (4 x 1B); DRAM connected by up to 4 channels of 533-800 MHz DDR2 DIMMs (channels run at 4x DRAM frequency)
- pc = processor clock, bc = bus clock; 2 pc = 1 bc (see the worked example below)
(Diagram: two SMT cores with their L2 quads, L2/L3 controllers and directories, two memory controllers, I/O controller, and fabric switch.)
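The widths above are quoted per processor clock (pc) or per bus clock (bc), so peak rates follow once a frequency is plugged in. A minimal sketch of that arithmetic, assuming a purely illustrative 4.7 GHz processor clock (a number not taken from this deck) and the slide's 2 pc = 1 bc ratio:

```c
/* Peak bandwidths implied by the "speeds and feeds" slide.
 * Assumption (not from the deck): a 4.7 GHz processor clock, used only
 * to make the per-clock widths concrete.  The slide gives 2 pc = 1 bc. */
#include <stdio.h>

int main(void)
{
    const double pc = 4.7e9;       /* processor clock, illustrative */
    const double bc = pc / 2.0;    /* bus clock: 2 pc = 1 bc */

    printf("L2 reload (32B/pc): %.0f GB/s\n", 32.0 * pc / 1e9);
    printf("L3 read   (16B/bc): %.0f GB/s\n", 16.0 * bc / 1e9);
    printf("L3 write  (16B/bc): %.0f GB/s\n", 16.0 * bc / 1e9);
    return 0;
}
```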
Slide 7: POWER6 Core Pipelines
(Pipeline diagram.)

Slide 8: POWER5 vs. POWER6 Pipeline Comparison
(Pipeline comparison diagram.)

Slide 9: Side-by-Side Comparison
- Style: POWER5+ uses general out-of-order execution; POWER6 is mostly in-order with special-case out-of-order execution.
- Units: POWER5+ has 2 FX, 2 LS, 2 FP, 1 BR, 1 CR; POWER6 has 2 FX, 2 LS, 2 FP, 1 BR/CR, 1 VMX.
- Threading: POWER5+ runs 2 SMT threads with alternate ifetch and alternate dispatch (up to 5 instructions); POWER6 runs 2 SMT threads with priority-based dispatch and simultaneous dispatch from two threads (up to 7 instructions).

Slide 10: Side-by-Side Comparison (caches and memory)
- L1 I-cache capacity, associativity: POWER5+ 64 KB, 2-way; POWER6 64 KB, 4-way
- L1 D-cache capacity, associativity: POWER5+ 32 KB, 4-way; POWER6 64 KB, 8-way
- L2 cache (point of coherency) capacity, line size: POWER5+ 1.9 MB, 128 B line; POWER6 2 x 4 MB, 128 B line
- L2 associativity, replacement: POWER5+ 10-way, LRU; POWER6 8-way each, LRU
- Off-chip L3 capacity, line size: POWER5+ 36 MB, 256 B line; POWER6 32 MB, 128 B line
- L3 associativity, replacement: POWER5+ 12-way, LRU; POWER6 16-way, LRU
- Memory: POWER5+ 4 TB maximum, memory bus at 2x DRAM frequency; POWER6 8 TB maximum, memory bus at 4x DRAM frequency

Slide 11: Latency Profiles
(Chart of load-to-use latency for L2, L3, and local memory.)

Slide 12: Processor Core Functional Features
- PowerPC AS architecture
- Simultaneous multithreading (2 threads)
- Decimal floating point (extension to the PowerPC ISA)
- 48-bit real address support
- Virtualization support (1024 partitions)
- Concurrent support of 4 page sizes (4 KB, 64 KB, 16 MB, 16 GB)
- New storage key support
- Bi-endian support
- Dynamic power management
- Single-bit error detection on all dataflow
- Robust error recovery (R unit)
- CPU sparing support (dynamic CPU swapping)
- Common core for i, p, and Blade

Slide 13: Achieving High Frequency: the POWER6 13 FO4 Challenge
Circuit design:
- 1 FO4 = delay of 1 inverter that drives 4 receivers
- 1 logical gate = 2 FO4
- 1 cycle = latch + function + wire = 3 FO4 + function + 4 FO4
- Function = 6 FO4 = 3 gates
Integration:
- It takes 6 cycles to send a signal across the core
- Communication between units takes 1 cycle using good wire
- Control across a 64-bit dataflow takes a cycle

Slide 14: Sacrifices Associated with High Frequency
- In-order design instead of out-of-order
- Longer pipelines
- FP-compare latency
- FX multiply done in the FPU
- Memory bandwidth reduced, especially store-queue draining bandwidth
- Mitigating this: the SMT implementation is improved

Slide 15: In-Order vs. Out-of-Order
(Diagram: both designs fetch and dispatch in order. With out-of-order execution, instructions sit in instruction queues ahead of the functional units and issue when their operands become available; with in-order execution, dispatch feeds the functional units directly and a store queue (STQ) buffers stores. In both cases operands must be available before execution. Example sequence: Load R2, ...; Addi R2,R2,1; Load R3, ...)

Slide 16: P5 vs. P6 Instruction Groups
- Both machines are superscalar: instructions are formed into groups and dispatched together.
- Two big differences between P5 and P6:
  - On P5 there are few restrictions on what types of instructions, and how many, can be grouped together; on P6 each group must fit in the available resources (2 FP, 2 FX, 2 load/store, ...).
  - On P5, instructions are placed in queues to wait for their operands to be ready; on P6, dispatch must wait until all operands are ready for the whole group.
- On P6, cycles are shorter, but there are more stall cycles and more partial instruction groups.

Slide 17: Strategic NOP Insertion
- ORI 1,1,0 terminates a dispatch group early.
- Suppose at cycle 10 the current group contains FMA, FMA, LFL, all ready to go, and the only other available instruction is an LFL that will not be ready until cycle 15.
- Including that LFL holds up the first three, which could be harmless or catastrophic.
- Sometimes NOPs are inserted to fix this, as in the sketch below.
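A minimal sketch of forcing a group break from C with the no-op form named on the slide. The helper name is invented for illustration; the instruction just ORs r1 with zero, so it has no architectural effect, and whether inserting it actually helps depends on how the compiler schedules the surrounding code.

```c
/* Group-terminating no-op from the slide: "ori 1,1,0" leaves r1 (the
 * stack pointer) unchanged but ends the current POWER6 dispatch group
 * early, so ready instructions are not held back by a later one whose
 * operands are not yet available.  Hypothetical helper, for illustration. */
static inline void end_dispatch_group(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
    __asm__ volatile("ori 1,1,0");
#endif
}
```

In practice the compiler's scheduler decides where such NOPs pay off; hand-inserted ones are only worth considering in hot, carefully measured loops.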
Slide 18: FP Compares
- An FP compare has an 11-cycle latency to its use.
- This is a significant delay; the best defence is for the programmer to work around it, arranging lots of work between a compare and the point where its result is used.
- Several branch-free sequences are available for such code; one is sketched below.
- The compiler will in some cases try to replace FP compares with FX compares.
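A minimal sketch of one such branch-free sequence, written here for illustration rather than taken from the talk: a PowerPC compiler can turn this ternary into an fsel (floating select) rather than an FP compare feeding a conditional branch, though strict NaN semantics may require relaxed FP options before it does so.

```c
/* Branch-free select: hi if x >= 0.0, otherwise lo.  Illustrative only;
 * on POWER this shape maps naturally onto fsel, so no condition-register
 * result has to travel from the compare to a branch. */
static inline double select_nonneg(double x, double hi, double lo)
{
    return (x >= 0.0) ? hi : lo;
}
```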
Slide 19: Store Queue Optimizations
- On POWER6, stores are placed into a queue to await completion.
- The queue can drain one store every other cycle.
- If the next store in the queue is to storage consecutive with the first, both can go in that cycle.
- A 2 x 8-byte FP store is provided to make this more convenient (see the sketch below).
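A minimal sketch of the ordering idea: issuing the two 8-byte stores to adjacent addresses back to back gives the store queue consecutive entries that can drain together, and gives the compiler the opportunity to fuse them into the paired FP store the slide mentions. The function and its names are illustrative.

```c
#include <stddef.h>

/* Write adjacent doubles in address order so successive store-queue
 * entries are to consecutive storage and can drain in the same cycle.
 * Illustrative only; n is assumed to be even. */
void scale_into(double *restrict dst, const double *restrict src,
                size_t n, double k)
{
    for (size_t i = 0; i < n; i += 2) {
        dst[i]     = k * src[i];      /* 8-byte store ...              */
        dst[i + 1] = k * src[i + 1];  /* ... then the adjacent 8 bytes */
    }
}
```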
Slide 20: Simultaneous Multithreading (SMT)
- P6 operates in two modes: ST (single thread) and SMT (multithreaded).
- In SMT mode, two independent threads execute simultaneously, possibly from the same parallel program.
- Instructions from both threads can dispatch in the same group, subject to unit availability.
- This is highly profitable on P6: it is a good way to fill otherwise empty machine cycles and achieve better resource usage.

Slide 21: FPU Details
- Two binary floating-point units and one decimal floating-point unit
- 7-stage FPU with bypass in 6 stages locally and 8 stages to the other FPU
- Interleaves execution of independent arithmetic instructions with divide/sqrt
- Handles fixed-point multiply/divide (pipelined)
- Always runs at or behind the other units
- 10-group-entry queues (group = 2 ldf and 2 fp/stf instructions), allowing pipelined issue of dependent store instructions
- 10-group-entry load target buffer, allowing speculative FP load execution
- Multiple interrupt modes: ignore, imprecise non-recoverable, imprecise recoverable, precise

Slide 22: FPU: More Detail
(Pipeline diagram: dispatch, format, issue queue, steer, dependence select, issue, and register-file read feeding align/multiply, add, shift, normalize, round, and writeback stages in each FPU; a load target buffer handles FP loads; store data is transmitted to the LSU; the GPR/agen/CA path serves loads and stores.)

Slide 23: Linpack HPC Performance
(Performance chart.)

Slide 24: Datastream Prefetching in POWER6
- Load-Look-Ahead (LLA) mode, triggered on a data cache miss to hide latency
- Datastream prefetching with no filtering
- Sixteen (16) Prefetch Request Queues (PRQs), sized for dual-thread performance (8 PRQs per thread)
- Prefetch 2 lines ahead to L1 and up to 24 to L2, demand paced
- Speculative L2 prefetch of the next line on a cache miss
- Depth/aggressiveness control via an SPR and dcbt instructions; dscr_ctl() / dscrctl change the O/S default
- Store prefetching into L2 (with hardware detection)
- Full complement of line and stream touches for both loads and stores (see the sketch below)
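A minimal sketch of a software line touch layered on top of the hardware streams, using the portable __builtin_prefetch builtin, which PowerPC compilers typically lower to dcbt or dcbtst; the function, stride, and prefetch distance are illustrative, not values from the talk.

```c
#include <stddef.h>

/* Touch lines ahead of a strided walk that the hardware stream detector
 * may pace too conservatively.  PREFETCH_AHEAD is an illustrative knob,
 * not a value from the talk. */
#define PREFETCH_AHEAD 8

double sum_strided(const double *a, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_AHEAD * stride;
        if (ahead < n)
            __builtin_prefetch(&a[ahead], 0, 1);  /* 0 = read, 1 = low temporal locality */
        s += a[i];
    }
    return s;
}
```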
Slide 25: Memory Physical Organization
- DIMMs are populated in quads
- A single NOVA per DIMM (or per pair of DIMMs)
- DIMMs may be 1, 2, or 4 ranks
- Actual bandwidth depends on: DIMM speed and type; number of ranks/DIMMs per channel; DIMM channel bandwidth; MC read/write bandwidth (from/to buffers); and the pattern of reads and writes presented to the MC

Slide 26: POWER5 vs. POWER6 Bus Architecture
(Diagram: four-chip POWER5 and POWER6 configurations; the legend distinguishes data buses, address buses, and combined buses.)

Slide 27: POWER6 Eight-Processor Node
- The node is the building block for larger systems; a 64-way system has 8 nodes
- Eight off-node SMP links fully connect 8 nodes in a 64-way system
- An additional link is provided for RAS
- Can be used to build a 128-core SMP
(Diagram: four POWER6 chips per node, each with two L3 chips, connected by x/y/z intra-node buses and a/b off-node links.)

Slide 28: POWER6 64-Way High-End System
(System diagram.)

Slide 29: POWER6 32-Way
(Interconnect diagram: 16 chips arranged in four quads, connected by XYZ, A, and B buses, with GX I/O controllers.)

Slides 30-31
(Diagrams only; no transcript text beyond the label "HV8".)

Slide 32: Snoop Filtering: Broadcast Localized Adaptive-Scope Transaction Protocol (BLAST)
- The most significant microarchitectural enhancement to the P6 nest
- Coherence broadcasts may be speculatively limited to a subset (node, or possibly chip) of the SMP
- A NUMA-directory-like combination of cache states and domain-state information in memory provides the ability to resolve coherence without a global broadcast
- Provides significant leverage for software that exhibits node- or chip-level processor/memory affinity
- Several system designs were significantly cost-reduced by cutting the bandwidth of the chip interconnect

Slide 33: Snoop Filtering: Local/System/Double Pump
- System pump: send the request to the entire system. Longer latency within the local node, shorter latency to remote nodes, and address bandwidth is burned on all nodes.
- Local pump: send the request to the local node only. Better latency if the data is found locally, and only local XYZ address bandwidth is used; if the line is not found, the request is re-sent to the entire system (double pump), adding 100-200 processor clocks of latency depending on system type.
- The local/system predictor uses thread ID, request type, and memory address.

Slide 34: Tracking Off-Node Copies
- T/Te/Tn/Ten cache states: e = not dirty, n = no off-node copies
- In/Ig: an on- or off-node request took the line away; the Ig state is cast out to the memory directory bit
- Memory directory: an extra bit per cache line records whether off-node copies exist; the MC updates it when sourcing data off-node. This allows the node pump to work (the MC can provide data), and from Ig or the memory directory it is always known whether off-node copies exist.
- Double pump: no on-node cache has any information about the line; if the memory directory says off-node copies exist, data is returned marked not valid and a second pump is required

Slide 35: POWER6 Summary
- Extends POWER leadership for both technical and commercial computing
- Approximately twice the frequency of P5+, with similar instruction pipeline length and dependency latency: 6-cycle FP-to-FP use, the same as P5+
- Mostly in-order, but with OOO-like features (LTB, LLA)
- 5-instruction dispatch group from one thread, up to 7 across both threads, with one branch at any position
- Significant improvement in the cache/memory latency profile: more cache with higher associativity, and lower latency to all levels of cache and to memory
- Enhanced datastream prefetching with adjustable depth, store-stream prefetching, and load-look-ahead data prefetching
- Advanced simultaneous multithreading gives an added dimension to chip-level performance
- Predictive subspace snooping for vastly increased SMP coherence throughput
- Continued technology exploitation: CMOS 65 nm Cu SOI technology; all buses scale with processor frequency; enhanced power management
- Advanced server virtualization
- Advanced RAS: robust error detection, hardware checkpoint and recovery