power edge of network(tm) processor...– area: 410 mm2 – 100 array types, ~300 instances – 124...
TRANSCRIPT
© 2010 IBM Corporation
The IBM Power Edge of Network™ Processor:A wire-speed System-on-a-Chip with 16 Power™ cores / 64 threads and optimized HW acceleration
Jeffrey D. Brown Distinguished Engineer and IBM Academy of Technology MemberIBM [email protected]
Co-authors: Sandra Woodward, Brian Bass, Charlie Johnson
© 2010 IBM Corporation
Systems and Technology Group
2 IBM PowerEN™
Overview of IBM PowerEN™ Processor Chip 64-bit PowerPC Architecture
Virtualization Support
Dual DDR3 DRAM Controllers
Optimized Ethernet Offload Engine
Integrated PCI-Express bus
Cryptography Unit
Regular Expression Unit
XML Processing Unit
Compression Unit
Upward Scalability – 4 Chip
Downward scalability - Subset
PowerEN™ SoC Block Diagram
PowerEN™ has all basic IP needed for user/data plane and control plane t raffic processing
Pow
erB
us
DRAM Controller
Packet processing
engine
4 x 10GbE
MAC
PCIe
ExternalPowerBus
Crypto Engine
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
DRAM Controller
RegEx Engine
Compress Engine
XML Engine
© 2010 IBM Corporation
Systems and Technology Group
3 IBM PowerEN™
IBM PowerEN™: Targeted at the Edge-of-Network
Power efficient Throughput computing
– Database Acceleration
– Service Oriented Architecture Acceleration
– Secure Multi-tenant Cloud Computing
Enhanced processing of data payloads
– Low latency message for Financial Information Exchange
Deeper networking functions
– Cyber Security
– Network Intrusion Prevention
Application targeted at "smarter planet" solutions
– Compartmentalized streamed analytics
– Data Reduction in storage subsystems
© 2010 IBM Corporation
Systems and Technology Group
4 IBM PowerEN™
Chip statistics:– 45nm - ultra-low k BEOL– Area: 410 mm2– 100 array types, ~300 instances– 124 RLMs, 1044 instances– 1.43B Transistors– 3.2 million latches– 5 PLLs, 10 concurrent frequencies
Package Technology– 50mm FCPBGA (6) build-up layers – C4 Pitch: 148 x 185 mm pitch
Power Delivery– Deep Trench Chip Capacitors– Adaptive Power Supply Strategy– 8 voltages domains
IBM PowerEN™ Chip
© 2010 IBM Corporation
Systems and Technology Group
5 IBM PowerEN™
Interconnect Architecture
“All Peers” architecture– Accel. and I/O are first class citizens
Proven Power-Bus architecture– Independent CMD Network (one/cycle)
– Two north, two south 16B data busses
– ECC protected data paths
64 Byte Cache Line Cache Injection
– Packets flow to / from Caches
– New PBus commands
Pow
erB
us
DRAM Controller
Packet processing
engine
4 x 10GbE
MAC
PCIe
ExternalPowerBus
Crypto Engine
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
DRAM Controller
RegEx Engine
Compress Engine
XML Engine
1.75 GHz operation– Asynchronous connection to AT Nodes and accelerators via PBICs
– Synchronous connection to DRAM controllers
– Three 4B 2.5 GHz EI3 external links (1,2, or 4 chip systems)
© 2010 IBM Corporation
Systems and Technology Group
6 IBM PowerEN™
PowerPC™ Processing Element Architecture PowerPC 64 architecture – Embedded
Enhanced for Co-processor/I/O interface
4 way Fine Grained SMT
In Order Dispatch and Execution
2 way concurrent issue.
– 1 Integer + 1 FPU instruction per cycle – Different threads
Unified fixed point, ld/st, and branch unit
16KB L1 Data Cache – 4 Way Assoc.
16KB L1 Instr Cache – 8 Way Assoc.
12 Stage Pipe 27 FO4 design (7 XU, 5 IU, 6 FU)
Fully associative I and D – ERAT
MMU: 512 Entry TLB w/ Hardware Table Walk
Hypervisor / Virtualization - One logical partition per core
Pow
erB
us
DRAM Controller
Packet processing
engine
4 x 10GbE
MAC
PCIe
ExternalPowerBus
Crypto Engine
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
DRAM Controller
RegEx Engine
Compress Engine
XML Engine
© 2010 IBM Corporation
Systems and Technology Group
7 IBM PowerEN™
DIR arb/ctl
L2 Slice
RCRCRCDIR
Reservation
eDRAMeDRAMeDRAMeDRAM
RCRCRCDIRDIR arb/ctl
L2 Slice
RCRCRCDIR
Reservation
eDRAMeDRAMeDRAMeDRAM
RCRCRCDIRDIR arb/ctl
L2 Slice
RCRCRCDIR
Reservation
eDRAMeDRAMeDRAMeDRAM
RCRCRCDIR
CXBAR 4x4
BIU
PB Interface
A2Core
CIU
A2Core
CIU
A2Core
CIU
A2Core
CIU
DIR arb/ctl
L2 Slice
RCRCRCDIR
Reservation
eDRAMeDRAMeDRAMeDRAM
RCRCRCDIR
4 cores x 4 slices of shared L2–512KB per slice–Concurrent reload data to all 4 cores
1:1 with processor cycle time
2MB eDRAM (total) –64B cache lines – Inclusive of L1 I & D caches–8 way set associative
Fast core wake up on reservation loss
ECC (SEC/DED) Data and on Directory
Line locking & Way locking
Slave memory region (non-coherent)
Cache injection (full & partial line)
Power Saving Mode: Rip Van Winkle
PowerPC™ Processing Element Architecture
© 2010 IBM Corporation
Systems and Technology Group
8 IBM PowerEN™
DDR3 DRAM Controller / PHY
Two independent DDR3 DRAM controllers
–2 independent channels / controller
–Registered RDIMMs or unbuffered UDIMMs
–Up to two DIMMs per channel and 1, 2 or 4 ranks on DIMM
–Attach up to 32 GB per DRAM controller
–64 byte block ECC matches cacheline size
–800 MHz, 1.066 GHz, 1.33 GHz, 1.6GHz DRAMS
PBus Interface
–One PBus interface (C/D) per controller
–Capable of 16 Bytes outbound per PBus cycle and 16 Bytes inbound per PBus cycle
DDR3 Memory
DRAMC0 DRAMC1
1.75 GHz
400 / 533 / 667 / 800 MHz
1.75 GHz 1.75 GHz
Async fifoAsync fifo
800 / 1066 / 1333 / 1600 Mbps data
PBUS
DDR3PHY
DDR3PHY
DDR3PHY
DDR3PHY
IOM x4
MC
x2
© 2010 IBM Corporation
Systems and Technology Group
9 IBM PowerEN™
Accelerator (Co-Processor) Architecture
Objectives – Performance– Common API (Ease of Use)– QOS – Virtualization – Protection
Common Architecture for all Accelerators Integrated in Power Architecture
– New Initiate Co-Processor Instruction– New Wait on Loss of Reservation
– (Thread wake up)– Application Context communicated
to Co-Processor– Access Control
L2 Cache Intervention / Injection Accelerator MMU derived from Processor MMU
– Accelerators operate in Application Address Space
Pow
erB
us
DRAM Controller
Packet processing
engine
4 x 10GbE
MAC
PCIe
ExternalPowerBus
Crypto Engine
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
DRAM Controller
RegEx Engine
Compress Engine
XML Engine
© 2010 IBM Corporation
Systems and Technology Group
10 IBM PowerEN™
Accelerator Interface
Software
A2
L2
PB
us
PBICDMA
Engine
Engine
Accelerator Unit
1. Software receives an input packet
2. Software builds the CPB and CRB in cache
3. Software issues the ICSWX instruction
4. L2: ICSWX => Cop_Req PBus command
5. The PBus transports the Cop_Req
6. The PBIC passes the CRB and Cop_Req to Accelerator
7. The DMA logic assigns the Algorithm Engine and fetches data and parameters
8. The DMA logic and the Algorithm Engine work together to process the data and generate the output data
9. The DMA logic stores the output data
10.The DMA logic stores the status into the CSB
11.The DMA logic performs the final handshake
12.The SW retrieves the output status and data
© 2010 IBM Corporation
Systems and Technology Group
11 IBM PowerEN™
Compression/DecompressionPBus
Input Buffer 1KB
History Prefetch 16B X 16
Output Buffer 1KB
CRC-32Adler-32
Inbound Wr/Rd CTL
Outbound Wr/Rd
CTL
Data Engine
PBIC
Decomp Engine
Huffman
Data DCD
DHT
FHT
History Buffer 4KB
LZ77 DCD
Fixed Huffman Coder
String Search CAM
LZ77 Code
Comp Engine
Standards– Supports file formats defined by
RFC1950 (ZLIB) and RFC1952 (GZIP)– Compliant to RFC1951 and DEFLAT
Pipelined data engine– Deep pipelines to minimize latency and
increase bandwidth– 8 Gbps output from engine (Decomp)– 8 Gbps input to engine (Comp)
Decompression– Supports interleaved messages, packet
by packet decompression– Static & Dynamic Huffman decoding
supported Compression
– Supports single messages to be compressed
– Static Huffman coding support
© 2010 IBM Corporation
Systems and Technology Group
12 IBM PowerEN™
Crypto Data Mover
Symmetric Algorithm Acceleration– AESModes:
– Key Lengths: 128b,192b,256b– DES & 3DES Modes– ARC4, Kasumi– HASH: SHA-1,SHA-256,SHA-512, MD5
– HMAC supported for SHA– Combined 3DES/AES and SHA
Asymmetric Algorithm Acceleration – Modular Math Functions for RSA/ECC– Point Functions for ECC – RSA lengths: 512b,1024b,2048b,4096b
Pbus
DMA
AMF
Algorithm and ADM engines
…ADMAES … SHA DES … SHA …
…...DMA ...
DMA ...
2.3Ghz
Command Queues
PBIC
1.15Ghz
Asynchronous Data Mover (ADM)– Any Source Byte offset, Any Dest Byte offset, Any length up to 16M bytes
Random Number Generator (RNG)– Supplies a 64b random number, Supports FIPS 140 compliance
© 2010 IBM Corporation
Systems and Technology Group
13 IBM PowerEN™
XML Unit / XML Engines Fully asynchronous offload operation Four Parsing Engines
–Performs lexical analysis–Checks well-formness–Normalizes whitespace–Seamlessly switches state to process
interleaved document fragments–Processes multiple characters
simultaneously–Supports multiple character encodings
Four Post Processing Engines–XPATH evaluation–Schema validation–Filtering in hardware (reject)–Process a fragment of incoming document–XSLT processing, XML Routing Algorithm Engine
PBIC
XML Engine 3XML
Engine 2XML Engine 1XML
Engine 0
PXL Cache32KB
QNAME Cache
16B
QNAME Assigner
SessionContext Cache16KB
Local Output Storage
Big2LittleEndian Handler
Data Engine
Outbound Wr/Rd
Inbound Wr/Rd
Message Control
© 2010 IBM Corporation
Systems and Technology Group
14 IBM PowerEN™
RegX (Pattern Matching Engine)
Processes 8 CRB’s in parallel
Four independent Physical Lanes; each composed of 4 programming state machines (BFSM)
– Each lane is time multiplex by two logical lanes
– Each BFSM is connected to 32KB of SRAM (total of 512KB)
SRAM holds resident rules (SW managed cache) and temporary rules (HW managed cache)
Strong dependency between hardware, complier, patterns and workload
BFSM0
Classifier Physical Lane [0..3]
ResultsProc
BFSM1
BFSM2
BFSM3
PBIC
BIU
Xbar
CMD UNIT
UploadManager [0..7]
BFSM
32KB SRAM
© 2010 IBM Corporation
Systems and Technology Group
15 IBM PowerEN™
IBM PowerEN™ Packet Processing FrameworkParallel Processing
Packet Pre-Processor
Accelerators
Processing Element
Packet ClassificationFiltering Traffic managementBuffer managementIngress Queue SelectionThread Scheduling
Packet Post -Processor
Processing Element
Processing Element
Interconnect
Centralized Processing Centralized Processing
Packet Ordering Traffic managementBuffer managementIngress Queue SelectionThread Scheduling
Data Plane:Bulk of Protocol ProcessingPacket ForwardingPacket AlterationEgress Queue Selection
Crypto – RegEx – XMLCompDecomp
Control Plane:Call Set -upRoutingMobilityManagement
© 2010 IBM Corporation
Systems and Technology Group
16 IBM PowerEN™
Offload Centralized Media Speed Functions –Packet Classification /Distribution / Ordering –BFSM based Parser–Virtualization - Up to 16 LPARs,
Integrated L2 virtual Switch, PVID END POINT MODE: L4+ termination
–Pull model Software interface (128 Queues)–Scatter/gather descriptors–Low latency Queues, Header separation–TCP/UDP IPv4, v6 Checksum assist–QOS support (ingress Queue selection)
NETWORK NODE MODE: Packet forwarding –Push model Software interface (64 Queues)–HW managed queues – Ingress – Egress Scheduler (Flexible ingress queue) selection–Completion Unit ( Packet ordering – 16K packets)–Thread to Thread messaging with ordering
Packet Processor Architecture
x2 PHY
Host Interface
10GE + 1GE MAC 1GE MAC
PBIC
Enet 10GE1GE
Enet 10GE1GE
Enet 10GE1GE
Enet 10GE1GE
x8 PHY x8 PHY
© 2010 IBM Corporation
Systems and Technology Group
17 IBM PowerEN™
PCI-Express Two PCIe ports PCIe Gen 2 Features
–5Gb/sec per lane/direction–Max Payload: 512B, Max Read: 4K
Root Port Mode - PHB–IODA-based definition–TCE based Address Translation
–TCE cache: 64-entry 4-way–Inbound MSI Validation
Endpoint Mode–PCIe SRIOV Virtualization–DMA Follows accel. SW model
–Engaged via Coprocessor Request –Similar compl/status reporting –Tx and Rx data streaming engines
–Separate Doorbell and Interrupts per PCIe PF/VF–Dedicated Mailbox space
Port 0
X8 PHY
Port 1 PBIC
X16 Pipe X8 Pipe X1 Pipe
HSS Switch
X8 PHY X4 PHY
X1 PCSX16 PCS
PCIE Protcol Stack
TXRX Buff
RC
EP
TCE
INTxIODA
DMA DB/MB
PCIE Protcol Stack
TXRX Buff
RC TCE
INTxIODA
Pkt Proc
© 2010 IBM Corporation
Systems and Technology Group
18 IBM PowerEN™
IBM PowerEN™ Processor Chip
Targeted at the Edge -of-network
Power efficient Throughput computing
Enhanced processing of data payloads
Deeper networking functions
Application targeted at "smarter planet" solutions
PowerEN™ SoC Block Diagram
PowerEN™ has all basic IP needed for user/data plane and control plane traffic processing at the Edge of Network
Pow
erB
us
DRAM Controller
Packet processing
engine
4 x 10GbE
MAC
PCIe
ExternalPowerBus
Crypto Engine
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
4 A2 coresL2 Cache
DRAM Controller
RegEx Engine
Compress Engine
XML Engine
© 2010 IBM Corporation
Systems and Technology Group
19 IBM PowerEN™
Thank You
© 2010 IBM Corporation
Systems and Technology Group
20 IBM PowerEN™
Glossary
BFSM – Binary Finite State Machine BIU – Bus Interface Unit CCB – Coprocessor Completion Block CIU – Core Interface Unit Cop_Req – Coprocessor Request Power Bus Command CPB – Coprocessor Parameter Block CRB – Coprocessor Request Block CSB – Coprocessor Status Block DHT – Dynamic Huffman Table FHT – Fixed Huffman Table HSS – High Speed Serial ICSWX – Initiate Coprocessor Store Word Inde Xed LPAR – Logical Partition PBIC – Power Bus Interface Controller PBus – Power Bus PVID – Partition Virtual Identifier