© 2010 IBM Corporation

The IBM Power Edge of Network™ Processor: A wire-speed System-on-a-Chip with 16 Power™ cores / 64 threads and optimized HW acceleration

Jeffrey D. Brown, Distinguished Engineer and IBM Academy of Technology Member, IBM Corporation, [email protected]

Co-authors: Sandra Woodward, Brian Bass, Charlie Johnson




Systems and Technology Group


Overview of IBM PowerEN™ Processor Chip

64-bit PowerPC Architecture

Virtualization Support

Dual DDR3 DRAM Controllers

Optimized Ethernet Offload Engine

Integrated PCI-Express bus

Cryptography Unit

Regular Expression Unit

XML Processing Unit

Compression Unit

Upward scalability – up to 4 chips

Downward scalability – Subset

PowerEN™ SoC Block Diagram

PowerEN™ has all basic IP needed for user/data plane and control plane traffic processing

[Figure: PowerEN™ SoC block diagram. Blocks shown: PowerBus; four groups of 4 A2 cores, each with an L2 Cache; two DRAM Controllers; Packet processing engine with 4 x 10GbE MAC; PCIe; External PowerBus; Crypto Engine; RegEx Engine; Compress Engine; XML Engine.]


IBM PowerEN™: Targeted at the Edge-of-Network

Power efficient Throughput computing

– Database Acceleration

– Service Oriented Architecture Acceleration

– Secure Multi-tenant Cloud Computing

Enhanced processing of data payloads

– Low latency message for Financial Information Exchange

Deeper networking functions

– Cyber Security

– Network Intrusion Prevention

Application targeted at "smarter planet" solutions

– Compartmentalized streamed analytics

– Data Reduction in storage subsystems


Chip statistics:
– 45nm, ultra-low k BEOL
– Area: 410 mm²
– 100 array types, ~300 instances
– 124 RLMs, 1044 instances
– 1.43B transistors
– 3.2 million latches
– 5 PLLs, 10 concurrent frequencies

Package Technology
– 50mm FCPBGA, (6) build-up layers
– C4 pitch: 148 x 185 µm

Power Delivery
– Deep trench chip capacitors
– Adaptive power supply strategy
– 8 voltage domains

IBM PowerEN™ Chip


Interconnect Architecture

“All Peers” architecture
– Accelerators and I/O are first-class citizens

Proven Power-Bus architecture
– Independent CMD network (one/cycle)
– Two north, two south 16B data busses
– ECC protected data paths

64 Byte Cache Line

Cache Injection
– Packets flow to / from caches
– New PBus commands


1.75 GHz operation
– Asynchronous connection to AT Nodes and accelerators via PBICs
– Synchronous connection to DRAM controllers
– Three 4B 2.5 GHz EI3 external links (1, 2, or 4 chip systems)
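The bus widths and clock rates listed above imply raw peak bandwidths that are easy to work out. The sketch below is only back-of-the-envelope arithmetic: it assumes each 16B data bus moves 16 bytes on every 1.75 GHz cycle and each 4B EI3 link moves 4 bytes on every 2.5 GHz cycle. The slides do not state sustained rates, so these are upper bounds, and the function name is illustrative.

```python
def peak_gbytes_per_s(width_bytes: int, clock_ghz: float, count: int = 1) -> float:
    """Raw peak bandwidth in GB/s: bytes per cycle x cycles per ns x number of buses."""
    return width_bytes * clock_ghz * count


# Two north + two south 16B data busses at 1.75 GHz (assumed: one transfer per cycle).
pbus_data = peak_gbytes_per_s(16, 1.75, count=4)

# Three 4B EI3 external links at 2.5 GHz (assumed: one transfer per cycle).
ei3_links = peak_gbytes_per_s(4, 2.5, count=3)

print(f"On-chip PBus data busses: {pbus_data:.0f} GB/s raw peak")
print(f"External EI3 links:       {ei3_links:.0f} GB/s raw peak")
```

Under these assumptions the four on-chip data busses total 112 GB/s and the three external links total 30 GB/s.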


PowerPC™ Processing Element Architecture

PowerPC 64 architecture – Embedded

Enhanced for Co-processor/I/O interface

4 way Fine Grained SMT

In Order Dispatch and Execution

2 way concurrent issue
– 1 Integer + 1 FPU instruction per cycle
– Different threads

Unified fixed point, ld/st, and branch unit

16KB L1 Data Cache – 4 Way Assoc.

16KB L1 Instr Cache – 8 Way Assoc.

12 Stage Pipe 27 FO4 design (7 XU, 5 IU, 6 FU)

Fully associative I and D – ERAT

MMU: 512 Entry TLB w/ Hardware Table Walk

Hypervisor / Virtualization - One logical partition per core
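The dual-issue rule above (one integer and one FPU instruction per cycle, taken from different threads of an in-order core) can be illustrated with a toy selection function. This is a sketch under stated assumptions, not the A2 issue logic: the round-robin priority, the `"int"`/`"fpu"` labels, and the name `pick_issue` are all hypothetical.

```python
from typing import List, Optional, Tuple


def pick_issue(ready: List[Optional[str]], start: int) -> Tuple[Optional[int], Optional[int]]:
    """Choose the threads that issue this cycle.

    ready[t] is the type of thread t's next in-order instruction:
    "int", "fpu", or None if the thread is stalled.
    Returns (int_thread, fpu_thread). Because each thread offers only its
    next instruction, the two picks always come from different threads.
    """
    n = len(ready)
    order = [(start + i) % n for i in range(n)]  # assumed round-robin priority
    int_thread = next((t for t in order if ready[t] == "int"), None)
    fpu_thread = next((t for t in order if ready[t] == "fpu"), None)
    return int_thread, fpu_thread


# Four hardware threads: thread 3 is stalled, thread 1 has an FPU op next.
ready = ["int", "fpu", "int", None]
print(pick_issue(ready, start=0))  # (0, 1)
print(pick_issue(ready, start=2))  # (2, 1)
```

With 4 threads on each of the 16 cores, the chip exposes the 64 threads named in the title.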


[Figure: L2 cache block diagram. Labels: four A2 Cores, each with a CIU; CXBAR 4x4; four L2 Slices, each with DIR arb/ctl, RC, DIR, Reservation, and eDRAM blocks; BIU; PB Interface.]

4 cores x 4 slices of shared L2
– 512KB per slice
– Concurrent reload data to all 4 cores

1:1 with processor cycle time

2MB eDRAM (total)
– 64B cache lines
– Inclusive of L1 I & D caches
– 8 way set associative

Fast core wake up on reservation loss

ECC (SEC/DED) on data and on directory

Line locking & Way locking

Slave memory region (non-coherent)

Cache injection (full & partial line)

Power Saving Mode: Rip Van Winkle
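The L2 figures above fit together arithmetically. The sketch below derives the number of lines and sets per slice, assuming the 8-way associativity applies within each 512KB slice and that sets are indexed with a power-of-two number of address bits; neither detail is stated on the slide.

```python
KIB = 1024

SLICES = 4
SLICE_BYTES = 512 * KIB   # 512KB per slice
LINE_BYTES = 64           # 64B cache lines
WAYS = 8                  # 8 way set associative (assumed to be per slice)

total_bytes = SLICES * SLICE_BYTES
lines_per_slice = SLICE_BYTES // LINE_BYTES
sets_per_slice = lines_per_slice // WAYS

# For power-of-two sizes, bit_length() - 1 is log2.
offset_bits = LINE_BYTES.bit_length() - 1
index_bits = sets_per_slice.bit_length() - 1

print(f"total L2:        {total_bytes // (KIB * KIB)} MB")
print(f"lines per slice: {lines_per_slice}")
print(f"sets per slice:  {sets_per_slice}")
print(f"address split:   {offset_bits} offset bits, {index_bits} index bits per slice")
```

Four 512KB slices give the 2MB total quoted on the slide, with 8192 lines and 1024 sets in each slice.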

PowerPC™ Processing Element Architecture


DDR3 DRAM Controller / PHY

Two independent DDR3 DRAM controllers
– 2 independent channels / controller
– Registered RDIMMs or unbuffered UDIMMs
– Up to two DIMMs per channel and 1, 2 or 4 ranks per DIMM
– Attach up to 32 GB per DRAM controller
– 64 byte block ECC matches cache line size
– 800 MHz, 1.066 GHz, 1.33 GHz, 1.6 GHz DRAMs

PBus Interface
– One PBus interface (C/D) per controller
– Capable of 16 Bytes outbound per PBus cycle and 16 Bytes inbound per PBus cycle
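To relate the DRAM data rates to the PBus interface width, the sketch below computes raw peak figures. It assumes a 64-bit (8-byte) data path per DDR3 channel, which is the standard DIMM data width but is not stated on the slide, and it ignores ECC bits, refresh, and protocol overhead.

```python
CHANNEL_DATA_BYTES = 8        # assumption: standard 64-bit DDR3 DIMM data path
CHANNELS_PER_CONTROLLER = 2
CONTROLLERS = 2
DIMMS_PER_CHANNEL = 2


def channel_mbytes_per_s(mega_transfers_per_s: int) -> int:
    """Raw peak MB/s of one channel at the given data rate."""
    return mega_transfers_per_s * CHANNEL_DATA_BYTES


per_channel = channel_mbytes_per_s(1600)               # fastest supported grade
per_controller = per_channel * CHANNELS_PER_CONTROLLER
chip_total = per_controller * CONTROLLERS

# PBus interface: 16 bytes per 1.75 GHz cycle in each direction.
pbus_per_direction = 16 * 1750                         # MB/s

# 32 GB per controller spread over 2 channels x 2 DIMMs.
gb_per_dimm = 32 // (CHANNELS_PER_CONTROLLER * DIMMS_PER_CHANNEL)

print(per_channel, per_controller, chip_total, pbus_per_direction, gb_per_dimm)
```

Under these assumptions one controller peaks at 25,600 MB/s, just under the 28,000 MB/s its PBus interface can carry in each direction, and the 32 GB limit corresponds to four 8 GB DIMMs per controller.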

[Figure: DDR3 memory subsystem. Labels: PBUS at 1.75 GHz; DRAMC0 and DRAMC1, each behind an async FIFO; MC x2; DDR3 PHY blocks (IOM x4); controller clock 400 / 533 / 667 / 800 MHz; 800 / 1066 / 1333 / 1600 Mbps data to DDR3 memory.]


Accelerator (Co-Processor) Architecture

Objectives
– Performance
– Common API (Ease of Use)
– QOS
– Virtualization
– Protection

Common Architecture for all Accelerators

Integrated in Power Architecture
– New Initiate Co-Processor Instruction
– New Wait on Loss of Reservation (Thread wake up)
– Application Context communicated to Co-Processor
– Access Control

L2 Cache Intervention / Injection

Accelerator MMU derived from Processor MMU
– Accelerators operate in Application Address Space


Accelerator Interface

[Figure: accelerator interface. Labels: Software; A2; L2; PBus; PBIC; Accelerator Unit containing DMA and Engines.]

1. Software receives an input packet

2. Software builds the CPB and CRB in cache

3. Software issues the ICSWX instruction

4. L2: ICSWX => Cop_Req PBus command

5. The PBus transports the Cop_Req

6. The PBIC passes the CRB and Cop_Req to Accelerator

7. The DMA logic assigns the Algorithm Engine and fetches data and parameters

8. The DMA logic and the Algorithm Engine work together to process the data and generate the output data

9. The DMA logic stores the output data

10. The DMA logic stores the status into the CSB

11. The DMA logic performs the final handshake

12. The SW retrieves the output status and data
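The twelve steps above can be mimicked in software to show which side owns which data structure. The following is a purely illustrative model, not the PowerEN programming interface: the class layouts, the `icswx()` and `accelerator()` functions, and the use of `zlib.compress` as a stand-in algorithm engine are all assumptions, and the call is synchronous whereas the hardware path is asynchronous.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class CSB:
    """Coprocessor Status Block: written by the accelerator."""
    valid: bool = False
    status: int = 0


@dataclass
class CRB:
    """Coprocessor Request Block: built by software (step 2)."""
    params: dict       # stands in for the CPB (Coprocessor Parameter Block)
    src: bytes         # input data the DMA logic will fetch
    dst: bytearray     # buffer the DMA logic will fill
    csb: CSB = field(default_factory=CSB)


def accelerator(crb: CRB) -> None:
    """Steps 7-11: fetch data and parameters, process, store output, then status."""
    out = zlib.compress(crb.src, crb.params["level"])  # stand-in algorithm engine
    crb.dst[:] = out          # step 9: store the output data
    crb.csb.status = 0        # step 10: store the status into the CSB
    crb.csb.valid = True      # step 11: final handshake (modelled as a flag)


def icswx(crb: CRB) -> None:
    """Steps 3-6: issue the request; the PBus and PBIC are collapsed into one call."""
    accelerator(crb)


packet = b"edge of network " * 8                              # step 1
crb = CRB(params={"level": 6}, src=packet, dst=bytearray())   # step 2
icswx(crb)                                                    # step 3
if crb.csb.valid:                                             # step 12
    print(f"{len(packet)} bytes in, {len(crb.dst)} bytes out, status {crb.csb.status}")
```

The point of the model is the ownership split: software writes the CRB and CPB, the accelerator writes the output buffer and CSB, and software only reads results after the CSB is marked valid.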


Compression/Decompression

[Figure: compression/decompression unit. Labels: PBus; PBIC; Data Engine with Inbound Wr/Rd CTL, Outbound Wr/Rd CTL, Input Buffer 1KB, Output Buffer 1KB, History Prefetch 16B x 16, CRC-32, Adler-32; Decomp Engine with Huffman Data DCD, DHT, FHT, LZ77 DCD, History Buffer 4KB; Comp Engine with String Search CAM, LZ77 Code, Fixed Huffman Coder.]

Standards
– Supports file formats defined by RFC1950 (ZLIB) and RFC1952 (GZIP)
– Compliant to RFC1951 and DEFLATE

Pipelined data engine
– Deep pipelines to minimize latency and increase bandwidth
– 8 Gbps output from engine (Decomp)
– 8 Gbps input to engine (Comp)

Decompression
– Supports interleaved messages, packet by packet decompression
– Static & Dynamic Huffman decoding supported

Compression
– Supports single messages to be compressed
– Static Huffman coding support
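The three standards named above wrap the same DEFLATE stream in different containers, which is why the unit also carries CRC-32 and Adler-32 logic. The sketch below shows this in software with Python's standard `zlib` module; it says nothing about how the hardware is driven, and the helper name `deflate` is illustrative.

```python
import zlib


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress with a container chosen by wbits (-15 raw, 15 zlib, 31 gzip)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


payload = b"PowerEN " * 64

raw = deflate(payload, -15)   # RFC 1951: bare DEFLATE stream
zl = deflate(payload, 15)     # RFC 1950: ZLIB header + Adler-32 trailer
gz = deflate(payload, 31)     # RFC 1952: GZIP header + CRC-32 and length trailer

# The ZLIB trailer is the big-endian Adler-32 of the uncompressed data.
zlib_trailer = int.from_bytes(zl[-4:], "big")
# The GZIP trailer holds a little-endian CRC-32 followed by the input length.
gzip_crc = int.from_bytes(gz[-8:-4], "little")
gzip_len = int.from_bytes(gz[-4:], "little")

print(zlib_trailer == zlib.adler32(payload))
print(gzip_crc == zlib.crc32(payload), gzip_len == len(payload))
```

Each container decompresses back to the same payload when the matching `wbits` value is passed to `zlib.decompress`.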


Crypto Data Mover

Symmetric Algorithm Acceleration
– AES modes; key lengths: 128b, 192b, 256b
– DES & 3DES modes
– ARC4, Kasumi
– HASH: SHA-1, SHA-256, SHA-512, MD5
– HMAC supported for SHA
– Combined 3DES/AES and SHA

Asymmetric Algorithm Acceleration
– Modular Math Functions for RSA/ECC
– Point Functions for ECC
– RSA lengths: 512b, 1024b, 2048b, 4096b
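HMAC over SHA, listed above, is a fixed construction on top of the hash function: two nested hashes with the key XORed against constant pads. The sketch below builds HMAC-SHA-256 by hand and checks it against Python's standard `hmac` module. It is a software illustration of the algorithm only; the function name and sample key are made up, and nothing here reflects the crypto unit's interface.

```python
import hashlib
import hmac


def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    """HMAC built directly from SHA-256 (block size 64 bytes)."""
    block = 64
    if len(key) > block:
        key = hashlib.sha256(key).digest()   # long keys are hashed first
    key = key.ljust(block, b"\x00")          # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()
    return hashlib.sha256(opad + inner).digest()


key = b"illustrative-key"
msg = b"packet payload"
print(hmac_sha256(key, msg) == hmac.new(key, msg, hashlib.sha256).digest())
```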

[Figure: crypto data mover. Labels: PBus; PBIC; Command Queues; DMA blocks; AMF; algorithm and ADM engines (ADM, AES, SHA, DES); clock labels 2.3 GHz and 1.15 GHz.]

Asynchronous Data Mover (ADM)
– Any source byte offset, any destination byte offset, any length up to 16M bytes

Random Number Generator (RNG)
– Supplies a 64b random number
– Supports FIPS 140 compliance


XML Unit / XML Engines

Fully asynchronous offload operation

Four Parsing Engines
– Performs lexical analysis
– Checks well-formedness
– Normalizes whitespace
– Seamlessly switches state to process interleaved document fragments
– Processes multiple characters simultaneously
– Supports multiple character encodings

Four Post Processing Engines
– XPATH evaluation
– Schema validation
– Filtering in hardware (reject)
– Process a fragment of incoming document
– XSLT processing, XML Routing

[Figure: XML unit. Labels: PBIC; Algorithm Engine; XML Engine 0 to XML Engine 3; PXL Cache 32KB; QNAME Cache 16B; QNAME Assigner; Session Context Cache 16KB; Local Output Storage; Big2Little Endian Handler; Data Engine; Inbound Wr/Rd; Outbound Wr/Rd; Message Control.]
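Two of the functions listed above, the well-formedness check and XPath evaluation, have direct software equivalents that make the terms concrete. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the sample document and the helper name `is_well_formed` are invented for illustration and do not describe the hardware engines.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """True if the document parses, i.e. tags nest and close correctly."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


doc = (
    "<orders>"
    "<order id='1'><sym>IBM</sym></order>"
    "<order id='2'><sym>XYZ</sym></order>"
    "</orders>"
)

print(is_well_formed(doc))           # True
print(is_well_formed("<a><b></a>"))  # False: <b> is never closed

# XPath-style selection: the symbol of the order whose id attribute is 2.
root = ET.fromstring(doc)
symbols = [e.text for e in root.findall("./order[@id='2']/sym")]
print(symbols)                       # ['XYZ']
```

A filtering decision such as the hardware "reject" above would correspond to dropping any message for which the selection comes back empty.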


RegX (Pattern Matching Engine)

Processes 8 CRBs in parallel

Four independent Physical Lanes, each composed of 4 programmable state machines (BFSMs)

– Each physical lane is time-multiplexed between two logical lanes

– Each BFSM is connected to 32KB of SRAM (total of 512KB)

SRAM holds resident rules (SW managed cache) and temporary rules (HW managed cache)

Strong dependency between hardware, compiler, patterns and workload
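The lane and SRAM figures above can be cross-checked, and the idea of a rule table driving a state machine one input byte at a time can be shown with a small table-driven scanner. This is a generic deterministic automaton written for illustration; it is not the BFSM rule format, and the transition table, state numbering, and `scan` function are assumptions.

```python
PHYSICAL_LANES = 4
LOGICAL_PER_PHYSICAL = 2
BFSM_PER_LANE = 4
SRAM_KB_PER_BFSM = 32

parallel_crbs = PHYSICAL_LANES * LOGICAL_PER_PHYSICAL              # 8
total_sram_kb = PHYSICAL_LANES * BFSM_PER_LANE * SRAM_KB_PER_BFSM  # 512


def scan(data: bytes, table: dict, accept: int) -> list:
    """Run a table-driven state machine over data; return match end offsets."""
    state, hits = 0, []
    for offset, byte in enumerate(data):
        state = table.get((state, byte), 0)  # missing rule: fall back to state 0
        if state == accept:
            hits.append(offset)
            state = 0
    return hits


# Hypothetical rule table that recognises the byte string "abc".
TABLE = {
    (0, ord("a")): 1,
    (1, ord("a")): 1,
    (1, ord("b")): 2,
    (2, ord("a")): 1,
    (2, ord("c")): 3,
}

print(parallel_crbs, total_sram_kb)
print(scan(b"xxabcabxabc", TABLE, accept=3))  # [4, 10]
```

In this toy the whole rule set fits in a dictionary; the split between resident and temporary rules described above exists because real rule sets can exceed the 32KB of SRAM attached to each state machine.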

[Figure: RegX unit. Labels: PBIC; BIU; Xbar; CMD UNIT; Upload Manager [0..7]; Physical Lane [0..3], each with a Classifier, BFSM0 to BFSM3, and a Results Proc; each BFSM with 32KB SRAM.]


IBM PowerEN™ Packet Processing Framework

[Figure: packet processing framework. A Packet Pre-Processor and a Packet Post-Processor (centralized processing) are linked by the Interconnect to Processing Elements and Accelerators (parallel processing).]

Packet Pre-Processor (centralized processing): Packet Classification, Filtering, Traffic management, Buffer management, Ingress Queue Selection, Thread Scheduling

Packet Post-Processor (centralized processing): Packet Ordering, Traffic management, Buffer management, Ingress Queue Selection, Thread Scheduling

Processing Elements (parallel processing)
– Data Plane: Bulk of Protocol Processing, Packet Forwarding, Packet Alteration, Egress Queue Selection
– Control Plane: Call Set-up, Routing, Mobility Management

Accelerators: Crypto, RegEx, XML, Comp/Decomp


Packet Processor Architecture

Offload Centralized Media Speed Functions
– Packet Classification / Distribution / Ordering
– BFSM based Parser
– Virtualization: up to 16 LPARs, Integrated L2 virtual Switch, PVID

END POINT MODE: L4+ termination
– Pull model Software interface (128 Queues)
– Scatter/gather descriptors
– Low latency Queues, Header separation
– TCP/UDP IPv4, v6 Checksum assist
– QOS support (ingress Queue selection)

NETWORK NODE MODE: Packet forwarding
– Push model Software interface (64 Queues)
– HW managed queues – Ingress – Egress
– Scheduler (Flexible ingress queue selection)
– Completion Unit (Packet ordering – 16K packets)
– Thread to Thread messaging with ordering

[Figure: packet processor. Labels: Host Interface; PBIC; four Enet 10GE/1GE ports; 10GE + 1GE MAC; 1GE MAC; x2 PHY; two x8 PHYs.]
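The checksum assist listed above offloads the one's-complement Internet checksum (RFC 1071), which TCP, UDP, and the IPv4 header all use. The sketch below computes it in software and checks it against a commonly cited IPv4 header example; for TCP and UDP the same sum also covers a pseudo-header, which is omitted here. The function name is illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                               # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:                                # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# IPv4 header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
print(hex(checksum))  # 0xb861

# A receiver sums the header including the checksum and expects zero.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
print(internet_checksum(filled))  # 0
```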


PCI-Express

Two PCIe ports

PCIe Gen 2 Features
– 5 Gb/sec per lane/direction
– Max Payload: 512B, Max Read: 4K

Root Port Mode – PHB
– IODA-based definition
– TCE based Address Translation
  – TCE cache: 64-entry 4-way
– Inbound MSI Validation

Endpoint Mode
– PCIe SRIOV Virtualization
– DMA follows accelerator SW model
  – Engaged via Coprocessor Request
  – Similar completion/status reporting
  – Tx and Rx data streaming engines
– Separate Doorbell and Interrupts per PCIe PF/VF
– Dedicated Mailbox space
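The 5 Gb/sec per-lane figure above is a raw line rate. PCIe Gen 2 uses 8b/10b line coding, a property of the PCIe specification rather than something stated on the slide, so the usable rate is 80% of that before packet overhead. The sketch below works out the per-direction figure for a few link widths; the function name is illustrative.

```python
def gen2_mbytes_per_s(lanes: int, gbit_per_lane: float = 5.0) -> float:
    """Per-direction bandwidth after 8b/10b coding, before packet overhead."""
    payload_gbit = gbit_per_lane * 8 / 10   # 8 data bits per 10 line bits
    return lanes * payload_gbit * 1000 / 8  # Gb/s -> MB/s


for lanes in (1, 8, 16):
    print(f"x{lanes}: {gen2_mbytes_per_s(lanes):.0f} MB/s per direction")
```

With a 512B maximum payload, each transaction's header and framing bytes reduce the achievable rate further, so these values are upper bounds.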

[Figure: PCI-Express unit. Labels: PBIC; Port 0 and Port 1, each with a PCIE Protocol Stack, TX/RX Buff, RC, TCE, INTx, and IODA, and one port additionally with EP and DMA DB/MB; X16 Pipe, X8 Pipe, X1 Pipe; X16 PCS, X1 PCS; HSS Switch; X8 PHY, X8 PHY, X4 PHY; Pkt Proc.]


IBM PowerEN™ Processor Chip

Targeted at the Edge-of-Network

Power efficient Throughput computing

Enhanced processing of data payloads

Deeper networking functions

Application targeted at "smarter planet" solutions

PowerEN™ SoC Block Diagram

PowerEN™ has all basic IP needed for user/data plane and control plane traffic processing at the Edge of Network


Thank You


Glossary

BFSM – Binary Finite State Machine
BIU – Bus Interface Unit
CCB – Coprocessor Completion Block
CIU – Core Interface Unit
Cop_Req – Coprocessor Request Power Bus Command
CPB – Coprocessor Parameter Block
CRB – Coprocessor Request Block
CSB – Coprocessor Status Block
DHT – Dynamic Huffman Table
FHT – Fixed Huffman Table
HSS – High Speed Serial
ICSWX – Initiate Coprocessor Store Word Indexed
LPAR – Logical Partition
PBIC – Power Bus Interface Controller
PBus – Power Bus
PVID – Partition Virtual Identifier