Post on 19-Dec-2015
John Morgan
Infrastructure Processor Division
September 2004
Intel® IXP2XXX Network Processor Architecture Overview
IXP2400 External Features
[Figure: IXP2400 system diagram. Labeled blocks: IXP2400 (Ingress), IXP2400 (Egress), Flow Control Bus, Customer ASICs, Host CPU (Optional) on PCI 64-bit / 66 MHz, ATM / POS PHY or Ethernet MAC on a UTOPIA 1/2/3 or POS-PL2/3 interface, Flash on the Slow Port, Classification Accelerator on the CoProc Bus, Micro-Engine Clusters, and a Switch Fabric Port Interface (UTOPIA 1/2/3, SPI-3 (POS-PL3), CSIX).]
External Interfaces
MSF Interface supports UTOPIA 1/2/3, SPI-3 (POS-PL3), and CSIX.
Four independent, configurable, 8-bit channels with the ability to aggregate channels for wider interfaces.
Media interface can support channelized media on RX and 32-bit connect to Switch Fabric over SPI-3 on TX (and vice versa) to support Switch Fabric option.
2 Quad Data Rate SRAM channels.
A QDR SRAM channel can interface to Co-Processors.
1 DDR SDRAM channel.
PCI 64/66 Host CPU interface.
Flash and PHY Mgmt interface.
Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution.
[Figure labels: DDR DRAM, 2 GByte; QDR SRAM, 1.6 GB/s, 64 MByte; IXA SW.]
IXP2400
[Figure: IXP2400 block diagram. Labeled blocks: eight microengines (MEv2 1 to 8); Intel® XScale™ Core (32K IC, 32K DC); Rbuf 64 @ 128B; Tbuf 64 @ 128B; Hash 48/64/128; Scratch 16KB; QDR SRAM 1 and 2, each with E/D Q; DDRAM; Gasket; PCI (64b) 66 MHz; SPI3 or CSIX; CSRs (Fast_wr, UART, Timers, GPIO, BootROM/Slow Port). Bus width labels: 32b, 18, 72, 64b.]
IXP2400 Resources Summary
Half Duplex OC-48 / 2.5 Gb/sec Network Processor
(8) Multi-Threaded Microengines
Intel® XScale™ Core
Media / Switch Fabric Interface
PCI interface
2 QDR SRAM interface controllers
1 DDR SDRAM interface controller
8 bit asynchronous port
– Flash and CPU bus
Additional integrated features
– Hardware Hash Unit
– 16 KByte Scratchpad Memory
– Serial UART port
– 8 general purpose I/O pins
– Four 32-bit timers
– JTAG Support
IXP2800 External Features
[Figure: IXP2800 system diagram. Labeled blocks: IXP2800 (Ingress), IXP2800 (Egress), Flow Control Bus, Customer ASICs, Host CPU (Optional) on PCI 64-bit / 66 MHz, ATM / POS PHY or Ethernet MAC on SPI-4 or CSIX-L1, Flash on the Slow Port, Classification Accelerator on the CoProc Bus, Micro-Engine Clusters, and a Switch Fabric Port Interface (SPI-4, CSIX-L1).]
External Interfaces
Media Interface supports both SPI-4 and CSIX
4 Quad Data Rate (QDR) SRAM channels
Each channel can interface to Co-processors
3 RDRAM Channels
PCI 64/66 Host CPU interface
Flash and PHY Management interface
Dedicated inter-IXP channel to communicate fabric flow control information from egress to ingress for dual chip solution
[Figure labels: RDR DRAM, 50+ Gbps, 2 GByte total for 3 channels; QDR SRAM, 12.8 Gbps x 4, 64 MByte x 4 channels; IXA SW.]
IXP2800
[Figure: IXP2800 block diagram. Labeled blocks: sixteen microengines (MEv2 1 to 16); Intel® XScale™ Core (32K IC, 32K DC); Rbuf 64 @ 128B; Tbuf 64 @ 128B; Hash 48/64/128; Scratch 16KB; QDR SRAM 1 to 4, each with E/D Q; RDRAM 1 to 3; Stripe; Gasket; PCI (64b) 66 MHz; SPI4 or CSIX; CSRs (Fast_wr, UART, Timers, GPIO, BootROM/Slow Port). Bus width labels: 16b, 18, 64b.]
IXP2800 Resources Summary
Half Duplex OC-192 / 10 Gb/sec Network Processor
(16) Multi-Threaded Microengines
Intel® XScale™ Core
Media / Switch Fabric Interface
PCI interface
4 QDR SRAM Interface Controllers
3 Rambus* DRAM Interface Controllers
8 bit asynchronous port
– Flash and CPU bus
Additional integrated features
– Hardware Hash Unit for generating 48-, 64-, or 128-bit adaptive polynomial hash keys
– 16 KByte Scratchpad Memory
– Serial UART port for debug
– 8 general purpose I/O pins
– Four 32-bit timers
– JTAG Support
IXP2800 and IXP2400 Comparison
Feature | IXP2800 | IXP2400
Frequency | 1.4/1.0 GHz/650 MHz | 600/400 MHz
DRAM Memory | 3 channels RDRAM 800/1066 MHz; up to 2 GB | 1 channel DDR DRAM, 150 MHz; up to 2 GB
SRAM Memory | 4 channels QDR (or co-processor) | 2 channels QDR (or co-processor)
Media Interface | Separate 16-bit Tx & Rx configurable to SPI-4 P2 or CSIX_L1 | Separate 32-bit Tx & Rx configurable to SPI-3, UTOPIA 3 or CSIX_L1
Number of MicroEngines | 16 (MEv2) | 8 (MEv2)
Performance | Dual chip full duplex OC192 | Dual chip full duplex OC48
MicroEngine v2
[Figure: MicroEngine v2 block diagram. Labels: Control Store (4K/8K Instructions); two banks of 128 GPRs; Local Memory (640 words) with LM Addr 0 and LM Addr 1 (2 per CTX); 128 Next Neighbor registers (From Next Neighbor, To Next Neighbor); 128 S Xfer In and 128 D Xfer In; 128 S Xfer Out and 128 D Xfer Out; S-Push, D-Push, S-Pull, and D-Pull buses; 32-bit Execution Data Path (Multiply, Find first bit, Add, shift, logical) with A_Operand, B_Operand, Prev A, Prev B, and ALU_Out; CAM with TAGs 0-15, Lock 0-15, Status and LRU Logic (6-bit), Status, Entry#; CRC Unit and CRC remain; P-Random #; Timers; Timestamp; Other Local CSRs.]
Microengine v2 Features – Part 1
Clock Rates
– IXP2400: 600/400 MHz
– IXP2800: 1.4/1.0 GHz/650 MHz
Control Store
– IXP2400: 4K Instruction store
– IXP2800: 8K Instruction store
Configurable to 4 or 8 threads
– Each thread has its own program counter, registers, signal and wakeup events
– Generalized Thread Signaling (15 signals per thread)
Local Storage Options
– 256 GPRs
– 256 Transfer Registers
– 128 Next Neighbor Registers
– 640 32-bit words of local memory
Microengine v2 Features – Part 2
CAM (Content Addressable Memory)
– Performs parallel lookup on 16 32-bit entries
– Reports a 9-bit lookup result
  – 4 State bits (software controlled, no impact to hardware)
  – Hit: entry number that hit; Miss: LRU entry
  – 4-bit index of CAM entry (Hit) or LRU (Miss)
– Improves usage of multiple threads on same data
CRC hardware
– IXP2400: provides CRC_16, CRC_32
– IXP2800: provides CRC_16, CRC_32, iSCSI, CRC_10 and CRC_5
– Accelerates CRC computation for ATM AAL/SAR, ATM OAM and Storage applications
Multiply hardware
– Supports 8x24, 16x16 and 32x32
– Accelerates metering in QoS algorithms
  – DiffServ, MPLS
Pseudo Random Number generation
– Accelerates RED, WRED algorithms
64-bit Time-stamp and 16-bit Profile count
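The CAM behavior listed on this slide can be illustrated with a small software model. This is only a sketch, not Intel's implementation: the class name `Cam`, the helper `decode`, and the packing of the 9-bit result as state (4 bits), hit flag (1 bit), entry index (4 bits) are assumptions made for illustration, as is the choice to update LRU order on hits and writes only.

```python
class Cam:
    """Toy software model of a 16-entry CAM with LRU tracking (hypothetical)."""

    ENTRIES = 16

    def __init__(self):
        self.tags = [None] * self.ENTRIES     # one 32-bit tag per entry, None = empty
        self.state = [0] * self.ENTRIES       # 4 software-defined state bits per entry
        self.lru = list(range(self.ENTRIES))  # lru[0] is the least recently used entry

    def _touch(self, entry):
        # Mark an entry as most recently used.
        self.lru.remove(entry)
        self.lru.append(entry)

    def write(self, entry, tag, state=0):
        # Load a tag and its state bits into a chosen entry.
        self.tags[entry] = tag & 0xFFFFFFFF
        self.state[entry] = state & 0xF
        self._touch(entry)

    def lookup(self, tag):
        # Hardware compares all 16 tags in parallel; this model scans them.
        tag &= 0xFFFFFFFF
        for entry, stored in enumerate(self.tags):
            if stored == tag:
                self._touch(entry)
                return (self.state[entry] << 5) | (1 << 4) | entry
        # Miss: hit flag and state are 0, index is the LRU entry.
        return self.lru[0]


def decode(result):
    """Split the assumed 9-bit result into (state, hit, entry)."""
    return (result >> 5) & 0xF, bool((result >> 4) & 1), result & 0xF
```

On a miss, a thread would typically fetch the data and call `write` on the returned LRU entry, so that later threads looking up the same tag get a hit.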
Intel® XScale™ Core Overview
High-performance, Low-power, 32-bit Embedded RISC processor
Clock rate
– IXP2400: 600 MHz
– IXP2800: 700/500/325 MHz
32 Kbyte instruction cache
32 Kbyte data cache
2 Kbyte mini-data cache
Write buffer
Memory management unit
Web Switch Design Using Network Processors – NSF Project 2002-2005
Funded by NSF and Intel – Not Intel Confidential
L. Zhao, Y. Luo, L. Bhuyan and R. Iyer, "A Network Processor-Based Content Aware Switch," IEEE Micro, May/June 2006
Web Switch or Layer 5 Switch
Layer 4 switch
– Content blind
– Storage overhead
– Difficult to administer
Content-aware (Layer 5/7) switch
– Partition the server's database over different nodes
– Increase the performance due to improved hit rate
– Server can be specialized for certain types of request
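[Figure: a request from the Internet for www.yahoo.com (GET /cgi-bin/form HTTP/1.1, Host: www.yahoo.com…) arrives at the Switch, which fronts an Image Server, an Application Server, and an HTML Server. The packet is drawn as IP, TCP, and APP. DATA.]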
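A content-aware switch has to read the application data before it can pick a server. The following is a minimal sketch of that decision, assuming a routing table keyed on URL path prefix. The prefixes, the server names, and the functions `parse_request_line` and `choose_server` are hypothetical and do not come from the slides.

```python
def parse_request_line(request):
    """Return (method, path) from the first line of an HTTP request."""
    first_line = request.split("\r\n", 1)[0]
    method, path, _version = first_line.split(" ", 2)
    return method, path


# Hypothetical routing rules: URL path prefix -> back-end server.
ROUTES = [
    ("/cgi-bin/", "application-server"),
    ("/images/", "image-server"),
]
DEFAULT_SERVER = "html-server"


def choose_server(request):
    """Pick a back-end server based on the requested path (Layer 5/7 content)."""
    _method, path = parse_request_line(request)
    for prefix, server in ROUTES:
        if path.startswith(prefix):
            return server
    return DEFAULT_SERVER


if __name__ == "__main__":
    req = "GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
    print(choose_server(req))
```

A Layer 4 switch cannot make this choice, because it sees only addresses and ports, not the request path.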
Layer-7 Two-way Mechanisms
TCP gateway
Application level proxy on the web switch mediates the communication between the client and the server
TCP splicing
Reduce the overhead in TCP gateway by forwarding directly by OS
[Figure labels: kernel, user, kernel.]
TCP Splicing
Establish connection with the client
– Three-way handshake
Choose the server
Establish connection with the server
Splice two connections
Map the sequence for subsequent packets
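[Figure: time-sequence diagram between Client, Switch, and Server. Labels: SYN(C); SYN(D), ACK(C+1); ACK(D+1), Data(C+1); SYN(C); SYN(S), ACK(C+1); ACK(S+1), Data(C+1); ACK(C+len+1), Data(S+1), mapped S -> D to ACK(C+len+1), Data(D+1); ACK(D+len+1), mapped D -> S to ACK(S+len+1).]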
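The "map the sequence" step can be sketched as a fixed offset between the switch's initial sequence number D, which the client saw during its handshake, and the server's initial sequence number S. The class and method names below are hypothetical. The sketch covers only the server-side sequence space, because the diagram reuses the client's number C unchanged towards the server.

```python
SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits


class Splice:
    """Sequence-number mapping for one spliced connection (illustrative only)."""

    def __init__(self, switch_isn, server_isn):
        # Offset that turns a server sequence number into the one the client expects.
        self.delta = (switch_isn - server_isn) % SEQ_MOD

    def server_seq_to_client(self, seq):
        # S -> D: rewrite the sequence number on server-to-client packets.
        return (seq + self.delta) % SEQ_MOD

    def client_ack_to_server(self, ack):
        # D -> S: rewrite the acknowledgment number on client-to-server packets.
        return (ack - self.delta) % SEQ_MOD


if __name__ == "__main__":
    D, S, length = 1000, 4_000_000_000, 500
    splice = Splice(switch_isn=D, server_isn=S)
    print(splice.server_seq_to_client(S + 1))           # D + 1
    print(splice.client_ack_to_server(D + length + 1))  # S + len + 1
```

Because the mapping is a constant add or subtract per packet, plus a checksum update not shown here, it is a natural fit for a fast-path packet engine.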
Partitioning the Workload
Latency on a Linux-based switch
Latency is reduced by TCP splicing
Latency using NP
[Figure: latency on the switch (ms, axis 0 to 20) versus request file size (KB: 1, 4, 16, 64, 256, 1024) for Linux Splicer and SpliceNP.]
Throughput
[Figure: throughput (Mbps, axis 0 to 800) versus request file size (KB: 1, 4, 16, 64, 256, 1024) for Linux Splicer and SpliceNP.]
NePSim: http://www.cs.ucr.edu/~yluo/nepsim/
Objectives
– Open-source
– Cycle-level accuracy
– Flexibility
– Integrated power model
– Fast simulation speed
Challenges
– Domain specific instruction set
– Porting network benchmarks
– Difficulty in debugging multithreaded programs
– Verification of the functionality and timing
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim, IEEE Micro Special Issue on NP, Sept/Oct 2004; Intel IXP Summit, Sept 2004.
Users from UCSD, Univ. of Arizona, Georgia Tech, Northwestern Univ., Tsinghua Univ.
From July 2004 to October 2006, NePSim had 3530 web page visits and 806 downloads.
NePSim Software Architecture
[Figure: NePSim software architecture. Components: Microengine (six), Memory (SRAM/SDRAM), Network Device, Debugger, Statistic, Verification.]
Power Model
H/W component | Model Type | Tool | Configurations
GPR per Microengine | Array | XCacti | 2 64-entry files, one read/write port per file
Control store, scratchpad | Cache w/o tag path | XCacti | 4KB, 4 bytes per block, direct mapped, 10-bit address
ALU, shifter | ALU and shifter | Wattch | 32-bit
… | … | … | …
Benchmarks
ipfwdr
– IPv4 forwarding (header validation, IP lookup)
– Medium SRAM access
nat
– Network address translation
– Medium SRAM access
url
– Examines payload for URL pattern
– Heavy SDRAM access
md4
– Compute a 128-bit message "signature"
– Heavy computation and SDRAM access
Verification of NePSim
[Figure: performance statistics produced by NePSim and by the IXP1200 for the same benchmarks are compared (?=). Both logs list cycle-stamped events such as "23990 inst.(pc=129) executed", "24008 sram req issued", "24009 …".]
Assertion Based Verification (Linear Temporal Logic/Logic Of Constraint)
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, F. Balarin, "Utilizing Formal Assertions for System Design of Network Processors," Design Automation and Test in Europe (DATE), 2004.
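The "?=" comparison of the two event logs can be pictured as a line-by-line trace diff. The sketch below is an illustration only: the function name `first_divergence` and the idea of comparing raw log lines are assumptions, and the cited work uses formal assertions (LTL/LOC) rather than this simple check. The sample lines reuse the events shown on the slide, plus one hypothetical mismatching line.

```python
def first_divergence(sim_trace, ref_trace):
    """Return the index of the first differing event, or None if the traces match."""
    for i, (sim_event, ref_event) in enumerate(zip(sim_trace, ref_trace)):
        if sim_event != ref_event:
            return i
    if len(sim_trace) != len(ref_trace):
        # One trace is a strict prefix of the other.
        return min(len(sim_trace), len(ref_trace))
    return None


if __name__ == "__main__":
    reference = ["23990 inst.(pc=129) executed", "24008 sram req issued"]
    simulated = ["23990 inst.(pc=129) executed", "24010 sram req issued"]  # hypothetical mismatch
    print(first_divergence(simulated, reference))
```

Because each line starts with a cycle count, the first mismatch points at both a functional difference and a timing difference.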
Performance-Power Trend
Power consumption increases faster than performance
[Figure: four panels (url, ipfwdr, md4, nat), each plotting Power and Performance.]
Dynamic Voltage Scaling
Reduce PE voltage and frequency when PE has idle time
[Figure labels: Voltage, Frequency.]
Power = C · α · V² · f
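The power formula explains why lowering voltage together with frequency pays off: voltage enters squared. Here C is the switched capacitance, α the activity factor, V the supply voltage, and f the clock frequency. The sketch below evaluates the formula. The function names and all numeric values are hypothetical, chosen only to show the arithmetic.

```python
def dynamic_power(c, alpha, v, f):
    """Dynamic power in watts: C * alpha * V^2 * f."""
    return c * alpha * v ** 2 * f


def relative_power(v_scale, f_scale):
    """Power after scaling V and f, as a fraction of the original power."""
    return v_scale ** 2 * f_scale


if __name__ == "__main__":
    # Hypothetical operating point: 1 nF, activity 0.5, 1.3 V, 600 MHz.
    print(dynamic_power(1e-9, 0.5, 1.3, 600e6))
    # Scaling both voltage and frequency to 80% cuts power to about 51%.
    print(relative_power(0.8, 0.8))
```

Frequency scaling alone reduces power only linearly, which is why DVS lowers both together when a PE has idle time.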
Power Reduction with DVS
[Figure: Power Reduction and Perf. Reduction for url, ipfwdr, md4, nat, and avg.]
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, NePSim: A Network Processor Simulator with Power Evaluation Framework, IEEE Micro Special Issue on Network Processors, Sept/Oct 2004
Power Saving by Clock Gating
Shut down unnecessary PEs, re-activate PEs when needed
Clock gating retains PE instructions
Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, Low Power Network Processor Design Using Clock Gating, IEEE/ACM Design Automation Conference (DAC), June 2005. Extended version to appear in ACM Trans. on Architecture and Code Optimization.
Challenges of Clock Gating PEs
Terminating threads safely
– Threads request memory resources
– Stopping unfinished threads results in resource leakage
Reschedule packets to avoid "orphan" ports
– Static thread-port mapping prohibits shutting down PEs
– Dynamically assign packets to any waiting threads
Avoid "extra" packet loss
– Burst packet arrival can overflow internal buffer
– Use a small extra buffer space to handle burst
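Two of these ideas can be sketched in a few lines: sizing the set of active PEs to the offered load, and handing packets to whichever thread is waiting instead of using a fixed thread-port mapping. This is an assumption-laden illustration, not the policy from the cited paper. The function names, the per-PE capacity parameter, and the rule of keeping at least one PE on are all hypothetical.

```python
import math
from collections import deque


def pes_needed(arrival_rate_pps, per_pe_rate_pps, total_pes, min_pes=1):
    """Number of PEs to keep clocked for a given packet arrival rate."""
    needed = math.ceil(arrival_rate_pps / per_pe_rate_pps)
    return max(min_pes, min(total_pes, needed))


def assign(packets, waiting_threads):
    """Give each packet to the next waiting thread, regardless of its input port."""
    waiting = deque(waiting_threads)
    assignments = []
    for packet in packets:
        if not waiting:
            break  # remaining packets stay in the buffer
        assignments.append((packet, waiting.popleft()))
    return assignments


if __name__ == "__main__":
    print(pes_needed(250_000, 100_000, total_pes=6))
    print(assign(["p0", "p1", "p2"], ["pe0.t1", "pe2.t0"]))
```

With dynamic assignment, no port is tied to a thread on a gated PE, so shutting a PE down leaves no "orphan" port.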
Experiment Results of Clock Gating
<4% reduction in system throughput
Main Contributions
Constructed an execution driven multiprocessor router simulation framework, proposed a set of benchmark applications and evaluated performance
Built NePSim, the first open-source network processor simulator, ported network benchmarks and conducted performance and power evaluation
Applied dynamic voltage scaling to reduce power consumption
Used clock gating to adapt number of active PEs according to real-time traffic