
Page 1: Collecting Data over the Network

IEEE Real Time 2007, Fermilab, 29 April – 4 May. R. Hughes-Jones, The University of Manchester

Using FPGAs to Generate Gigabit Ethernet Data Transfers

& The Network Performance of DAQ Protocols

Dave Bailey, Richard Hughes-Jones, Marc Kelly The University of Manchester

www.hep.man.ac.uk/~rich/ then “Talks”

Page 2: Collecting Data over the Network


Collecting Data over the Network

[Diagram: detector elements (e.g. calorimeter planks) → custom links → concentrators → Ethernet switches → processing nodes, one burst per node; the switch output link is the bottleneck queue.]

The aim is a general-purpose DAQ solution for CALICE, the CAlorimeter for the LInear Collider Experiment.

Take the ECAL as an example: at the end of the beam spill the planks send all their data to the concentrators.

The concentrators pack the data and send it to one processing node.

This is the classic bottleneck problem for the switch.

Page 3: Collecting Data over the Network


XpressFX Virtex-4 Network Test Board: the XpressFX development card from PLD Applications.

8-lane PCI-Express card, Xilinx Virtex-4 FX60 FPGA, DDR2 memory, 2 SFP cages (1 GigE), 2 HSSDC connectors.

Page 4: Collecting Data over the Network


Overview of the Firmware Design

The Virtex-4 FX60 has 16 RocketIO multi-gigabit transceivers, large internal memory, and 2 PowerPC CPUs.

Ethernet interface: embedded MAC plus RocketIO.

Packet buffers & logic: allow routing of the input and prioritising of the output.

Packet state machines: the packet generator state machines, plus a VHDL model of an HC11 CPU (Green Mountain Computer Systems) that controls the MAC and the state machines.

The PowerPCs are reserved for data processing.

Page 5: Collecting Data over the Network


The State Machine Blocks

Packet Generator: CSRs (set by the HC11) for packet length, packet count, inter-packet delay and destination address.

Request-Response RX State Machine: decode the request packet, verify the checksum (RFC 768), action memory writes, queue other requests into the FIFO.

TX State Machine: process the request from the FIFO, construct the reply, fragment it if needed, and append the checksum.

Packet Analyser: a further state machine block (shown in the block diagram).
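The RX state machine's checksum step follows RFC 768, the UDP specification: a 16-bit one's-complement sum over a pseudo-header plus the UDP header and data. A minimal C sketch of that calculation, for reference only (the firmware itself is VHDL, and passing the addresses as host-order IPv4 words is an assumption of this sketch):

    #include <stdint.h>
    #include <stddef.h>

    /* RFC 768 checksum over the UDP header + data in buf[0..len-1].
     * src_ip / dst_ip are the IPv4 addresses in host byte order. */
    uint16_t udp_checksum(const uint8_t *buf, size_t len,
                          uint32_t src_ip, uint32_t dst_ip)
    {
        uint32_t sum = 0;

        /* Pseudo-header: source IP, destination IP, protocol 17, UDP length */
        sum += (src_ip >> 16) + (src_ip & 0xFFFF);
        sum += (dst_ip >> 16) + (dst_ip & 0xFFFF);
        sum += 17;                      /* IPPROTO_UDP */
        sum += (uint32_t)len;

        /* Sum the header and payload as 16-bit big-endian words */
        while (len > 1) {
            sum += ((uint32_t)buf[0] << 8) | buf[1];
            buf += 2;
            len -= 2;
        }
        if (len)                        /* odd trailing byte, zero-padded */
            sum += (uint32_t)buf[0] << 8;

        while (sum >> 16)               /* fold the carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;          /* one's complement; a 0 result is sent as 0xFFFF */
    }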

Page 6: Collecting Data over the Network


The Receive State Machine

States: Idle, ReadHeader, ReadCmd, CheckCmd, DoCmd, WriteMem, FillFifo, EmptyPacket.

Flow: a packet in the queue takes Idle to ReadHeader; a correct packet type leads to ReadCmd (a wrong packet type goes to EmptyPacket); once all bytes are received the command is checked; a good command is executed in DoCmd, either as a memory write (WriteMem) or, if it is not a memory write, by filling the FIFO for the transmit side — the FIFO entry holds the address and the command; a bad command goes to EmptyPacket. When the write has finished, the FIFO has been written, or the end of the packet is reached, the machine returns to Idle.
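As a reading aid only, the same flow written as a next-state function in C; the real machine is VHDL inside the FPGA, and the exact self-loops and return arcs are an interpretation of the diagram:

    #include <stdbool.h>

    enum rx_state { IDLE, READ_HEADER, READ_CMD, CHECK_CMD, DO_CMD,
                    WRITE_MEM, FILL_FIFO, EMPTY_PACKET };

    /* One snapshot of the conditions the state diagram tests. */
    struct rx_inputs {
        bool packet_in_queue, correct_packet_type, all_bytes_received;
        bool good_cmd, is_memory_write, write_finished, fifo_written,
             end_of_packet;
    };

    enum rx_state rx_next(enum rx_state s, const struct rx_inputs *in)
    {
        switch (s) {
        case IDLE:         return in->packet_in_queue     ? READ_HEADER : IDLE;
        case READ_HEADER:  return in->correct_packet_type ? READ_CMD    : EMPTY_PACKET;
        case READ_CMD:     return in->all_bytes_received  ? CHECK_CMD   : READ_CMD;
        case CHECK_CMD:    return in->good_cmd            ? DO_CMD      : EMPTY_PACKET;
        case DO_CMD:       return in->is_memory_write     ? WRITE_MEM   : FILL_FIFO;
        case WRITE_MEM:    return in->write_finished      ? IDLE        : WRITE_MEM;
        case FILL_FIFO:    /* the FIFO entry holds the address and the cmd */
                           return in->fifo_written        ? IDLE        : FILL_FIFO;
        case EMPTY_PACKET: return in->end_of_packet       ? IDLE        : EMPTY_PACKET;
        }
        return IDLE;
    }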

Page 7: Collecting Data over the Network


The Transmit State Machine

States: Idle, SendHeader&cmd, CheckCmd, SendMemory, SendXsum, AllSent?, UpdateCounter, EndPkt.

Flow: a command in the FIFO takes Idle to SendHeader&cmd; once the header and command are sent, CheckCmd decides whether the command requires data. If it does, memory data is sent until the maximum packet size is reached or the byte count is done, followed by the checksum; if the command needs no data, the checksum is sent directly. After the checksum, AllSent? tests whether all bytes have been sent: if more data remains, UpdateCounter prepares the next fragment; otherwise EndPkt closes the packet and the machine returns to Idle.
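The "fragment if needed" path can be sketched as a loop that cuts a large reply into frame-sized pieces, one pass of SendHeader&cmd / SendMemory / SendXsum per fragment. The 1472-byte payload figure and the callback signature are illustrative assumptions, not the firmware's actual frame format:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PAYLOAD 1472   /* data per frame inside a 1500-byte MTU (assumed) */

    /* Hand the reply out one fragment at a time; 'emit' stands in for the
     * TX path (send header+cmd, the memory data, then the checksum). */
    size_t fragment_reply(const uint8_t *data, size_t total,
                          void (*emit)(unsigned frag_no, size_t offset,
                                       const uint8_t *frag, size_t len))
    {
        unsigned frag_no = 0;
        size_t   offset  = 0;

        while (offset < total) {                       /* "more data to send"             */
            size_t len = total - offset;
            if (len > MAX_PAYLOAD) len = MAX_PAYLOAD;  /* "max packet size or count done" */
            emit(frag_no++, offset, data + offset, len);
            offset += len;                             /* UpdateCounter                   */
        }
        return frag_no;                                /* "all bytes have been sent"      */
    }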

Page 8: Collecting Data over the Network


The Test Network

[Diagram: a requesting node and the FPGA concentrator connected through a Cisco 7609 (1 GE and 10 GE blades) to the responding nodes.]

Used for testing raw Ethernet frame generation by the FPGA.

Used for testing data collection with request-response protocols.
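As an illustration of what raw Ethernet frame generation looks like on the requesting PC, a hedged C sketch using a Linux AF_PACKET socket; the interface name, MAC addresses, EtherType and payload layout are placeholders, since the talk does not give the real frame format:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>

    int main(void)
    {
        /* SOCK_RAW: we build the whole Ethernet frame ourselves (needs root). */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned char dst[ETH_ALEN] = {0x00,0x11,0x22,0x33,0x44,0x55}; /* FPGA MAC, placeholder */
        unsigned char src[ETH_ALEN] = {0x00,0xaa,0xbb,0xcc,0xdd,0xee}; /* NIC MAC, placeholder  */

        struct sockaddr_ll addr = {0};
        addr.sll_family  = AF_PACKET;
        addr.sll_ifindex = if_nametoindex("eth1");     /* assumed interface name */
        addr.sll_halen   = ETH_ALEN;
        memcpy(addr.sll_addr, dst, ETH_ALEN);

        unsigned char frame[1514] = {0};               /* 14-byte header + 1500-byte payload */
        memcpy(frame, dst, ETH_ALEN);                  /* destination MAC */
        memcpy(frame + ETH_ALEN, src, ETH_ALEN);       /* source MAC      */
        frame[12] = 0x88; frame[13] = 0xb5;            /* EtherType: local experimental (assumed) */
        /* frame[14..] would carry the request: command, address, byte count */

        if (sendto(fd, frame, sizeof(frame), 0,
                   (struct sockaddr *)&addr, sizeof(addr)) < 0)
            perror("sendto");
        close(fd);
        return 0;
    }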

Page 9: Collecting Data over the Network


Request-Response Latency, 1 GE

Requests sent from a PC: Linux kernel 2.6.20-web100_pktd-plus, Intel e1000 NIC, interrupt coalescence OFF on the PC, MTU 1500 bytes. Response frames generated by the FPGA code.

Latency 19.7 µs and well behaved. Latency slope 0.018 µs/byte; back-to-back one expects 0.0182 µs/byte from the components: memory 0.0004, PCI-e 0.0018, 1 GigE 0.008, FPGA 0.008 µs/byte.

Smooth out to 35,000 bytes.
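Written out, the expected slope is simply the sum of the per-byte costs listed above, in good agreement with the fitted 0.0183 µs/byte:

    m_{B2B} = m_{mem} + m_{PCIe} + m_{GigE} + m_{FPGA}
            = 0.0004 + 0.0018 + 0.008 + 0.008 = 0.0182 \;\mu\mathrm{s/byte}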

[Plots (man2-fpga): latency (µs) vs message length (bytes). Fit y = 0.0183x + 19.719 for messages up to ~1400 bytes; fit y = 0.0085x + 30.578 for messages up to 35,000 bytes.]

Page 10: Collecting Data over the Network


FPGA → PC, ethCal_recv: Frame Jitter

Measured at 25 µs frame spacing and at 12 µs frame spacing (line speed). Peak separation 4-5 µs with no interrupt coalescence.

[Histograms (fpga-man2_21Apr07): N(t) vs frame spacing (µs) for the 12 µs and 25 µs requested spacings, on linear and log scales; the 12 µs log plot shows the packet loss. Also N(n) vs number of packets lost for both spacings.]

Page 11: Collecting Data over the Network


Test the Frame Spacing from the FPGA

Frames generated by the FPGA code; interrupt coalescence OFF on the PC; frame size 1472 bytes; 1M packets sent. Plot the mean observed frame spacing against the requested spacing.

There appears to be an offset of about -1 µs; the slope is close to 1, as expected.

Packet loss falls as the requested spacing is increased, i.e. as the packet rate drops. The packets are lost in the receiving host.

The effect is larger than for UDP/IP packets, whose losses are linked to scheduling.
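The receive-side measurement can be sketched in C as below; this is not the ethCal_recv program itself — it assumes an AF_PACKET capture with user-space timestamps (which add their own jitter on top of what the NIC delivers) and it bins every captured frame rather than selecting the FPGA's traffic:

    #include <stdio.h>
    #include <time.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>

    #define BINS 200                               /* 1 us bins, 0..199 us */

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket"); return 1; }

        unsigned long hist[BINS] = {0};
        unsigned char buf[2048];
        struct timespec prev = {0}, now;
        int have_prev = 0;

        for (long n = 0; n < 1000000; n++) {       /* 1M frames, as in the test */
            if (recv(fd, buf, sizeof(buf), 0) < 0) { perror("recv"); break; }
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (have_prev) {
                double us = (now.tv_sec  - prev.tv_sec)  * 1e6 +
                            (now.tv_nsec - prev.tv_nsec) * 1e-3;
                if (us >= 0 && us < BINS) hist[(int)us]++;
            }
            prev = now;
            have_prev = 1;
        }
        for (int i = 0; i < BINS; i++)             /* frame-spacing histogram N(t) */
            printf("%3d us  %lu\n", i, hist[i]);
        return 0;
    }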

[Plots (fpga-man2_21Apr07): mean frame spacing (µs) vs requested frame spacing (µs), fit y = 0.993x - 1.0726; number of frames lost vs requested frame spacing (µs), fit y = -93.531x + 2563.7.]

Page 12: Collecting Data over the Network


The Test Network

[Diagram: a requesting node and the FPGA concentrator connected through a Cisco 7609 (1 GE and 10 GE blades) to the responding nodes.]

Used for testing raw Ethernet frame generation by the FPGA.

Used for testing data collection with request-response protocols.

This time 10 GE hosts are used. But does 10 GE work on a PC?

Page 13: Collecting Data over the Network


10 GigE Back-to-Back: UDP Throughput

Motherboard: Supermicro X7DBE. Kernel: 2.6.20-web100_pktd-plus. NIC: Myricom 10G-PCIE-8A-R (fibre). rx-usecs = 25, coalescence ON.

MTU 9000 bytes; maximum throughput 9.4 Gbit/s. Notice the rate for the 8972-byte packets.

~0.002% packet loss in 10M packets, in the receiving host.

Sending host: 3 CPUs idle; for frame spacings below 8 µs, 1 CPU is >90% in kernel mode, including ~10% soft interrupts.

Receiving host: 3 CPUs idle; for frame spacings below 8 µs, 1 CPU is 70-80% in kernel mode, including ~15% soft interrupts.
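These throughput scans come from sending UDP packets at a requested inter-frame spacing. A hedged C sketch of such a paced sender follows; the destination address, port and the busy-wait pacing are assumptions for illustration, and whether the actual tool was udpmon-style is not stated in the talk:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    static double now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
    }

    int main(void)
    {
        const int    pkt_bytes  = 8972;      /* payload that just fills a 9000-byte MTU */
        const double spacing_us = 10.0;      /* requested inter-packet spacing          */
        const long   npkts      = 10000000;  /* 10M packets, as in the test             */

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(5001);                        /* assumed port     */
        inet_pton(AF_INET, "192.168.10.2", &dst.sin_addr);   /* assumed receiver */

        static char buf[9000];
        memset(buf, 0xA5, sizeof(buf));

        double t0 = now_us(), next = t0;
        for (long i = 0; i < npkts; i++) {
            while (now_us() < next) ;        /* busy-wait pacing to the requested spacing */
            sendto(fd, buf, pkt_bytes, 0, (struct sockaddr *)&dst, sizeof(dst));
            next += spacing_us;
        }
        printf("sent %ld packets of %d bytes in %.0f us\n",
               npkts, pkt_bytes, now_us() - t0);
        close(fd);
        return 0;
    }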

[Plots (gig6-5_myri10GE), for packet sizes 1000-8972 bytes: receive wire rate (Mbit/s) vs spacing between frames (µs); % CPU in kernel mode on the sender vs spacing; % CPU in kernel mode on the receiver vs spacing.]

Page 14: Collecting Data over the Network


Scaling of Request-Response Messages

Requests sent from the 10 GE system; interrupt coalescence OFF on the PC; frame size 1472 bytes; 1M packets sent. Each request asks for 10,000 bytes of data; the host does the fragment collection, like the IP layer.

Sequential requests: the time to receive all the responses scales with the round-trip time, as expected for sequential requests.

Grouped requests: the collection time increases by 24.6 µs per node; from the network alone one would expect 1 + 12.3 = 13.3 µs.
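A hedged C sketch of the host-side fragment collection for the grouped case: fire a request at every node, then slot each fragmented response into a per-node buffer by offset, much as the IP layer would. The fragment header and sizes are illustrative assumptions, not the protocol's actual format:

    #include <string.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NNODES      4
    #define REPLY_BYTES 10000          /* 10,000 bytes requested per node */

    struct reply {
        uint8_t data[REPLY_BYTES];
        size_t  received;              /* bytes collected so far          */
    };

    /* Assumed per-fragment header carried in each response frame. */
    struct frag_hdr { uint16_t node; uint16_t len; uint32_t offset; };

    static struct reply replies[NNODES];

    /* Returns true when this node's reply is complete. */
    bool handle_fragment(const struct frag_hdr *h, const uint8_t *payload)
    {
        if (h->node >= NNODES || (size_t)h->offset + h->len > REPLY_BYTES)
            return false;                               /* malformed fragment */
        memcpy(replies[h->node].data + h->offset, payload, h->len);
        replies[h->node].received += h->len;
        return replies[h->node].received >= REPLY_BYTES;
    }

    bool all_nodes_complete(void)
    {
        for (int n = 0; n < NNODES; n++)
            if (replies[n].received < REPLY_BYTES)
                return false;
        return true;
    }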

[Plot (gig5-fpga_29Apr07): request-response time (µs) vs number of nodes; fits y = 130.41x + 31.035 (sequential) and y = 24.615x + 126.58 (grouped). Timing diagrams of the sequential and grouped request patterns.]

Page 15: Collecting Data over the Network


Sequential Request-Response

Interrupt coalescence OFF on the PCs; MTU 1500 bytes; 10,000 packets sent.

The histograms are similar: a strong first peak, a second peak 5 µs later, and a small group ~25 µs later.

Ethernet occupancy for 1500 bytes: 1 GigE 12.3 µs, 10 GigE 1.2 µs.
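The quoted occupancies are just the serialization time of a full frame, preamble, headers, FCS and inter-frame gap included; a quick C check of the arithmetic:

    #include <stdio.h>

    /* Time on the wire for one frame: 8 preamble/SFD + 14 header + payload
     * + 4 FCS + 12 inter-frame-gap bytes, at the given link speed. */
    static double occupancy_us(double payload_bytes, double link_bit_per_s)
    {
        double wire_bytes = 8 + 14 + payload_bytes + 4 + 12;
        return wire_bytes * 8.0 / link_bit_per_s * 1e6;
    }

    int main(void)
    {
        printf("1500 B on 1 GigE : %.1f us\n", occupancy_us(1500, 1e9));   /* ~12.3 us */
        printf("1500 B on 10 GigE: %.2f us\n", occupancy_us(1500, 1e10));  /* ~1.23 us */
        return 0;
    }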

[Histograms (h2 n1, h3 n1, h4 n1, 10GE→fpga): N(t) vs latency (µs).]

Page 16: Collecting Data over the Network


Grouped Request-Response

Interrupt coalescence OFF on the PCs; MTU 1500 bytes; 10,000 packets sent.

The histograms are multi-modal: a second peak ~7 µs later and a small group ~25 µs later.

[Histograms (h2 n2, h3 n3, h4 n4, 10GE→fpga): N(t) vs latency (µs), 150-290 µs.]

Page 17: Collecting Data over the Network


Conclusions

Implemented the MAC and PHY layers inside a Xilinx Virtex-4 FPGA. The learning curve was steep; issues had to be overcome with the Xilinx "CoreGen" design and with clock generation & stability on the PCB.

The FPGA easily drives 1 Gigabit Ethernet at line rate. Packet dynamics on the wire are as expected. The loss of raw Ethernet frames in the end host is being investigated.

Request-response style data collection is promising. A simple network test system is being developed, with a planned upgrade to operate at 10 Gbit/s.

Work performed in collaboration with the ESLEA UK e-Science and EU EXPReS projects.

Page 18: Collecting Data over the Network


Any Questions?

Page 19: Collecting Data over the Network


10 GigE UDP Throughput vs Packet Size

Motherboard: Supermicro X7DBE. Linux kernel 2.6.20-web100_pktd-plus. Myricom NIC 10G-PCIE-8A-R (fibre), myri10ge driver v1.2.0 + firmware v1.4.10. rx-usecs = 0, coalescence ON, MSI = 1, checksums ON, tx_boundary = 4096.

Steps at 4060 and 8160 bytes, within 36 bytes of the 2^n boundaries.

Model the data transfer time as t = C + m*Bytes, where C includes the time to set up the transfers. The fit is reasonable: C = 1.67 µs, m = 5.4e-4 µs/byte. The steps are consistent with C increasing by 0.6 µs.

The Myricom driver segments the transfers, limiting the DMA to 4096 bytes; this is PCI-e chipset dependent!
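A short numerical sketch of that model in C, showing how a step in C appears each time the driver must split the DMA into another 4096-byte segment. Reading the per-byte cost as m = 5.4e-4 µs/byte assumes the exponent lost in the slide text was negative, which is the physically plausible value:

    #include <stdio.h>

    int main(void)
    {
        const double C0      = 1.67;    /* us, base setup time                     */
        const double C_step  = 0.60;    /* us, extra setup per additional segment  */
        const double m       = 5.4e-4;  /* us per byte (negative exponent assumed) */
        const int    dma_max = 4096;    /* bytes per DMA segment (tx_boundary)     */

        for (int bytes = 1000; bytes <= 9000; bytes += 1000) {
            int    extra = (bytes - 1) / dma_max;        /* segments beyond the first */
            double t_us  = C0 + C_step * extra + m * bytes;
            printf("%5d bytes  %d extra segment(s)  t = %5.2f us\n",
                   bytes, extra, t_us);
        }
        return 0;
    }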

[Plot (gig6-5_myri_udpscan): receive wire rate (Mbit/s) vs size of user data in the packet (bytes), 0-10,000 bytes.]