FPGA: From Flashing LED to Reconfigurable Computing

Wu, Jinyuan, Fermilab, jywu168@fnal.gov
IIT, Mar. 2009


Page 1: FPGA: From Flashing LED to Reconfigurable Computing

FPGA: From Flashing LED to Reconfigurable Computing

Wu, Jinyuan
Fermilab
IIT, Mar. 2009

Page 2: FPGA: From Flashing LED to Reconfigurable Computing

Outline

Electronic Aspect of FPGA:
• LED Flashing
• Logic Elements in a Nutshell
• TDC and ADC

FPGA as a Computing Fabric:
• Moore’s Law Forever?
• Space Charge Computing with FPGA Cores
• Doublet Matching & Hash Sorter
• Triplet Matching & Tiny Triplet Finder
• Enclosed Loop Micro-Sequencer (ELMS)

Page 3: FPGA: From Flashing LED to Reconfigurable Computing

Flashing LED, The First Thing First

[Figure: 24-bit counter, Q[23..0]]

Design at least one LED into every FPGA board.
When a board is first powered up, test the LED flashing function first.
Many things have to be right before the LED flashes:
• All power pins must be connected.
• Configuration devices must be in the correct mode.
• The design software must be correct.
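As a rough illustration of the flashing logic, the Python sketch below models a free-running binary counter and reads its most significant bit as the LED signal. The 50 MHz clock is a hypothetical value, not taken from the slides, and the function names are made up for this example.

```python
def blink_frequency_hz(clock_hz: float, bit_index: int) -> float:
    """Toggle frequency of bit `bit_index` of a free-running binary counter."""
    # Bit k completes one full on/off cycle every 2**(k+1) clock ticks.
    return clock_hz / 2 ** (bit_index + 1)


def simulate_msb(width_bits: int, ticks: int) -> list[int]:
    """Step a `width_bits`-wide counter and record its most significant bit."""
    mask = (1 << width_bits) - 1
    q = 0
    trace = []
    for _ in range(ticks):
        trace.append((q >> (width_bits - 1)) & 1)
        q = (q + 1) & mask  # wraps around like a hardware counter
    return trace


if __name__ == "__main__":
    # Assumed 50 MHz oscillator; Q[23] is the MSB of the 24-bit counter.
    print(f"{blink_frequency_hz(50e6, 23):.2f} Hz")
    print(simulate_msb(3, 16))
```

With the assumed clock, Q[23] blinks at about 3 Hz, which is slow enough to see by eye.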

Page 4: FPGA: From Flashing LED to Reconfigurable Computing

LED Brightness Variation

[Figure: two comparator circuits (inputs A and B, output A<B) driven by Counter Q[23..0]; one of them has a LUT at an input]

The LED brightness is varied by changing the duty cycle of the output pulse.
Comparator input A is the brightness and B is the clock cycle count.
A look-up table can be added at input A to obtain a different brightness variation curve.

Page 5: FPGA: From Flashing LED to Reconfigurable Computing

Duty-Cycle Based Single-Pin DAC (1)

The duty cycle, or pulse width, of the comparator output is proportional to the DAC input at port A.
Use an external RC network as a low-pass filter.
The output voltage of an ideal low-pass filter is proportional to the DAC input.

[Figure: counter Q on comparator input B, DAC input on input A, comparator output A>B; output waveform (levels 0 to 4) plotted for counts 896 to 1024]
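The comparator-based DAC can be sketched in Python as below. This is a behavioral model only: the 10-bit counter width is inferred from the 1024-count axis of the plot, and the filter coefficient `alpha` is a hypothetical stand-in for the external RC network.

```python
def pwm_period(dac_input: int, width_bits: int = 10) -> list[int]:
    """One counter period of the comparator output (A > B)."""
    # A = DAC input, B = free-running counter value.
    return [1 if dac_input > count else 0 for count in range(1 << width_bits)]


def duty_cycle(dac_input: int, width_bits: int = 10) -> float:
    """Fraction of the period during which the output is high."""
    wave = pwm_period(dac_input, width_bits)
    return sum(wave) / len(wave)


def rc_filter(wave: list[int], alpha: float, periods: int) -> float:
    """First-order low-pass (discrete RC model) applied to the repeated waveform."""
    v = 0.0
    for _ in range(periods):
        for bit in wave:
            v += alpha * (bit - v)  # v moves toward the current output level
    return v


if __name__ == "__main__":
    print(duty_cycle(256))                           # quarter-scale input
    print(rc_filter(pwm_period(256), 1e-4, 100))     # settles near 0.25 with ripple
```

The filtered value only approximates the duty cycle because a real low-pass filter leaves some ripple at the counter period.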

Page 6: FPGA: From Flashing LED to Reconfigurable Computing

LED Brightness Exponential Drop

[Figure: counter with output Q and carry-out CO, a register (D, Q, SET) and a comparator with inputs A and B and output A<B]

if (CO==1) {Q = Q - Q/32;}

Narrow pulses are typically stretched for LED display at fixed brightness.
The circuit here dims the LED gradually for a better visual effect.

Page 7: FPGA: From Flashing LED to Reconfigurable Computing

Exponential Sequence Generator

[Figure: register (D, Q, SET) used as an accumulator]

if (CO==1) {Q = Q - Q/32;}

[Plot: generated sequence; vertical axis 0 to 70000, horizontal axis 0 to 160]

An exponential sequence is generated using the accumulator shown above.
Note that not even one multiplier is used.
Other function sequences (sine, cosine, tangent, cotangent, etc.) can also be generated similarly.

Possible Student Lab
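The update rule Q = Q - Q/32 can be tried out in a few lines of Python. The sketch below replaces the division by a right shift, as hardware would; the starting value 65535 is an assumption based on the plot range.

```python
def exponential_sequence(q0: int = 65535, shift: int = 5, steps: int = 160) -> list[int]:
    """Iterate Q = Q - Q/2**shift using only a subtraction and a shift."""
    seq = [q0]
    q = q0
    for _ in range(steps):
        q -= q >> shift  # Q/32 for shift = 5; no multiplier needed
        seq.append(q)
    return seq


if __name__ == "__main__":
    seq = exponential_sequence()
    # Each step scales Q by roughly 31/32, so Q decays like (31/32)**n.
    print(seq[:4], seq[32], seq[160])
```

Because the shift truncates, the sequence stops decreasing once Q falls below 32.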

Page 8: FPGA: From Flashing LED to Reconfigurable Computing

Duty-Cycle Based Single-Pin DAC (2)

Use the carry-out of the accumulator as the output.
The number of pulses is proportional to the DAC input.
The rounding error is carried to later cycles.
The output is smoother.

[Figure: accumulator (D, Q) with the DAC input and carry-out CO; output waveform (levels 0 to 4) plotted for counts 896 to 1024]

Possible Student Lab
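A minimal Python model of the accumulator-based DAC follows. The 10-bit accumulator width is an assumption chosen to match the 1024-count axis; the function name is illustrative.

```python
def accumulator_dac(dac_input: int, width_bits: int = 10, ticks: int = 1024) -> list[int]:
    """Add the DAC input every clock tick and emit the carry-out bit."""
    modulus = 1 << width_bits
    if not 0 <= dac_input < modulus:
        raise ValueError("dac_input must fit in the accumulator width")
    acc = 0
    out = []
    for _ in range(ticks):
        acc += dac_input
        out.append(acc // modulus)  # carry-out: 1 when the sum overflows
        acc %= modulus              # remainder (rounding error) is carried forward
    return out


if __name__ == "__main__":
    pulses = accumulator_dac(256)
    print(sum(pulses), pulses[:8])
```

Compared with the comparator version, the same number of high ticks is spread evenly over the period, which pushes the ripple to higher frequencies.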

Page 9: FPGA: From Flashing LED to Reconfigurable Computing


Page 10: FPGA: From Flashing LED to Reconfigurable Computing

Logic Elements

Normal Mode: LUT4 (16 RAM cells) + DFF, with inputs A, B, C, D.
Arithmetic Mode: 2 x LUT3 (8 cells each) + DFF, with inputs A, B and carry-in CI, and carry-out CO.
Each DFF has D, Q, ENA and CLRN ports.

LUT = Look-Up Table

Page 11: FPGA: From Flashing LED to Reconfigurable Computing

What Can Be Done With a Lookup Table

[Figure: a LUT with inputs A, B, C, D implements “any” 4-input function]
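To make the "any 4-input function" idea concrete, the Python sketch below stores a truth table as a 16-bit integer and reads it back by address. The bit ordering (A as the least significant address bit) is an arbitrary choice for this example.

```python
def make_lut4(func) -> int:
    """Pack the truth table of a 4-input Boolean function into 16 bits."""
    table = 0
    for address in range(16):
        a = address & 1
        b = (address >> 1) & 1
        c = (address >> 2) & 1
        d = (address >> 3) & 1
        if func(a, b, c, d):
            table |= 1 << address
    return table


def lut4_read(table: int, a: int, b: int, c: int, d: int) -> int:
    """Look up the output bit: the four inputs simply form a RAM address."""
    address = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> address) & 1


if __name__ == "__main__":
    nand4 = make_lut4(lambda a, b, c, d: not (a and b and c and d))
    nor4 = make_lut4(lambda a, b, c, d: not (a or b or c or d))
    print(hex(nand4), hex(nor4), lut4_read(nand4, 1, 1, 1, 1))
```

Any of the 65536 possible 4-input functions differs only in the 16 stored bits, not in the circuit.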

Page 12: FPGA: From Flashing LED to Reconfigurable Computing

Xilinx Look-Up Table

[Figure: logic cell with a 16-bit cell and a DFF (D, Q, ENA, CLRN)]

The 16-bit cell can be configured as:
LUT4: 4-input look-up table
SRL16: 16-bit shift register
RAM16: 16-bit distributed RAM

Page 13: FPGA: From Flashing LED to Reconfigurable Computing

Pipeline Structure

[Figure: LUT4 (16 RAM cells) blocks, each followed by a DFF (D, Q, ENA, CLRN), connected in stages]

Logic cells are usually designed in pipeline structures.

Page 14: FPGA: From Flashing LED to Reconfigurable Computing

Logic Element as a Full Adder Bit

[Figure: two logic elements in arithmetic mode, each with two LUT3 (8 cells) and a DFF (D, Q, ENA, CLRN); inputs A, B and carry-in CI, carry-out CO]

A logic cell resembles a full adder bit.
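The following Python sketch shows how two 8-entry tables can act as one full adder bit, and how chaining such bits gives a ripple-carry adder. The table constants and address ordering are choices made for this illustration, not values from the slides.

```python
# Address = a | (b << 1) | (ci << 2); bit `address` of each constant is the output.
SUM_LUT = 0b10010110    # a XOR b XOR ci
CARRY_LUT = 0b11101000  # majority(a, b, ci)


def full_adder_bit(a: int, b: int, ci: int) -> tuple[int, int]:
    """One logic element in arithmetic mode: two LUT3 reads give sum and carry."""
    address = a | (b << 1) | (ci << 2)
    return (SUM_LUT >> address) & 1, (CARRY_LUT >> address) & 1


def ripple_add(x: int, y: int, width: int) -> tuple[int, int]:
    """Chain `width` adder bits; returns (sum modulo 2**width, final carry-out)."""
    carry = 0
    total = 0
    for bit in range(width):
        s, carry = full_adder_bit((x >> bit) & 1, (y >> bit) & 1, carry)
        total |= s << bit
    return total, carry


if __name__ == "__main__":
    print(ripple_add(9, 8, 4))
```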

Page 15: FPGA: From Flashing LED to Reconfigurable Computing

Myths on FPGA

We commonly hear about FPGA:
• FPGA is cheap.
• FPGA is fast.
• FPGA is large.
• FPGA can do anything.

Not really; at least it is not always the case. The reality is:
• FPGA is ultra-flexible.
• As the cost of the flexibility, the transistor usage in FPGA is NOT efficient.
• Good design tricks are needed.

Page 16: FPGA: From Flashing LED to Reconfigurable Computing

4-Input NAND, 4-Input NOR, 4-Input NAOR

[Figure: transistor-level schematics of the three gates, each with inputs A, B, C, D and output Y]

8 transistors each, in ASIC.

Page 17: FPGA: From Flashing LED to Reconfigurable Computing

Transistor Usage of Logic Element

[Figure: logic element with a 16-bit LUT and a DFF (D, Q, ENA, CLRN)]

6-transistor RAM bit x 16: at least 96 transistors, in FPGA.

Page 18: FPGA: From Flashing LED to Reconfigurable Computing

The Mirror Adder (Weste93)

[Figure: transistor-level schematic with inputs A, B, Ci and outputs Cob, Sb]

24-28 transistors, in ASIC.

Page 19: FPGA: From Flashing LED to Reconfigurable Computing

Full Adder

[Figure: logic element with two 8-bit LUTs and a DFF (D, Q, ENA, CLRN) used as a full adder; inputs A, B, CI and outputs S, CO]

At least 96 transistors, in FPGA.

Page 20: FPGA: From Flashing LED to Reconfigurable Computing

Other FPGA Resources

Other resources are available in FPGA devices:
• RAM blocks
• Multipliers
• Serial data receivers, PowerPC, etc.

[Figure: device layout showing multipliers, RAM blocks and a group of 16 logic elements]

Page 21: FPGA: From Flashing LED to Reconfigurable Computing


Page 22: FPGA: From Flashing LED to Reconfigurable Computing

TDC Using FPGA Logic Chain Delay

This scheme uses current FPGA technology.
A low-cost chip family can be used (e.g., EP2C8T144C6, $31.68).
Fine TDC precision can be implemented in slow devices (e.g., 20 ps in a 400 MHz chip).

[Figure: delay chain with inputs IN and CLK]

Page 23: FPGA: From Flashing LED to Reconfigurable Computing

Two Major Issues In a Free Operating FPGA

[Plot: bin width (ps), 0 to 180, versus bin, 0 to 64]

1. The bin widths differ and vary with supply voltage and temperature.
2. Some bins are ultra-wide due to LAB boundary crossing.

Page 24: FPGA: From Flashing LED to Reconfigurable Computing

Auto Calibration Using Histogram Method

It provides a bin-by-bin calibration at a certain temperature.
It is a turn-key solution (bin in, ps out).
It is semi-continuous (the LUT is updated automatically every 16K events).

[Figure: 16K events fill a DNL histogram, which loads a LUT with input in bins and output in ps]
[Plot: time (ps), 0 to 2500, versus bin, 0 to 64]
[Plot: bin width (ps), 0 to 180, versus bin, 0 to 64]
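One common way to realize such a histogram calibration is the code-density method, sketched below in Python. It assumes that hits arrive uniformly in time relative to the clock and that the bins together span exactly one clock period; the function name and the choice of bin centers as the LUT output are assumptions of this sketch, not details from the slides.

```python
def build_calibration_lut(histogram: list[int], clock_period_ps: float) -> list[float]:
    """Convert hit counts per TDC bin into a bin-number-to-time table.

    With uniformly distributed hits, the count in a bin is proportional
    to the bin width, so width_i = period * count_i / total.
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("histogram is empty")
    lut = []
    edge = 0.0  # time of the leading edge of the current bin
    for count in histogram:
        width = clock_period_ps * count / total
        lut.append(edge + width / 2)  # report the bin center
        edge += width
    return lut


if __name__ == "__main__":
    # Toy histogram: the third bin is twice as wide as the first two.
    print(build_calibration_lut([1, 1, 2], 400.0))
```

In a running system the histogram would be refilled and the table rebuilt periodically, for example after every 16K events as the slide describes.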

Page 25: FPGA: From Flashing LED to Reconfigurable Computing

The Test Module

Two NIM inputs
FPGA with 8-channel TDC
Data output via Ethernet
BNC adapter to add delay in 150 ps steps

Page 26: FPGA: From Flashing LED to Reconfigurable Computing

Test Result

[Figure: NIM inputs pass through a LeCroy 429A NIM fan-out and two NIM/LVDS converters into Wave Union TDC B channels]
[Plot: peaks labeled 0, 1 and 2, spaced by 140 ps, RMS 10 ps]

BNC adapters are used to add delays in 140 ps steps.
As good as an ASIC TDC.

Page 27: FPGA: From Flashing LED to Reconfigurable Computing

Multi-Sampling TDC FPGA

[Figure: multiple sampling with clock phases c0, c90, c180 and c270 (Q0 to Q3), clock domain changing (QD, QE, QF), transition detection & encode (DV, T0, T1) and coarse time counter (TS)]

Ultra low-cost: 48 channels in a $18.27 EP2C5Q208C7.
Sampling rate: 360 MHz x 4 phases = 1.44 GHz.
LSB = 0.69 ns.

[Figure: placement of 4 channels in a Cyclone FPGA]
Logic elements with non-critical timing are freely placed by the fitter of the compiler.
This picture represents a placement in a Cyclone FPGA.

Page 28: FPGA: From Flashing LED to Reconfigurable Computing

ADC Using FPGA

[Figure: AMP & Shaper channels feeding ADCs and an FPGA, compared with AMP & Shaper channels feeding TDC blocks inside an FPGA; a network of R1, R1, R2 and C generates VREF]
[Figure: input voltages V1 to V4 and the corresponding transition times T1 to T4]

Analog signals from the AMP & Shapers are fed directly to FPGA pins.
FPGA outputs and a passive RC network are used to generate the ramping reference voltage VREF.
The input voltages and VREF are compared using FPGA differential input receivers.
The transition times, which represent the input voltage values, are digitized by TDC blocks in the FPGA.
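The time-to-voltage conversion can be sketched as follows in Python. The sketch assumes the reference ramp is a single-pole RC charging curve with known start voltage, end voltage and time constant; all numeric values are hypothetical, and a real design would calibrate the ramp shape rather than assume it.

```python
import math


def vref_at(t_ns: float, v_start: float, v_end: float, tau_ns: float) -> float:
    """Reference voltage of an assumed RC ramp at time t after the ramp starts."""
    return v_end + (v_start - v_end) * math.exp(-t_ns / tau_ns)


def crossing_time_ns(v_in: float, v_start: float, v_end: float, tau_ns: float) -> float:
    """Time at which the ramp reaches v_in; this is what the TDC would measure."""
    return -tau_ns * math.log((v_in - v_end) / (v_start - v_end))


if __name__ == "__main__":
    # Hypothetical ramp from 1.0 V toward 2.5 V with a 500 ns time constant.
    t = crossing_time_ns(1.8, 1.0, 2.5, 500.0)
    # Converting the measured time back through the ramp model recovers the input.
    print(t, vref_at(t, 1.0, 2.5, 500.0))
```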

Page 29: FPGA: From Flashing LED to Reconfigurable Computing

ADC Test: Waveform Digitization on BD3_19

[Figure: FPGA with two TDC inputs and a VREF network with components labeled 50, 50, 100 and 1000pF]
[Plot: raw data, 0 to 64 versus 0 to 256, for the leading ramp and the trailing ramp]
[Plot: converted waveform, V (1 to 2.5) versus t (2500 to 5500 ns), for the leading ramp and the trailing ramp]
[Picture: input waveform, overlap trigger & reference voltage]

Possible Student Lab

Page 30: FPGA: From Flashing LED to Reconfigurable Computing


Page 31: FPGA: From Flashing LED to Reconfigurable Computing

Moore’s Law

Number of transistors in a package: x2 / 18 months

[Figure taken from www.intel.com]

Page 32: FPGA: From Flashing LED to Reconfigurable Computing

Status of Moore’s Law: an Inconvenient Truth

Number of transistors: yes, via multi-core.
Clock speed: ?

[Figure taken from www.intel.com]

Page 33: FPGA: From Flashing LED to Reconfigurable Computing

The Fever of Moore’s Law vs. Maxwell’s Equations

$$\nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t}, \qquad \nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}, \qquad \nabla \cdot \mathbf{D} = \rho, \qquad \nabla \cdot \mathbf{B} = 0$$

[Plot: Op/sec versus year, 1998 to 2010 (MIT, 2002), with a region marked WRW]

During the hot days of Moore’s Law, the rules of thumb were:
• BRB: Buy Rather than Build
• URU: Use Rather than Understand
• WRW: Wait Rather than Work

From fundamental principles such as Maxwell’s Equations, it is known that limits to Moore’s Law exist.
Technology advances come from hard work.

Page 34: FPGA: From Flashing LED to Reconfigurable Computing

The Execution & Non-Execution Cycles

In current microprocessors:
• Each instruction takes one clock cycle to execute.
• It takes many clock cycles to prepare for executing an instruction.
• Pipelined? Yes. But the non-execution pipeline stages consume silicon area, power, etc.
• To execute an instruction != to do useful calculation.

Can we do something different?

[Figure from MIT 6.823 Open Course Site]

Page 35: FPGA: From Flashing LED to Reconfigurable Computing


Page 36: FPGA: From Flashing LED to Reconfigurable Computing

The Space Charge Computing

$$\mathbf{F}_j = \sum_{i=1,\, i \neq j}^{n} \frac{q_i q_j}{4\pi\varepsilon_0} \, \frac{\mathbf{r}_j - \mathbf{r}_i}{\left|\mathbf{r}_j - \mathbf{r}_i\right|^{3}}$$

Each electron sees the sum of the Coulomb forces from the other N-1 electrons.
The total number of calculations is about N^2, and each Coulomb force calculation requires a square root, a division and several multiplications.
Regular sequential computers are not fast enough.

Number of Electrons | Number of Calculations/Iteration | Computing Time/1000 Iterations @ 10^7 Calculations/s
10^3 | ~10^6 | 100 s
10^4 | ~10^8 | 2.8 hours
10^5 | ~10^10 | 11.6 days
10^6 | ~10^12 | 3.2 years
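The computing times in the table follow from simple arithmetic, reproduced in the Python sketch below. It takes the number of pairwise calculations per iteration as N^2, as the slide does, and uses the slide's rate of 10^7 calculations per second.

```python
def pairwise_time_seconds(n_electrons: int,
                          iterations: int = 1000,
                          calcs_per_second: float = 1e7) -> float:
    """Time for `iterations` passes of roughly N**2 force calculations each."""
    return n_electrons ** 2 * iterations / calcs_per_second


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        t = pairwise_time_seconds(n)
        print(f"N = {n:>7}: {t:.0f} s = {t / 3600:.2f} h "
              f"= {t / 86400:.2f} d = {t / (365 * 86400):.2f} y")
```

The quadratic growth means that every factor of 10 in particle count costs a factor of 100 in time.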

Page 37: FPGA: From Flashing LED to Reconfigurable Computing

The FPGA Board

Up to 16 FPGA devices ($32 each) can be installed on each board.
Each FPGA hosts one core.

Page 38: FPGA: From Flashing LED to Reconfigurable Computing

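The 16-bit Demo Core

$$\mathbf{F}_j = \sum_{i=1,\, i \neq j}^{n} \frac{q_i q_j}{4\pi\varepsilon_0} \, \frac{\mathbf{r}_j - \mathbf{r}_i}{\left|\mathbf{r}_j - \mathbf{r}_i\right|^{3}}$$

$$\mathbf{v}_j = \mathbf{v}_{j0} + \mathbf{a}_j \, dt$$

$$\mathbf{x}_k = \mathbf{x}_{k0} + \mathbf{v}_k \, dt$$

[Figure: data path of the core. The 16-bit coordinates (xi, yi, zi) and (xj, yj, zj) are subtracted, the differences are squared (x2) and summed, a LUT (10 bits in, 16 bits out) converts the sum, multipliers (X) form the 32-bit forces, and adders accumulate the 16-bit velocities vxj, vyj, vzj]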

Page 39: FPGA: From Flashing LED to Reconfigurable Computing

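The Lookup Table

$$\mathbf{F}_j = \sum_{i=1,\, i \neq j}^{n} \frac{q_i q_j}{4\pi\varepsilon_0} \, \frac{\mathbf{r}_j - \mathbf{r}_i}{\left|\mathbf{r}_j - \mathbf{r}_i\right|^{3}}$$

[Figure: the three squared differences (x2) are summed and fed to the LUT (10 bits in, 16 bits out)]
[Plot: LUT content $y = 1/x^{3/2}$; y from 0 to 32768, x from 0 to 1024]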
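A look-up table for $1/x^{3/2}$ removes the square root and the division from the force calculation. The Python sketch below is a simplified integer model of that idea: the scale factor, the saturation at x = 0 and the `r2_shift` used to fit the squared distance into 10 address bits are all assumptions of this example, not parameters of the actual core.

```python
LUT_BITS_IN = 10
LUT_MAX_OUT = (1 << 16) - 1


def build_inv_r3_lut(scale: float) -> list[int]:
    """Table of scale / x**1.5 for x = 0..1023, clipped to 16 bits."""
    lut = [LUT_MAX_OUT]  # x = 0 has no finite value; saturate instead
    for x in range(1, 1 << LUT_BITS_IN):
        lut.append(min(LUT_MAX_OUT, round(scale / x ** 1.5)))
    return lut


def pair_force(p_i, p_j, lut: list[int], r2_shift: int = 0) -> tuple[int, int, int]:
    """Unnormalized force on particle j from particle i (like charges repel)."""
    dx, dy, dz = (p_j[k] - p_i[k] for k in range(3))
    r2 = dx * dx + dy * dy + dz * dz           # three squarers and an adder tree
    address = min(r2 >> r2_shift, (1 << LUT_BITS_IN) - 1)
    weight = lut[address]                       # approximates 1 / r**3
    return dx * weight, dy * weight, dz * weight


if __name__ == "__main__":
    lut = build_inv_r3_lut(32768.0)
    print(lut[1], lut[4], pair_force((0, 0, 0), (2, 0, 0), lut))
```

Only subtractions, multiplications and one memory read remain, which suits a pipelined data path that accepts one particle pair per clock.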

Page 40: FPGA: From Flashing LED to Reconfigurable Computing

Two Electrons with Natural Scales

[Plot: distance (m), 0.0000158 to 0.0000178, versus steps, 0 to 25; curves: Calculated x2, Calculated x1, Actual x2, Actual x1; scale marks: 256 nm and 28 ps]

Page 41: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 0

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 42: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 5

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 43: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 10

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 44: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 15

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 45: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 20

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 46: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 25

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 47: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 30

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 48: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 35

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 49: FPGA: From Flashing LED to Reconfigurable Computing

256 Charged Particles, Iteration 40

[Plot: particle positions in X''' and Y'''; axis ranges 5000 to 40000 and 10000 to 35000]

Page 50: FPGA: From Flashing LED to Reconfigurable Computing

Speed Comparison with Regular CPU

The FPGA core is 10 times faster than a typical 2.2 GHz CPU core.
The FPGA core runs at 200 MHz, i.e., 200 M Coulomb force calculations/s.
It seems the CPU core needs 80-100 clock cycles for each Coulomb force calculation.

[Plot: time (s), 0 to 350, versus number of particles, 0 to 90000; CPU: 2.2 GHz Intel Core 2 Duo; FPGA: EP2C8T144C6, 200 MHz]

Page 51: FPGA: From Flashing LED to Reconfigurable Computing

One Board: 8 FPGA Cores

One board has the calculation capacity of 40 dual-core CPUs.
The power consumption of one board is < 4.5 W.
Newer FPGAs capable of hosting 4 cores/FPGA are available.

One core/FPGA = 5 dual-core CPUs
8 cores/board = 40 dual-core CPUs

Page 52: FPGA: From Flashing LED to Reconfigurable Computing


Page 53: FPGA: From Flashing LED to Reconfigurable Computing

Example of Doublet Match, PET

Positrons and electrons annihilate to produce pairs of photons.
The back-to-back photons hit the detector at nearly the same time.
Detector hits are digitized, and hits at nearly the same time are to be matched together.
The process takes O(n^2) clock cycles.

[Figure: time T and data D of the hits in Group 1 and Group 2; the time difference is tested with T<A? and T>(-A)?]
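The straightforward matching procedure, which compares every hit in one group with every hit in the other, might look like this in Python. The `(time, data)` tuple layout and the function name are assumptions made for this sketch.

```python
def match_doublets(group1, group2, window):
    """Pair hits whose time difference lies inside (-window, +window).

    Each hit is a (time, data) tuple. Every Group 1 hit is compared with
    every Group 2 hit, so the work grows as n**2.
    """
    pairs = []
    for t1, d1 in group1:
        for t2, d2 in group2:
            diff = t1 - t2
            if -window < diff < window:  # the two tests T < A and T > -A
                pairs.append((d1, d2))
    return pairs


if __name__ == "__main__":
    g1 = [(100, "a"), (250, "b"), (400, "c")]
    g2 = [(402, "x"), (99, "y"), (700, "z")]
    print(match_doublets(g1, g2, window=5))
```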

Page 54: FPGA: From Flashing LED to Reconfigurable Computing

Hash Sorter

Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.
Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.

[Figure: key K and data D of Group 1 and Group 2 and the hash sorter bins]

Page 55: FPGA: From Flashing LED to Reconfigurable Computing

Link List Structure of Hash Sorter

[Figure: Index RAM addressed by key K, Pointer RAM and DATA RAM, with data input DIN and data output DOUT]
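The slide names three memories but the transcript does not preserve how they are wired. The Python sketch below shows one plausible linked-list organization: the Index RAM holds the most recent address for each key, and the Pointer RAM links each entry to the previous entry with the same key. The class, its method names and the `-1` null marker are hypothetical.

```python
class HashSorter:
    """Linked-list hash sorter modeled with three arrays standing in for RAMs."""

    EMPTY = -1  # assumed null pointer value

    def __init__(self, n_keys: int, depth: int):
        self.index_ram = [self.EMPTY] * n_keys   # key -> newest entry address
        self.pointer_ram = [self.EMPTY] * depth  # entry -> previous entry, same key
        self.data_ram = [None] * depth           # entry -> stored data
        self.write_addr = 0

    def store(self, key: int, data) -> None:
        """Pass 1: one write per item, regardless of how full the bin is."""
        addr = self.write_addr
        self.data_ram[addr] = data
        self.pointer_ram[addr] = self.index_ram[key]
        self.index_ram[key] = addr
        self.write_addr += 1

    def fetch(self, key: int) -> list:
        """Pass 2: walk the chain of entries stored under `key`."""
        out = []
        addr = self.index_ram[key]
        while addr != self.EMPTY:
            out.append(self.data_ram[addr])
            addr = self.pointer_ram[addr]
        return out


if __name__ == "__main__":
    sorter = HashSorter(n_keys=8, depth=16)
    for key, data in [(3, "a"), (5, "b"), (3, "c")]:
        sorter.store(key, data)
    print(sorter.fetch(3), sorter.fetch(5), sorter.fetch(0))
```

A linked list lets bins grow to any size within the shared Data RAM, so no storage is reserved for bins that stay empty.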

Page 56: FPGA: From Flashing LED to Reconfigurable Computing

Hash Sorter

[Figure: hash sorter bins indexed by key K]

Using the hash sorter, matching pairs can be grouped together in 2n, rather than n^2, clock cycles.

Page 57: FPGA: From Flashing LED to Reconfigurable Computing


Page 58: FPGA: From Flashing LED to Reconfigurable Computing


Hits, Hit Data & Triplets

• Hit data come out of the detector planes in random order.

• Hit data from 3 planes generated by same particle tracks are organized together to form triplets.

Page 59: FPGA: From Flashing LED to Reconfigurable Computing

Triplet Finding

[Figure: hits on Plane A, Plane B and Plane C]

• Three data items must satisfy the condition: xA + xC = 2 xB.
• A total of n^3 combinations must be checked (e.g., 5x5x5 = 125).
• Three layers of loops are needed if the process is implemented in software.
• Large silicon resources may be needed without careful planning: O(N^2)
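The three-layer software loop mentioned above can be written directly in Python. The hit lists and the function name are illustrative only.

```python
def find_triplets_brute_force(plane_a, plane_b, plane_c):
    """Check all combinations for the straight-track condition xA + xC = 2 xB."""
    triplets = []
    for xa in plane_a:
        for xb in plane_b:
            for xc in plane_c:
                if xa + xc == 2 * xb:
                    triplets.append((xa, xb, xc))
    return triplets


if __name__ == "__main__":
    print(find_triplets_brute_force([1, 7], [3, 5], [5, 3]))
```

With n hits per plane the innermost test runs n^3 times, which is the cost the Tiny Triplet Finder is designed to avoid.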

Page 60: FPGA: From Flashing LED to Reconfigurable Computing

Tiny Triplet Finder Operations
Pass I: Filling Bit Arrays

[Figure: physical planes and bit arrays/shifters. Note: flipped bit order]

For any hit… fill a corresponding logic cell.

• xA + xC = 2 xB
• xA = - xC + constant

Page 61: FPGA: From Flashing LED to Reconfigurable Computing

Tiny Triplet Finder Operations
Pass II: Making Match

[Figure: physical planes and bit arrays/shifters]

For any center plane hit…
Logically shift the bit array.
Perform bit-wise AND in this range.
Triplet is found.
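The two passes can be modeled in Python with integers used as bit arrays. This is a software sketch of the idea under simplifying assumptions (integer hit positions in the range 0 to width-1, straight tracks); in hardware the final decoding loop would be an encoder rather than a scan.

```python
def fill_bit_array(hits, width: int, flipped: bool = False) -> int:
    """Pass I: set one bit per hit; the plane C array uses flipped bit order."""
    bits = 0
    for x in hits:
        bits |= 1 << ((width - 1 - x) if flipped else x)
    return bits


def tiny_triplet_finder(plane_a, plane_b, plane_c, width: int):
    """Pass II: for each center hit, shift the C array and AND it with the A array."""
    mask = (1 << width) - 1
    bits_a = fill_bit_array(plane_a, width)
    bits_c = fill_bit_array(plane_c, width, flipped=True)
    triplets = []
    for xb in plane_b:
        # A flipped C bit sits at width-1-xC; shifting by 2*xB-(width-1)
        # moves it to 2*xB-xC, which is exactly the matching xA.
        shift = 2 * xb - (width - 1)
        shifted_c = bits_c << shift if shift >= 0 else bits_c >> -shift
        match = bits_a & shifted_c & mask
        for xa in range(width):  # decode the set bits (software stand-in)
            if (match >> xa) & 1:
                triplets.append((xa, xb, 2 * xb - xa))
    return triplets


if __name__ == "__main__":
    print(tiny_triplet_finder([1, 7], [3, 5], [5, 3], width=8))
```

Each center-plane hit costs one shift and one AND over the whole array, so the loop runs n times instead of n^3.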

Page 62: FPGA: From Flashing LED to Reconfigurable Computing

Tiny? Yes, Tiny! – Logic Cell Usage:

AM, CAM, Hough Transform, etc.: O(N^2)
Tiny Triplet Finder: O(N*logN)

Page 63: FPGA: From Flashing LED to Reconfigurable Computing

Hit Matching | Software | FPGA, Typical | FPGA, Resource Saving Approaches
[figure] | O(n^2): for(){ for(){…}} | O(n)*O(N): Comparator Array | Hash Sorter, O(n)*O(N): in RAM
[figure] | O(n^3): for(){ for(){ for(){…} }} | O(n)*O(N^2): CAM, Hough Trans. | Tiny Triplet Finder, O(n)*O(N*logN)
[figure] | O(n^4): for(){ for(){ for(){ for() {…}}}} | |

Page 64: FPGA: From Flashing LED to Reconfigurable Computing

The Winning Line of FPGA Computing

We commonly hear:
• FPGA devices contain millions of gates.
• High parallelism can be implemented in FPGA.
• FPGA cost drops by half every 18 months.

O Freunde, nicht diese Töne! (“Oh friends, not these tones!”)

We want to emphasize, especially to our young students:
1. Creativity,
2. Creativity,
3. Creativity,
on Arithmetic ops, on Algorithms, on Architectures & on All Aspects.

Page 65: FPGA: From Flashing LED to Reconfigurable Computing


Page 66: FPGA: From Flashing LED to Reconfigurable Computing


The End

Thanks

Page 67: FPGA: From Flashing LED to Reconfigurable Computing

Micro-computing vs. Reconfigurable Computing

In a microprocessor, the users specify the program on fixed logic circuits.
In an FPGA, the users specify the logic circuits (as well as the program).
FPGA computing need not follow microprocessor architectures. (But useful experience can be borrowed.)
The usefulness of FPGA reconfigurable computing is still to be fully appreciated.

[Figure: evaluating (100+3-4)*5+7 = ? with data 100, 3, 4, 5, 7 and control LD (-) (+) (*) (+)]
[Figure labels: CPU: Data, Program. FPGA: Data, Program, Configuration.]

Page 68: FPGA: From Flashing LED to Reconfigurable Computing

FPGA Process Sequencing Options

Option | Program Type | Program Length (CLK cycles) | Reprogram | Resource Usage
Finite State Machine (FSM) | Fixed, Wired | 10 | Hard | Small
Enclosed Loop Micro-Sequencer (ELMS) | Memory Stored Program | 10-1000 | Easy | Small
Microprocessor (MP) | Memory Stored Program | >1000 | Easy | Large

Page 69: FPGA: From Flashing LED to Reconfigurable Computing

The Between Counter

[Figure: counter with SLOAD, D[], SCLR and Q[] ports and a comparator (A[], B[], ==) with values N and M-1; example count sequence 0,1,2,3,4,5,6,7,8,9,A, then 5,6,7,8,9,A repeated, then 5,6,7,8,9,A,B,C,D,E,F…]
[Figure: the Between Counter output T addresses a ROM holding instr0 to instrD at PC0 to PCD; the ROM outputs the control signals]

Page 70: FPGA: From Flashing LED to Reconfigurable Computing

ELMS – Enclosed Loop Micro-Sequencer

[Figure: Program Counter addressing a ROM (128 x 36 bits) that outputs the control signals, with Conditional Branch Logic, Loop & Return Logic + Stack, and Reset and CLK inputs]

PC | Control Signals | Operation
00 | 000000000000000 |
01 | 001000100011010 | LD R1, #n
02 | 000010001000000 | LD R2, #addr_a
03 | 000000000000100 | LD R3, #addr_X
04 | 000000010001000 | LD R7, #0
05 | 000000000100001 | BckA1 LD R4, (R2)
06 | 000100000010000 | INC R2
07 | 000001000100000 | LD R5, (R3)
08 | 000100010000001 | INC R3
09 | 001001000100000 | MUL R6, R4, R5
0a | 000000010001000 | EndA1 ADD R7, R7, R6
0b | 000010000010000 | DEC R1
0c | 000000100000100 | BRNZ BckA1

PC+ROM is a good sequencer in FPGA.
Adding Conditional Branch Logic allows the program to loop back, i.e., to jump back as in microprocessors.
Loop & Return Logic + Stack is a special feature of ELMS that supports FOR loops at machine code level.

Page 71: FPGA: From Flashing LED to Reconfigurable Computing

ELMS – Detailed Block Diagram

[Figure: PC with +1 incrementer, Reset and conditional jump logic (CondJMP, JMPIF, JMP, desA) addressing the ROM (128 x 36 bits), which outputs the user control signals; Loop & Return Registers + Stack (128 words) hold CNT, endA and bckA, with Push, Pop, DEC, RTN, LoopBack and LastPass signals and a comparator]

LoopBack = DEC = (PC==endA) && (CNT!=0)
LastPass = (PC==endA) && (CNT==1)

[Figure: instruction word example with fields 0x04, RUNat04, cnt, EndA, BckA]

      FOR BckA1 EndA1 #n
      LD R2, #addr_a
      LD R3, #addr_X
      LD R7, #0
BckA1 LD R4, (R2)
      INC R2
      LD R5, (R3)
      INC R3
      MUL R6, R4, R5
EndA1 ADD R7, R7, R6
      LD R8, R7

The Stack supports nested loops and subroutine calls up to 128 layers.
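The loop-back rule can be explored with a small behavioral model in Python. This is a sketch under stated assumptions, not a description of the actual ELMS hardware: the `FOR` instruction is assumed to push the current loop registers and load CNT with n-1, the registers are popped when the loop falls through, and the LastPass signal is not modeled. The tuple-based program format is invented for this example.

```python
def run_sequencer(program: list[tuple], max_steps: int = 10_000) -> list[str]:
    """Return the names of the instructions in the order they execute.

    program[pc] is ("FOR", bck_addr, end_addr, n) or ("OP", name).
    """
    stack = []                         # saved (cnt, end_a, bck_a) of outer loops
    cnt, end_a, bck_a = 0, None, None  # current loop registers
    pc = 0
    trace = []
    while pc < len(program) and len(trace) < max_steps:
        instr = program[pc]
        if instr[0] == "FOR":
            _, bck, end, n = instr
            stack.append((cnt, end_a, bck_a))
            cnt, end_a, bck_a = n - 1, end, bck  # assumed: n-1 loop-backs remain
            trace.append("FOR")
        else:
            trace.append(instr[1])
        if pc == end_a:
            if cnt != 0:               # LoopBack = DEC = (PC==endA) && (CNT!=0)
                cnt -= 1
                pc = bck_a
                continue
            cnt, end_a, bck_a = stack.pop()  # loop finished: restore outer loop
        pc += 1
    return trace


if __name__ == "__main__":
    n = 3
    program = [
        ("FOR", 4, 9, n),
        ("OP", "LD R2"), ("OP", "LD R3"), ("OP", "LD R7"),
        ("OP", "LD R4"), ("OP", "INC R2"), ("OP", "LD R5"),
        ("OP", "INC R3"), ("OP", "MUL"), ("OP", "ADD"),
        ("OP", "LD R8"),
    ]
    trace = run_sequencer(program)
    print(len(trace), trace.count("MUL"))
```

In this model the six-instruction body repeats with no extra cycles spent on decrementing or branching, since the loop-back decision is made alongside the last body instruction.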

Page 72: FPGA: From Flashing LED to Reconfigurable Computing


Software: Using Spread Sheet as Compiler

Page 73: FPGA: From Flashing LED to Reconfigurable Computing

What’s Good About ELMS: FOR Loops at Machine Code Level w/ Zero Overhead

$$Y = \sum_{i=0}^{n} a_i X_i$$

The looping sequence is known in this example before entering the loop.
Regular microprocessors treat the sequence as unknown.
ELMS supports FOR loops with pre-defined iterations at machine code level.
Execution time is saved and micro-complexities (branch penalty, pipeline bubble, etc.) associated with conditional branches are avoided.

Microprocessor:
      LD R1, #n
      LD R2, #addr_a
      LD R3, #addr_X
      LD R7, #0
BckA1 LD R4, (R2)
      INC R2
      LD R5, (R3)
      INC R3
      MUL R6, R4, R5
EndA1 ADD R7, R7, R6
      DEC R1
      BRNZ BckA1
Conditional Branch: 25%

The ELMS:
      FOR BckA1 EndA1 #n
      LD R2, #addr_a
      LD R3, #addr_X
      LD R7, #0
BckA1 LD R4, (R2)
      INC R2
      LD R5, (R3)
      INC R3
      MUL R6, R4, R5
EndA1 ADD R7, R7, R6
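Counting instructions in the two listings gives a quick estimate of the saving. The Python sketch below assumes one clock cycle per instruction and ignores branch penalties, which would only widen the gap; the function names are made up for this example.

```python
SETUP_INSTRUCTIONS = 4  # LD R1 (or FOR), LD R2, LD R3, LD R7
LOOP_BODY = 6           # LD, INC, LD, INC, MUL, ADD
LOOP_CONTROL = 2        # DEC R1 and BRNZ BckA1, microprocessor only


def microprocessor_cycles(n: int) -> int:
    """Cycles for n passes when every pass also executes DEC and BRNZ."""
    return SETUP_INSTRUCTIONS + (LOOP_BODY + LOOP_CONTROL) * n


def elms_cycles(n: int) -> int:
    """Cycles for n passes when looping is handled outside the instruction stream."""
    return SETUP_INSTRUCTIONS + LOOP_BODY * n


def loop_control_fraction() -> float:
    """Share of each microprocessor loop pass spent on loop control."""
    return LOOP_CONTROL / (LOOP_BODY + LOOP_CONTROL)


if __name__ == "__main__":
    print(microprocessor_cycles(100), elms_cycles(100), loop_control_fraction())
```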

Page 74: FPGA: From Flashing LED to Reconfigurable Computing

ELMS as a Hardware Loop Sequencer

[Figure: Program Counter addressing a ROM (128 x 36 bits) that outputs the control signals, with Conditional Branch Logic, Loop & Return Logic + Stack, and Reset and CLK inputs]
[Figure from http://www.analog.com/]

There are DSP devices that support hardware loops for zero-overhead loop implementation.
The emphasis of ELMS is that FOR loops and subroutine calls/returns are treated the same.
Any program passage can be used as a subroutine without needing a return instruction.
The ELMS uses as few resources as possible for FPGA implementation.

Page 75: FPGA: From Flashing LED to Reconfigurable Computing

No ALU => Small Resource Usage

[Figure: three architectures]
Princeton Architecture (the von Neumann Architecture): Program Control and ALU with one Program/DATA Memory.
Harvard Architecture: Program Control and ALU with separate Program Memory and DATA Memory.
Fermilab (?) Architecture: Sequencer (ELMS) with Program Memory; Data Processor with DATA Memory.

The Princeton Architecture is more suitable at system level, while the Harvard Architecture is better suited at micro-structure level.
Regular microprocessors cannot run looped programs without an ALU.
The ALU takes a large amount of resources and may not be efficiently utilized for data processing tasks in FPGA.
The ELMS can run nested loop programs without an ALU.
Further separation of program and data is therefore possible.
The ELMS is kept small.

Page 76: FPGA: From Flashing LED to Reconfigurable Computing

The Frequency Spectrum of DAC (2)

[Figure: accumulator (D, Q) with the DAC input and carry-out CO; output waveform (levels 0 to 4) plotted for counts 896 to 1024]
[Plots: three frequency spectra, 0 to 100 versus frequency 0 to 512]

The first harmonic may be suppressed.
Works better with regular low-pass filters.

Possible Student Lab

Page 77: FPGA: From Flashing LED to Reconfigurable Computing

The Frequency Spectrum of DAC (1)

[Figure: counter Q on comparator input B, DAC input on input A, comparator output A>B; output waveform (levels 0 to 4) plotted for counts 896 to 1024]
[Plots: three frequency spectra, 0 to 100 versus frequency 0 to 512]

The first harmonic has dominant concentration.
Works better with a notch filter.

Page 78: FPGA: From Flashing LED to Reconfigurable Computing

Digital Calibration Using Twice-Recording Method

[Figure: delay chain with inputs IN and CLK]

Use a longer delay line.
Some signals may be registered twice at two consecutive clock edges.

$$N_2 - N_1 = \frac{1/f}{\Delta t}$$

1/f: clock period
Δt: average bin width

The two measurements can be used:
• to calibrate the delay.
• to reduce digitization errors.
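The relation between the two recorded numbers and the average bin width can be put into a short Python sketch. The conversion of a bin number to time with a zero offset is an assumption of this example; the slides only state the relation for the difference of the two numbers.

```python
def average_bin_width_ps(n1: int, n2: int, clock_period_ps: float) -> float:
    """Average bin width from one edge recorded at two consecutive clock edges.

    The two bin numbers differ by exactly one clock period of delay,
    so width = period / (N2 - N1).
    """
    if n2 <= n1:
        raise ValueError("second recording must be further down the delay line")
    return clock_period_ps / (n2 - n1)


def fine_time_ps(n1: int, n2: int, clock_period_ps: float) -> float:
    """Fine time of the first recording, assuming bin 0 corresponds to zero delay."""
    return n1 * average_bin_width_ps(n1, n2, clock_period_ps)


if __name__ == "__main__":
    # Hypothetical readings: bins 5 and 25 with a 2500 ps clock period.
    print(average_bin_width_ps(5, 25, 2500.0), fine_time_ps(5, 25, 2500.0))
```

Because the bin width is measured together with every such hit, slow drifts with voltage and temperature are tracked automatically, although only as an average over all bins.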

Page 79: FPGA: From Flashing LED to Reconfigurable Computing

Digital Calibration Result

[Plot: TDC Output at Different PS Voltage; TDC outputs N1 (1st TDC) and N2 (2nd TDC), 0 to 25, versus VCCINT, 1.5 to 2.5 V]
[Plot: TDC Output at Different PS Voltage; N1, N2 and corrected time Tc, 0 to 25, versus VCCINT, 1.5 to 2.5 V]

The power supply voltage changes from 2.5 V to 1.8 V (about the same as 100 °C to 0 °C).
Delay speed changes by 30%.
The difference of the two TDC numbers reflects the delay speed.

[Equation: corrected time Tc expressed through N0, N1, N2 and T]

Warning: the calibration is based on the average bin width, not bin-by-bin widths.

Page 80: FPGA: From Flashing LED to Reconfigurable Computing

Indirect Cost of Complexity

If something like this can do the job…

… why do these?

Page 81: FPGA: From Flashing LED to Reconfigurable Computing

Tiny Triplet Finder: Reuse Coincident Logic via Shifting Hit Patterns

[Figure: detector layers C1, C2 and C3]

One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.

Page 82: FPGA: From Flashing LED to Reconfigurable Computing

Tiny Triplet Finder for Circular Tracks

[Figure: two bit arrays with shifters (labeled *R1/R3 and *R2/R3) feed the bit-wise coincident logic; the triplet map output goes to the decoder]
[Plot: both axes 0 to 128]

1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift bit arrays and check for coincidence. (n3 clock cycles)

Also works with more than 3 layers.