a scalable pipelined architecture for biomimetic vision...

A Scalable Pipelined Architecture for Biomimetic Vision Sensors

DANIEL LLAMOCCA AND BRIAN K. DEAN

Electrical and Computer Engineering Department,

Oakland University

September, 3rd, 2015

Outline Motivation

Contributions

Methodology

Results

Conclusions

Motivation Fly-inspired vision algorithms can outperform traditional

image processing algorithms in motion detection and in-flight obstacle tracking and interception. Applications include high-speed target tracking for unmanned

aerial and ground vehicles, structural monitoring. Dedicated hardware implementations are desired when the

large amounts of data are to be processed in parallel.

OpticalElectrical Interface

7

Opticalfront-end

7

ADCs

FilterLight

Adaptation

C1-7

Supportinghardware

FLY-INSPIRED VISION SENSOR

B B

BO

1 2

3 4 5

6 7

FO1-7

LA

O1

-7

Vision Sensor: Optical front-end

Optical electrical interface

Analog-to-digital converters

Supporting digital hardware for filtering and light adaptation.

Motivation Vision Sensor: Optical front-end

Plano-convex lens (12 mm diameter and 12 mm focal length) placed above seven photodiodes arranged in a hexagonal pattern

Hexagonal pattern approximates fly optical arrangement.

Photoreceptor response: overlapping Gaussians

Contributions Fully pipelined and Scalable Hardware

Implementation for the Biomimetic Sensor The fully-customizable fixed-point architecture allows

users to quickly modify design parameters (# of input bits, output format, # of bits per of iterations, # of bits of filters’ coefficients).

Fully-pipelined architecture is achieved by unrolling the IIR filter architecture.

Generic VHDL code validated on an FPGA The fully-parameterized RTL VHDL code is not tied to a

particular device or vendor. Design Space Exploration

The fully-parameterized VHDL code allows us to create a set of different hardware profiles by varying the design parameters. We can then explore trade-offs among design parameters, accuracy, resources, andexecution time.

Methodology Block Diagram: Data path uses fixed-point representation:

Input: [B B-1], Output/Intermediate Signals [BO BQ] Design Parameters: B, BO, BQ, NH (# of bits per filters’ coefficients)

60 Hz IIRNotch Filter

BC1

BO FO1


BC2

BO FO2


BC3

BO FO3


BC4

BO FO4


BC5

BO FO5


BC6

BO FO6


BC7

BO FO7

Average Unit

BO AVG

0.159 Hz Low-PassDA FIR Filter

N=24,L=6,symmetricE

BO

AVG_FILT

v v

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

...

RL_AVG + RL_FIR

E

E

-+

RL_FIRRL_AVG

RL_IIR

BO LAO7

-+

BO LAO1

-+

BO LAO2

-+

BO LAO3

-+

BO LAO4

-+

BO LAO5

-+

BO LAO6

v

v

NH

BQ

B

BO

Architecture:7 IIR filters,1 average unit,1 FIR filter,7 subtractors,7 chain registers(synchronization registers that allow for pipelining)

𝑅𝐿_𝐴𝑉𝐺 = 𝐵𝑂 + 2 log2𝑁𝑅𝐿_𝐹𝐼𝑅 = log2 𝐵𝑂 + 1 + log2𝑁/2𝐿 + 2𝑅𝐿_𝐼𝐼𝑅 = 7

Methodology IIR Filter (60 Hz Notch filter): fs=1 KHz, 2nd order IIR filter

Direct implementation: data dependencies prevent pipelining Look-ahead transformation: The 2nd order IIR filter is turned

into a 4th order IIR filter with no data dependencies.

𝐻 𝑧 =𝑏0+𝑏1𝑧

−1+𝑏2𝑧−2

1+𝑎1𝑧−1+𝑎2𝑧

−2 𝐻𝑃(𝑧) =𝑏𝑝0+𝑏𝑝1𝑧

−1+𝑏𝑝2𝑧−2+𝑏𝑝3𝑧

−3+𝑏𝑝4𝑧−4

1+𝑎𝑝2𝑧−2+𝑎𝑝4𝑧

−4

Scattered look-ahead decomposition with powers of 2: coefficients of the 4th order IIR filter avoid instability.

X_in B

+

NH

Addertree

bp(0)

B

bp(1)NH

bp(2)NH

bp(3)

bp(4)NH

NH

IIR FILTER

+Addertree

NH-ap(2)

NH-ap(4)

BO

FO

w[n]

y[n-4]y[n] y[n-2]

clock

w[n] w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]

y[n]

y[n-1]

y[n-2]

y[n-3]

y[n-4]

y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9]

y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8]

y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7]

y[0] y[1] y[2] y[3] y[4] y[5] y[6]

y[0] y[1] y[2] y[3] y[4] y[5]

y[n] = w[n] - ap(2)*y[n-2] - ap(4)*y[n-4]

Assumption: The adder tree does not have delay units

Methodology IIR Filter (60 Hz Notch filter): fs=1 KHz, 2nd order IIR filter

Retiming: An actual adder tree usually includes register levels in order to increase the frequency of operation.

For example, a 3-input adder tree usually has 2 register levels. The delay breaks the pipeline of the previous figure.

Retiming is used here to address this issue: the delay units that create y[n-2] are embedded into the two register levels of the adder tree.

+

+

NH-ap(4)

NH-ap(2)

BO

FO

+

+

NH-ap(4)

NH-ap(2)

BO

FO

w[n]

w[n]

clock

w[n] w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]

w[n-2]

y[n-2]

y[n-4]

w[0] w[1] w[2] w[3] w[4] w[5] w[6] w[7] w[8] w[9]

y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7]

y[0] y[1] y[2] y[3] y[4] y[5]

y[n] = w[n] - ap(2)*y[n-2] - ap(4)*y[n-4]

Assumption: The adder tree has two delay units

y[n]

-ap(2)*y[n-2] -ap(4)*y[n-4] + w[n-2]

y[n] y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9]

Methodology FIR Filter with Distributed

Arithmetic Efficient multiplier-less

implementation where coefficients are constant.

Cut-off frequency: 0.159 Hz. Stopband: -41dB 24-tap symmetric low-pass

filter Fully pipelined system with I/O

delay of RL_FIR. LUT input size = 6 Coefficients format:

[NH NH-1] Constant coefficients loaded as

a text file.

BO

0.159 Hz Low-PassDA FIR Filter

N=24,L=6,symmetricE

BO

v

RL_FIR

...

s[0]s[M-2]

s[0]:

s[L-1]:

X

x[0]

x[N-1] x[M+1]

x[M]x[M-1]x[1]

x[N-2]B

only whenN is even

B+1

E

E E E

E E E

E

sB

s0

+

Addertree

s[M-1]

...

...

-2B

20...

LUT L-to-LO

LUT L-to-LO

Filter Block 0

s[L]:

s[2L-1]:

...

Filter Block 1

s[M-L]:

s[M-1]:

...

...

Filter Block M/L-1

LO=NH+log2L

x[M]

x[M

-1]...

...

...

sB

s0

...

-2B

20...

LUT L-to-LO

LUT L-to-LO

...

sB

s0

-2B

20...

LUT L-to-LO

LUT L-to-LO

B+1 B+1

sB s1 s0...

sB s1 s0...

sB s1 s0...

+

+

+

Y

X Y

BO

BO

L

B

N

NH

M=N/2

Methodology Averaging unit

The seven outputs FOx (x=1..7) are averaged out by this block. This requires a 7-input pipelined adder tree and an array divider.

X(0) X(1)

+++

++

+

0

BY

+/-

Yo

ARRAY

DIVIDER

+/-

BY=BO+log2(N)

BY

log2(N+1)N=7

+/-

+/-

BY

BY

BY0BY

BO BO BO BO BO BO BO

BY BO

X(2) X(3) X(4) X(5) X(6)

v

E

BY

+lo

g2 (N

)

N

BOAVG

BY

Input format: [BO BQ] Output format: [BO BQ] I/O delay:𝑅𝐿_𝐴𝑉𝐺 = 𝐵𝑂 + 2 log2𝑁

The adder tree output requires log2𝑁 extra integer bits, but the divider gets rid of those bits, hence the output of the Average Unit only needs BO bits.

Accuracy can always be increased by incrementing the number of fractional bits the divider generates.

Methodology Experimental Setup:

Input signals: seven overlapping Gaussian-shaped signals (close match to the angular displacement response of the fly’s rhabdomers). 500 samples generated per channel, values quantized with 8, 10, and 12 bits per sample.

Design Space Exploration: Parameters: BO=16, NH =10,12,14,16, B=8,10,12, BQ=10,11,12,13,14

Accuracy measurement: PSNR Test 1: FPGA and software (MATLAB) implementation uses

the quantized input samples. This allows us to study the effect of the fixed-point architecture on accuracy.

Test 2: Only FPGA uses the quantized input samples. This allows us to study the effect of input quantization and the fixed-point architecture on accuracy.

Synthesis of VHDL code: Artix-7 XC7A100T FPGA

Results Input/Output Behavior

Case: B=12, [BO BQ] = [16 14], NH=16. There is not much visual difference if we change the parameters.

The output signals constitute the output of a primary signal path required for all image processing techniques.

0 0.2 0.4 0.6 0.8 1-0.2

0

0.2

0.4

0.6

0.8

1

1.2

Normalized Target Location

Norm

aliz

ed O

utp

ut V

olta

ge

0 0.2 0.4 0.6 0.8 1-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

Normalized Target Location

Norm

aliz

ed O

utp

ut V

olta

ge

C1

C6C3

C4 C5

C2

C7

LAO1

LAO6LAO3 LAO4

LAO5

LAO2LAO7

(a) (b)

Results Hardware Resource Utilization

The figure shows resources (in terms of 6-input LUTs and registers) for all the cases where [BO BQ] = [16 14]

The effect of BQ on resources is negligible and it is not shown.

For proper comparison, the DSP48E1s blocks were not used.

3500 4000 4500 5000 5500 6000

6000

8000

10000

12000

14000

LUTs

Regis

ters

B=8

B=10

B=12

NH=10

NH=12

NH=14

NH=16

Execution Time For BO=16, N=7, L=6, the I/O delay is given by:

𝑅𝐿_𝑆𝑌𝑆 = 𝑅𝐿_𝐼𝐼𝑅 + 𝑅𝐿_𝐴𝑉𝐺 + 𝑅𝐿_𝐹𝐼𝑅 = 36 cycles To compute NS samples per channel, we need 36+NS cycles. For 100 MHz, the execution time is: 36 + 𝑁𝑆 × 10−8𝑠𝑒𝑐𝑜𝑛𝑑𝑠

𝑇ℎ𝑟𝑜𝑢𝑔ℎ𝑝𝑢𝑡 =108

1+ 36𝑁𝑆

𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑝𝑒𝑟 𝑐ℎ𝑎𝑛𝑛𝑒𝑙/𝑠𝑒𝑐𝑜𝑛𝑑

Results Accuracy (PSNR)

Test 1: FPGA and software implementations use the quantized input samples.

Results only shown for B=12, as the effect of input bit-width (B) is negligible. NH and BQ have the strongest effect on accuracy. 10 11 12 13 14

65

70

75

80

85

90

95

BQ

PS

NR

(dB

)

NH=10

NH=12

NH=14

NH=16

Results Accuracy (PSNR)

Test 2: Only FPGA uses the quantized input samples.

Accuracy-resource trade-offs are indicated.

Accuracy is heavily affected by B and BQ. NH has some effect.

Resources are affected by B and NH.

3500 4000 4500 5000 5500 6000

55

60

65

70

75

80

85

Slices

PS

NR

(dB

)

NH

NH

=1

0

NH

=10

NH

=1

0 BQBQ

BQ

NH

=1

4

10...1

4

10

11

12

1314

NH

=1

6

NH

=1

2

NH

=16

Results Application: Edge Detection

Edge feature extraction: Gradients are computed specific to the orientation

Conclusions A scalable and fully-pipelined fixed-point architecture was

implemented and successfully validated.

This works demonstrates the feasibility of incorporatingdigital hardware into the design of largely analogcompound vision sensors.

The next step is to implement a system with severalsensors on an FPGA that can adapt resources at run-timebased on user-generated or automatic constraints.

Current work consist on implementing selected imageprocessing algorithms based on the outputs LAOx (x=1..7)generated by the system.

a scalable pipelined architecture for biomimetic vision...

Documents