1

Advanced algorithms and research applications

Laurent Lemarchand
LISyC/UBO

[email protected] (univ-brest.fr)

2

Logic synthesis for LUT-based FPGA: presentation

LUT-based FPGA

Synthesis flow

Boolean networks for circuit synthesis

Large-scale problems: parallelism and partitioning (cf. IEEE TCAD, 01/2012)

Algorithms

3

Logic synthesis for LUT-based FPGA: synthesis flow

LUT-based circuits Cells Routing

4

Logic synthesis for LUT-based FPGA: synthesis flow

Runtimes (min) for a circuit of 1 000 4-LUTs:

Step                       Tool         Runtime (min)
Simplification             Mis II       2
K-bounded decomposition    Roth-Karp    6
Covering (surface)         Mis-Pga      36
Covering (unit delay)      Flowmap      3

Flow: high-level synthesis, logic synthesis (logic optimization, technology mapping), placement, routing

5

Logic synthesis for LUT-based FPGA: synthesis problem sizes

Device            Gates        Flip-flops   LUTs     Multipliers   Block RAM (kbit)
Virtex-II 1000    1 million    10240        10240    40            720
Virtex-II 3000    3 million    28672        28672    96            1728
Spartan-3 1000    1 million    15360        15360    24            432
Spartan-3 2000    2 million    40960        40960    40            720
Virtex-5 LX30     -----        19200        19200    32            1152
Virtex-5 LX50     -----        28800        28800    48            1728
Virtex-5 LX85     -----        51840        51840    48            3456
Virtex-5 LX110    -----        69120        69120    64            (not given)

6


Algorithms have at least $O(n^2)$ complexity
Combinatorial explosion
Resource problems: computation and memory

[Figure: axes are size (SOP size) and time (min)]

7



Divide and conquer: partition the design

8

Logic synthesis for LUT-based FPGA: boolean network

Directed acyclic graph

Primary inputs

Primary outputs

9

Logic synthesis for LUT-based FPGA: boolean network and technology mapping

Input: a directed acyclic graph (DAG) G = (V, E)
Output: a K-feasible DAG G' = (V', E'):

$\forall v \in V',\ |\mathrm{inputs}(v)| \le K$

1 node = 1 K-LUT

Technology mapping has many objectives:

Surface: number of LUTs
Delay: critical paths
Routability: connection degrees and density

Flow: optimized boolean network, decomposition, feasible network, technology mapping, optimized feasible network
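The K-feasibility condition stated on this slide can be checked mechanically. The sketch below is only an illustration: the dictionary representation `fanins` (node name mapped to its list of input nodes) and the function names are assumptions, not part of any synthesis tool.

```python
def is_k_feasible(fanins, k):
    """True if every node of the network has at most k inputs."""
    return all(len(inputs) <= k for inputs in fanins.values())


def infeasible_nodes(fanins, k):
    """Nodes that still need decomposition before mapping onto k-LUTs."""
    return sorted(v for v, inputs in fanins.items() if len(inputs) > k)


# Hypothetical network: g has 5 inputs, h has 2.
network = {"g": ["a", "b", "c", "d", "e"], "h": ["g", "f"]}
feasible_for_4 = is_k_feasible(network, 4)
to_decompose = infeasible_nodes(network, 4)
```

With K = 4, node `g` is the only one that must be decomposed before each remaining node can be implemented by one LUT.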

10

Chortle-crf FPGA (Field Programmable Gate Array)

Minimize the number of LUTs used

Place and route the LUTs

11

Chortle-crf Dynamic Programming

Technology mapping for LUT-based FPGA

Cluster logic nodes into k-LUTs : one LUT can implement any logic function of up to k inputs (fanin) (truth table)

[Figure: logic network with nodes a to h; a 3-LUT with inputs a, b, c computes g = func(a,b,c)]
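To make the "truth table" remark concrete, here is a minimal sketch of a k-LUT modelled as a lookup table over all 2^k input combinations. The helper names `make_lut` and `eval_lut` and the majority function are illustrative assumptions.

```python
from itertools import product


def make_lut(func, k):
    """Tabulate func over all 2**k input combinations (the LUT contents)."""
    return {bits: func(*bits) for bits in product((0, 1), repeat=k)}


def eval_lut(lut, *inputs):
    """Evaluating a LUT is a single table lookup."""
    return lut[tuple(inputs)]


# A 3-LUT programmed with the majority function of (a, b, c).
majority = make_lut(lambda a, b, c: int(a + b + c >= 2), 3)
g = eval_lut(majority, 1, 1, 0)
```

Any other function of up to 3 inputs fits in the same 8-entry table, which is why the mapper only has to care about the number of inputs of each cluster.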

12

Chortle-crf Dynamic Programming

[Figure: nodes d, e, b1 and b2]

Process from inputs to outputs

The solution for d+b1+b2 must:
Minimize the number of LUTs
Minimize the fanin of the head LUT
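The input-to-output processing can be illustrated with a simplified dynamic program on a tree. This is not Chortle-crf itself: it assumes a fanout-free tree whose gates each have at most k children and it performs no node decomposition. The names `tree` and `map_tree` are hypothetical.

```python
from math import inf


def map_tree(tree, root, k):
    """Return {u: cost}: minimal LUT count for the subtree of root when the
    head LUT has exactly u inputs. tree maps gate -> list of children;
    names absent from tree are primary inputs."""
    table = {0: 1}  # the head LUT itself, no input attached yet
    for child in tree[root]:
        if child not in tree:
            options = {1: 0}  # primary input: one wire, no extra LUT
        else:
            sub = map_tree(tree, child, k)
            options = {1: min(sub.values())}  # child keeps its own head LUT
            for u, cost in sub.items():       # or child is absorbed into ours
                if cost - 1 < options.get(u, inf):
                    options[u] = cost - 1
        merged = {}
        for used, cost in table.items():
            for u, extra in options.items():
                total_in = used + u
                if total_in <= k and cost + extra < merged.get(total_in, inf):
                    merged[total_in] = cost + extra
        table = merged
    return table


# Hypothetical tree: g = f(e1, e2), e1 = f(a, b), e2 = f(c, d).
tree = {"g": ["e1", "e2"], "e1": ["a", "b"], "e2": ["c", "d"]}
luts_k4 = min(map_tree(tree, "g", 4).values())
luts_k3 = min(map_tree(tree, "g", 3).values())
luts_k2 = min(map_tree(tree, "g", 2).values())
```

The returned table keeps one entry per head-LUT fanin, so a caller can apply the second criterion of the slide (smallest head fanin) among the solutions with the minimal LUT count.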

13

Let G = (V, E) with |V| = 2n. Find a partition X = V1 ∪ V2 such that |V1| = |V2| = n while minimizing the number of edges crossing the parts. Pairwise exchange neighborhood.

Local search: 2-way partitioning problem

[Figure: exchanging nodes a and b changes the cut from c = 7 to c = 5 = 7 - 3 + 1]
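The cut value and the gain of one pairwise exchange can be computed as in the sketch below. The graph (two triangles joined by one edge) and the function names are assumptions chosen for illustration, not the example of the figure.

```python
def cut_size(edges, part):
    """Number of edges whose endpoints are in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def swap_gain(edges, part, a, b):
    """Cut reduction obtained by exchanging a and b (negative if worse)."""
    swapped = dict(part)
    swapped[a], swapped[b] = part[b], part[a]
    return cut_size(edges, part) - cut_size(edges, swapped)


# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
part = {0: 1, 1: 1, 3: 1, 2: 2, 4: 2, 5: 2}  # deliberately poor start
initial_cut = cut_size(edges, part)
gain = swap_gain(edges, part, 3, 2)
```

Exchanging nodes 3 and 2 puts each triangle in its own part, so the new cut is the initial cut minus the gain.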

14

At each step, choose the exchange that maximizes the cut gain
Constraint: a node can be swapped only once
N/2 steps at most

2-way partitioning: Kernighan-Lin heuristic
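A single pass following the rules above might look like the sketch below. It recomputes the cut for every candidate pair, which is far slower than the gain bookkeeping of the original Kernighan-Lin heuristic; the function names and the test graph are assumptions.

```python
def cut_size(edges, side):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if side[u] != side[v])


def kl_pass(edges, side):
    """One pass: side maps node -> 0 or 1. Returns (best partition, its cut)."""
    side = dict(side)
    locked = set()
    best_side, best_cut = dict(side), cut_size(edges, side)
    for _ in range(len(side) // 2):          # N/2 steps at most
        base = cut_size(edges, side)
        choice = None                        # (gain, a, b)
        for a in side:
            if side[a] != 0 or a in locked:
                continue
            for b in side:
                if side[b] != 1 or b in locked:
                    continue
                side[a], side[b] = 1, 0      # tentative exchange
                gain = base - cut_size(edges, side)
                side[a], side[b] = 0, 1      # undo
                if choice is None or gain > choice[0]:
                    choice = (gain, a, b)
        if choice is None:
            break
        gain, a, b = choice
        side[a], side[b] = 1, 0              # apply best exchange, even if gain < 0
        locked.update((a, b))                # each node is swapped only once
        if base - gain < best_cut:
            best_side, best_cut = dict(side), base - gain
    return best_side, best_cut


# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
best_side, best_cut = kl_pass(edges, start)
```

Accepting exchanges with negative gain and keeping the best partition seen during the pass is what lets the heuristic escape some local minima of plain pairwise exchange.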

15

Extending 2-way partitioning: recursively, or Kernighan-Lin based

K-way partitioning: minimize the global cut, balance the part sizes

Multi-level partitioning (METIS)

K-way partitioning: techniques

16

Multi-level partitioning

Multi-level partitioning: hMETIS

Clustering: grouping nodes
Initial partitioning of the clustered graph
Unclustering, then refinement

[Figure: clustering and unclustering phases; partitioning shown after unclustering and after refinement]
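The clustering phase can be sketched with a greedy matching that merges each unmatched node with one unmatched neighbour. This is an assumption made for illustration on plain graphs; hMETIS works on hypergraphs and uses its own coarsening schemes. The name `coarsen` is hypothetical.

```python
def coarsen(nodes, edges):
    """Return (cluster, coarse_edges): cluster maps node -> cluster id,
    coarse_edges maps (id1, id2) -> number of original edges between them."""
    neighbours = {v: [] for v in nodes}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    cluster = {}
    next_id = 0
    for v in nodes:
        if v in cluster:
            continue
        cluster[v] = next_id
        for w in neighbours[v]:
            if w not in cluster:       # group v with its first free neighbour
                cluster[w] = next_id
                break
        next_id += 1
    coarse_edges = {}
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        if cu != cv:                   # edges inside a cluster disappear
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + 1
    return cluster, coarse_edges


# Path 0 - 1 - 2 - 3 is reduced to two clusters linked by one edge.
cluster, coarse_edges = coarsen([0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
```

The initial partition is then computed on the small clustered graph and projected back through `cluster` before refinement.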

17

Circuit partitioning
Nodes: logic gates
Edges: connections
Create and optimize subsystems

Multi-level partitioning: logic synthesis

18

Motivations: divide and conquer
Simple, multi-algorithm
Problem size, runtimes
Parallelism

Quality? Synthesis, even with multiple algorithms, is rarely optimal
Better heuristics

Limit information loss

Partition-based logic synthesis: data partitioning

19

Avoid maximum loss: 2-way partitioned network

Partition-based logic synthesis: network partitioning

Nodes are assigned to parts 1 and 2 while minimizing the cut and balancing the parts

20

Avoid information loss: primary I/Os are generated

Partition-based logic synthesis: network partitioning

21

Information loss: A and B are in 2 distinct parts

Nodes are disconnected by the partitioning

Partition-based logic synthesis: quality loss due to information loss

22

Each part must lead to the same computing load
Evaluate synthesis algorithm runtimes a priori

Partition-based logic synthesis: load balancing

[Figure: two schedules of per-part runtimes over time]

Accumulated time: 125, speedup: 125/100 = 1.25 (longest part: 100)

Accumulated time: 125, speedup: 125/35 = 3.57 (longest part: 35)
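The two speedups above follow from dividing the accumulated (sequential) time by the runtime of the longest part. A minimal sketch, where the individual part runtimes are hypothetical values chosen only so that they sum to 125:

```python
def speedup(part_times):
    """Accumulated time divided by the time of the slowest part."""
    accumulated = sum(part_times)
    parallel = max(part_times)   # all parts run concurrently
    return accumulated / parallel


unbalanced = [100, 15, 10]       # assumed split, longest part 100
balanced = [35, 30, 30, 30]      # assumed split, longest part 35
s_unbalanced = speedup(unbalanced)
s_balanced = round(speedup(balanced), 2)
```

The same total work gives a much better speedup once no single part dominates, which is why runtimes have to be estimated before partitioning.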

23

Depends on the synthesis algorithm
And on the network structure

Partition-based logic synthesis: load balancing

Algorithm                 (a)     (b)
Boolean simplification    O(n)    O(1)
Technology mapping        O(1)    O(n)

24

Depends on the nature of the algorithms
Local, e.g. k-feasible network decomposition
Global, e.g. delay optimization

Critical path optimization

Evaluation criteria: synthesis runtimes and speedups, quality

LUT-based FPGA technology mapping tools: Mis-PGA, Flowmap-d

Partition-based logic synthesis: results

25

Area optimization: # LUT (CLBs = 2 LUTs in xc4100 series)
Local and global decompositions (And/Or, Roth/Karp, kernels, …)
Global boolean simplifications (reinjections, substitution, ...)
Exact or heuristic covering (BCP)

Long runtimes
Automatic substitution of exact algorithms by heuristics for large problems
Impact on synthesis runtimes

Partition-based logic synthesis: Mis-PGA

26

Area optimization: #CLBs
Benchmark: 12 circuits from LGSYNTH'91

#LUT (CLBs = 2 LUTs in xc4100 series FPGA)

Quality loss relative to global synthesis without partitioning

Partition-based logic synthesis: Mis-PGA quality

[Figure: loss (%) in #CLB versus # partitions]

27

Cumulated runtime on a single processor

Speedup

Partition-based logic synthesis: Mis-PGA runtimes

[Figure: runtimes (sec) and speedup versus # partitions]

28

Example: Flowmap-d, delay optimization
Critical paths (U) or nominal delay (N, congestion)

Partition-based logic synthesis: Flowmap-d quality

[Figure: loss (%) versus number of parts]

29

Flowmap-d: $O(n^2)$

Partition-based logic synthesis: Flowmap-d runtimes

[Figure: time and speedup versus parts (procs), with the mean]

30

Quality
Loss (25%) with the unit delay model (critical path)
Gain (10%) with the nominal delay model (better optimization with heuristics, since the partitioning process exhibits congested areas)

Speedup: superlinear for large-scale designs

Long runtimes are required to absorb the parallelism overhead

Partition-based logic synthesis: Flowmap-d results

31

QoS in Home Network for VBR

User demand: high Quality of Service for video broadcast
Affordable, well-known, closed network environment
Stream priority according to other network usages
Bandwidth reservation: guaranteed QoS (no delay)

[Figure: home network with gateway, tablet_1, connectedTV_1, PC_1, PC_2, PC_3, console_1, STB_1, links L_1, L_2, L_3 and Wi-Fi]

32

Bandwidth reservation

VBR video encoding case:
Allocate average rate: quality loss
Allocate peak rate: bandwidth loss
Allocate exact rate: impractical (hard constraints)

Tradeoff between peak and exact reservation

[Figure: throughput versus time]

33

Variable Bitrate Hull

Series of bitrates (amount of data per time slot): $b_i$ for each time slot $i$ (e.g. 1-second time slots)

$r_i$: reservation of network bandwidth for each time slot $i$

$\forall i,\ r_i > b_i$: a reservation hull that ensures QoS

[Figure: bitrates and reservations r1 to r4, throughput versus time]
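The hull condition on this slide can be checked slot by slot. In the sketch below, the function names and the sample values are assumptions; the bandwidth reserved but not used is taken to be the sum of the differences between reservation and bitrate.

```python
def is_hull(bitrates, reservations):
    """True if every slot reserves strictly more than the stream needs."""
    return len(bitrates) == len(reservations) and all(
        r > b for b, r in zip(bitrates, reservations)
    )


def lost_bandwidth(bitrates, reservations):
    """Bandwidth reserved but unused, summed over all time slots."""
    return sum(r - b for b, r in zip(bitrates, reservations))


bitrates = [3, 5, 2, 4]            # hypothetical b_i, one per time slot
two_levels = [6, 6, 5, 5]          # two successive reservation levels
average_like = [4, 4, 4, 4]        # roughly the average rate
ok = is_hull(bitrates, two_levels)
waste = lost_bandwidth(bitrates, two_levels)
```

A constant reservation near the average rate fails the test on the peak slot, which is the quality loss mentioned for average-rate allocation.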

34

Constraints on reservation policy

Two aspects are taken into account. Reservation consists in configuring network resources.

M: bound on the number of successive different reservations $r_i$

P: minimal time between 2 reconfigurations (i.e., minimal reservation duration)

[Figure: throughputs, reservations r1 to r4 and lost bandwidth versus time]

35

Graph and optimization goal

[Figure: throughput versus time with reservations r1 to r4, and the corresponding start-to-end path in the graph; total cost: 301]

How to minimize the total cost?

36

Graph paths

[Figure: same reservations r1 to r4 and start-to-end path, total cost: 301. Cost gain: an alternative path using r'1 has total cost: 255]


37

Graph building

One node per time slot
For j > i, an edge $t_i \to t_j$: configuration at time $t_i$ and reconfiguration at $t_j$
Weights correspond to overcosts
An extra node at the end

[Figure: nodes $t_0$, $t_1$, $t_2$, ..., $t_{n+1}$ between start and end; edges weighted cost 0→1, cost 1→2, cost 0→2, cost 0→i, cost 1→i, ...]
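The graph construction can be sketched as follows. The slide does not define the overcost formula; the sketch assumes that an edge i → j holds one reservation equal to the peak bitrate over slots i to j-1, and that its weight is the bandwidth reserved but unused over those slots. Nodes are numbered 0 to n here, with n as the extra end node; `build_graph` is a hypothetical name.

```python
def build_graph(bitrates):
    """Return {(i, j): overcost} for every pair 0 <= i < j <= n."""
    n = len(bitrates)
    edges = {}
    for i in range(n):
        peak = 0
        used = 0
        for j in range(i + 1, n + 1):
            # Extend the reservation interval by slot j-1.
            peak = max(peak, bitrates[j - 1])
            used += bitrates[j - 1]
            edges[(i, j)] = peak * (j - i) - used
    return edges


edges = build_graph([3, 5, 2])   # hypothetical bitrates, 3 time slots
```

Edges between consecutive nodes cost nothing under this assumption, since a one-slot reservation can match the bitrate of that slot exactly.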

38

Best solution computation

Best solution: minimal total overcost, i.e. a shortest path

Constraints on bandwidth allocation
P: minimal time between 2 reconfigurations (i.e., minimal configuration duration)
M: bound on the number of successive different configurations $c_i$

Bellman on DAG

Remove edges i → j such that j - i < P

M first steps of the Ford-Bellman algorithm
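The two constraints can be folded into a layered shortest-path computation: edges shorter than P are skipped, and only paths of at most M edges are considered, which is what M rounds of Ford-Bellman relaxation explore. The sketch below assumes M >= 1, P >= 1 and a cost dictionary indexed by (i, j); the function name and the sample costs are hypothetical.

```python
from math import inf


def shortest_hull(cost, n, M, P):
    """cost[(i, j)]: overcost of edge i -> j, nodes 0..n.
    Returns (total overcost, list of reconfiguration nodes)."""
    # best[k][j]: minimal overcost to reach node j with exactly k edges
    best = [[inf] * (n + 1) for _ in range(M + 1)]
    pred = [[None] * (n + 1) for _ in range(M + 1)]
    best[0][0] = 0
    for k in range(1, M + 1):                 # at most M configurations
        for j in range(1, n + 1):
            for i in range(0, j - P + 1):     # keep only edges with j - i >= P
                c = best[k - 1][i] + cost[(i, j)]
                if c < best[k][j]:
                    best[k][j] = c
                    pred[k][j] = i
    k = min(range(1, M + 1), key=lambda q: best[q][n])
    if best[k][n] == inf:
        return inf, []                        # no path satisfies M and P
    total, path, j = best[k][n], [n], n
    while k > 0:                              # walk predecessors back to node 0
        j = pred[k][j]
        path.append(j)
        k -= 1
    return total, path[::-1]


# Hypothetical overcosts for 3 time slots (nodes 0..3).
cost = {(0, 1): 0, (0, 2): 2, (0, 3): 5, (1, 2): 0, (1, 3): 3, (2, 3): 0}
free = shortest_hull(cost, 3, M=3, P=1)
two_configs = shortest_hull(cost, 3, M=2, P=1)
long_configs = shortest_hull(cost, 3, M=3, P=2)
```

Tightening either constraint can only increase the overcost: with M = 2 one reconfiguration is lost, and with P = 2 the only remaining path is a single reservation for the whole stream.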

39

Simulation results

NS2 simulation tool
Hierarchical token bucket (HTB)
VBR: sources defined by series of bitrates
Delay and buffer size measurements

[Figure (a): fixed-size HTB; delay (s) and buffer size (kB) versus bandwidth allocation]

[Figure (b): M=149; delay (s) and buffer size (kB) versus average bandwidth allocation]

40

Implementation

[Figure: the server (Linux Ubuntu system) runs a streaming component (VLC), a streaming observer and the reservation policy (hull values, times, configurations) above the network interface; the stream goes to the client (Beagleboard), which runs a streaming client (mplayer)]

Raised errors    # errors    # errors / # frames
Reference        8           5.6 x 10^-4
1429 kbits/s     11          7.7 x 10^-4
640 kbits/s      80          56.0 x 10^-4
Hull P5          17          12.0 x 10^-4
