1
Advanced algorithms and research applications
Laurent Lemarchand, LISyC/UBO
[email protected]@univ-brest.fr
2
Logic synthesis for LUT-based FPGA: presentation
LUT-based FPGA
Synthesis flow
Boolean networks for circuit synthesis
Large-scale problems: parallelism and partitioning (cf. TCAD IEEE 01/2012)
Algorithms
3
Logic synthesis for LUT-based FPGA: synthesis flow
LUT-based circuits: cells, routing
4
Logic synthesis for LUT-based FPGA: synthesis flow
Runtimes (min) for a circuit of 1,000 4-LUTs:
Step                       Tool        Runtime (min)
Simplification             Mis II      2
K-bounded decomposition    Roth-Karp   6
Covering (area)            Mis-Pga     36
Covering (unit delay)      Flowmap     3
Flow: high-level synthesis, logic synthesis (logic optimization, technology mapping), placement, routing
5
Logic synthesis for LUT-based FPGA: synthesis problem size
Device            Gates        Flip-flops   LUTs     Multipliers   Block RAM (kbit)
Virtex-II 1000    1 million    10240        10240    40            720
Virtex-II 3000    3 million    28672        28672    96            1728
Spartan-3 1000    1 million    15360        15360    24            432
Spartan-3 2000    2 million    40960        40960    40            720
Virtex-5 LX30     -----        19200        19200    32            1152
Virtex-5 LX50     -----        28800        28800    48            1728
Virtex-5 LX85     -----        51840        51840    48            3456
Virtex-5 LX110    -----        69120        69120    64
6
Logic synthesis for LUT-based FPGA: synthesis problem size
Algorithms have at least O(n^2) complexity
Combinatorial explosion
Resource problems: computations, memory
[Figure: axes size (SOP size) and time (min)]
7
Logic synthesis for LUT-based FPGA: synthesis problem size
Algorithms have at least O(n^2) complexity
Combinatorial explosion
Resource problems: computations, memory
[Figure: axes size (SOP size) and time (min)]
Divide and conquer: partitioning the design
8
Logic synthesis for LUT-based FPGA: Boolean network
Directed acyclic graph
Primary inputs
Primary outputs
9
Logic synthesis for LUT-based FPGA: Boolean network and technology mapping
Input: a directed acyclic graph (DAG) G = (V, E)
Output: a K-feasible DAG G' = (V', E'): for all v ∈ V', |inputs(v)| ≤ K
1 node = 1 K-LUT
Technology mapping: many objectives
  Area: #LUTs
  Delay: critical paths
  Routability: connection degrees and density
Flow: optimized Boolean network, decomposition, feasible network, technology mapping, optimized feasible network
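The K-feasibility condition on this slide (every node has at most K inputs) can be checked with a few lines of Python. This is only a sketch: the dictionary representation, the function names and the example network are assumptions made for illustration, not taken from the slides.

```python
from typing import Dict, List


def is_k_feasible(fanins: Dict[str, List[str]], k: int) -> bool:
    """Return True if every node of the network has at most k inputs."""
    return all(len(inputs) <= k for inputs in fanins.values())


def infeasible_nodes(fanins: Dict[str, List[str]], k: int) -> List[str]:
    """List the nodes that would still need decomposition for a K-LUT target."""
    return sorted(v for v, inputs in fanins.items() if len(inputs) > k)


# Hypothetical network: each node maps to the list of its input nodes.
# Primary inputs have no inputs of their own.
network = {
    "a": [], "b": [], "c": [], "d": [], "e": [],
    "f": ["a", "b", "c"],
    "g": ["a", "b", "c", "d", "e"],
}

print(is_k_feasible(network, 5))     # every node fits in a 5-LUT
print(infeasible_nodes(network, 4))  # nodes too wide for a 4-LUT
```

In this example node g has five inputs, so the network is 5-feasible but would need a decomposition step before mapping onto 4-LUTs.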
10
Chortle-crf FPGA (Field Programmable Gate Array)
Minimize the number of LUTs used
Place and route the LUTs
11
Chortle-crf Dynamic Programming
Technology mapping for LUT-based FPGA
Cluster logic nodes into k-LUTs : one LUT can implement any logic function of up to k inputs (fanin) (truth table)
[Figure: example network with nodes a, b, c, d, e, f, g, h; a 3-LUT with inputs a, b, c, function func(a,b,c) and output g]
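To illustrate why one k-LUT can implement any function of up to k inputs, the sketch below stores a function as a truth table with 2^k entries and evaluates it by lookup. The helper names and the particular Boolean function are hypothetical, chosen only for the example.

```python
from itertools import product


def make_lut(func, k):
    """Tabulate func over all 2**k input combinations (the LUT contents)."""
    return {bits: int(bool(func(*bits))) for bits in product((0, 1), repeat=k)}


def lut_eval(table, *inputs):
    """Evaluating a LUT is a plain table lookup."""
    return table[tuple(inputs)]


# Hypothetical 3-input function standing in for func(a, b, c).
def func(a, b, c):
    return (a and b) or (not c)


lut = make_lut(func, 3)
print(len(lut))              # number of truth-table rows
print(lut_eval(lut, 1, 1, 1))
print(lut_eval(lut, 0, 1, 1))
```

Any other 3-input function would fill the same 8-entry table with different values, which is why the mapping problem only has to count inputs, not gate types.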
12
Chortle-crf Dynamic Programming
[Figure: example with nodes d, e, b1, b2]
Process from inputs to outputs
Solution for d+b1+b2 must:
  Minimize the number of LUTs
  Minimize the fanin of the head LUT
13
Local search: 2-way partitioning problem
Let G = (V, E) with |V| = 2n. Find a partition X = V1 ∪ V2 s.t. |V1| = |V2| = n while minimizing the number of edges crossing the parts
Pairwise exchange neighborhood
[Figure: exchanging nodes a and b changes the cut from c = 7 to c = 5 = 7 - 3 + 1]
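The cut and the effect of one pairwise exchange can be computed as in the following sketch. The graph is a made-up four-node example, not the one drawn on the slide, and the function names are assumptions.

```python
def cut_size(edges, part_of):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])


def swap_gain(edges, part_of, a, b):
    """Cut reduction obtained by exchanging a and b (taken from opposite parts)."""
    swapped = dict(part_of)
    swapped[a], swapped[b] = part_of[b], part_of[a]
    return cut_size(edges, part_of) - cut_size(edges, swapped)


# Hypothetical graph: a and x start in part 0, b and y in part 1.
edges = [("a", "y"), ("a", "b"), ("b", "x"), ("x", "y")]
part_of = {"a": 0, "x": 0, "b": 1, "y": 1}

print(cut_size(edges, part_of))             # cut before the exchange
print(swap_gain(edges, part_of, "a", "b"))  # gain of exchanging a and b
```

Because one node leaves each part, the exchange keeps |V1| = |V2|, so every neighbor in this neighborhood is still a valid balanced partition.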
14
2-way partitioning: Kernighan-Lin heuristic
At each step, choose the exchange that maximizes the cut gain
Constraint: a node can be swapped only once
N/2 steps at most
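A single pass of this heuristic might look like the sketch below. It is a deliberately naive version: the cut is recomputed from scratch for each candidate exchange instead of maintaining incremental gains, and keeping the best partition seen during the pass is the usual Kernighan-Lin convention rather than something stated on the slide. The graph and all names are hypothetical.

```python
def cut_size(edges, side):
    """Number of edges crossing the two parts."""
    return sum(1 for u, v in edges if side[u] != side[v])


def kl_pass(nodes, edges, side):
    """One pass: repeatedly apply the best exchange of unlocked nodes."""
    side = dict(side)
    best_side, best_cut = dict(side), cut_size(edges, side)
    locked = set()
    for _ in range(len(nodes) // 2):          # N/2 steps at most
        best_swap, best_after = None, None
        for a in nodes:
            if a in locked or side[a] != 0:
                continue
            for b in nodes:
                if b in locked or side[b] != 1:
                    continue
                side[a], side[b] = 1, 0       # tentative exchange
                after = cut_size(edges, side)
                side[a], side[b] = 0, 1       # undo it
                if best_after is None or after < best_after:
                    best_swap, best_after = (a, b), after
        if best_swap is None:
            break
        a, b = best_swap
        side[a], side[b] = 1, 0               # commit, even if the cut worsens
        locked.update((a, b))                 # each node is swapped only once
        if best_after < best_cut:
            best_side, best_cut = dict(side), best_after
    return best_side, best_cut


# Hypothetical graph: two triangles {1,2,3} and {4,5,6} joined by edge (3,4).
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
start = {1: 0, 2: 0, 4: 0, 3: 1, 5: 1, 6: 1}

best_side, best_cut = kl_pass(nodes, edges, start)
print(cut_size(edges, start), best_cut)
```

Committing an exchange even when it worsens the cut lets the pass climb out of a local minimum; only the best intermediate partition is returned.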
15
K-way partitioning: techniques
Extending 2-way partitioning: recursively or Kernighan-Lin based
K-way partitioning: minimize the global cut, balance the part sizes
Multi-level partitioning (METIS)
16
Multi-level partitioning: hMETIS
Multi-level partitioning
Clustering: grouping nodes
Initial partition onto the clustered graph
Unclustering
Partitioning after unclustering, after refinement
17
Multi-level partitioning: logic synthesis
Circuit partitioning
  nodes: logic gates
  edges: connections
Create and optimize subsystems
18
Partition-based logic synthesis: data partitioning
Motivations: divide and conquer
  Simple, multiple algorithms
  Problem size, runtimes
  Parallelism
Quality?
  Synthesis, even with multiple algorithms, is rarely optimal
  Better heuristics
  Limit information loss
19
Partition-based logic synthesis: network partitioning
Avoid maximum loss. 2-way partitioned network
Nodes assigned to parts 1 and 2 while minimizing the cut and balancing the parts
20
Partition-based logic synthesis: network partitioning
Avoid information loss. Primary I/O generated
21
Partition-based logic synthesis: quality loss because of information loss
Information loss: A and B in 2 distinct parts
Nodes are disconnected by the partitioning
22
Partition-based logic synthesis: load balancing
Each part must lead to the same computing load
Evaluate synthesis algorithm runtimes a priori
[Figure: two time diagrams of per-part runtimes]
  accumulated time: 125, speedup: 125/100 = 1.25
  accumulated time: 125, speedup: 125/35 = 3.57
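The two speedups above follow from dividing the accumulated (single-processor) time by the runtime of the slowest part. The sketch below reproduces that arithmetic; the per-part runtimes are hypothetical values chosen only so that they sum to 125 with maxima of 100 and 35, since the slide gives just the totals.

```python
def speedup(part_times):
    """Accumulated sequential time divided by the slowest part's time."""
    accumulated = sum(part_times)  # all parts run one after another
    parallel = max(part_times)     # parts run concurrently, slowest one dominates
    return accumulated / parallel


unbalanced = [100, 10, 10, 5]  # hypothetical: one part carries most of the load
balanced = [35, 30, 30, 30]    # hypothetical: similar load on every part

print(round(speedup(unbalanced), 2))
print(round(speedup(balanced), 2))
```

With four parts the best possible value is 4, reached only when all parts take exactly the same time, which is why the partitioner needs an a priori runtime estimate.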
23
Partition-based logic synthesis: load balancing
Depends on the synthesis algorithm and on the network structure
Algorithm                (a)     (b)
Boolean simplification   O(n)    O(1)
Technology mapping       O(1)    O(n)
24
Partition-based logic synthesis: results
Depends on the nature of the algorithms
  Local. Ex: k-feasible network decomposition
  Global. Ex: delay optimization
    Critical path optimization
Evaluation criteria
  Synthesis runtimes and speedups
  Quality
LUT-based FPGA technology mapping tools
  Mis-PGA
  Flowmap-d
25
Partition-based logic synthesis: Mis-PGA
Area optimization
  #LUTs (CLBs = 2 LUTs in xc4100 series)
  Local and global decompositions (And/Or, Roth/Karp, kernels, …)
  Global Boolean simplifications (reinjections, substitution, ...)
  Exact or heuristic covering (BCP)
Long runtimes
  Automatic substitution of exact algorithms by heuristics for large problems
  Impact on synthesis runtimes
26
Partition-based logic synthesis: Mis-PGA quality
Area optimization. #CLBs
12 circuits from the LGSYNTH'91 benchmark
#LUTs (CLBs = 2 LUTs in xc4100 series FPGA)
Quality loss compared with global synthesis without partitioning
[Figure: loss (%) in #CLBs versus # partitions]
27
Partition-based logic synthesis: Mis-PGA runtimes
Cumulated runtime on a single processor
Speedup
[Figure: runtimes (sec) and speedup versus # partitions]
28
Partition-based logic synthesis: Flowmap-d quality
Example: Flowmap-d, delay optimization
Critical paths (U) or nominal delay (N, congestion)
[Figure: loss (%) versus parts]
29
Partition-based logic synthesis: Flowmap-d runtimes
Flowmap-d: O(n^2)
[Figure: time and speedup versus parts (procs), with mean]
30
Partition-based logic synthesis: Flowmap-d results
Quality
  Loss (25%) with unit delay model (critical path)
  Gain (10%) with nominal delay model (better optimization with heuristics, since the partitioning process exhibits congested areas)
Speedup: superlinear for large-scale designs
Long runtimes are required to absorb the parallelism overhead
31
QoS in Home Network for VBR
User demand: high Quality of Service for video broadcast
Affordable, well-known, closed network environment
Stream priority according to other network usages
Bandwidth reservation: guaranteed QoS (no delay)
[Figure: home network with gateway, STB_1, Wi-Fi, links L_1, L_2, L_3 and devices tablet_1, connectedTV_1, PC_1, PC_2, PC_3, console_1]
32
Bandwidth reservation
VBR video encoding case:
  Allocate average rate? Quality loss
  Allocate peak rate? Bandwidth loss
  Allocate exact rate? Impracticable: hard constraints
Tradeoff between peak and exact reservation
[Figure: throughput versus time (two plots)]
33
Variable Bitrate Hull
Series of bitrates (amount of data per time slot): b_i for each time slot i (e.g. 1-sec time slots)
r_i: reservation of network bandwidth for each time slot i
For all i, r_i > b_i: reservation hull that ensures QoS
[Figure: bitrates and reservations r1 to r4, throughputs versus time]
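The hull condition r_i > b_i can be checked slot by slot, as in this sketch. The piecewise-constant representation of the reservations as (duration, rate) segments, the function names and all numeric values are assumptions introduced for the example.

```python
def expand(segments):
    """Turn (duration_in_slots, rate) segments into one reservation per slot."""
    return [rate for duration, rate in segments for _ in range(duration)]


def is_hull(bitrates, reservations):
    """True if the reservation is strictly above the bitrate in every slot."""
    return len(bitrates) == len(reservations) and all(
        r > b for b, r in zip(bitrates, reservations)
    )


# Hypothetical bitrates b_i for six 1-sec slots, and two candidate policies.
bitrates = [3, 5, 4, 9, 8, 2]
good = expand([(3, 6), (2, 10), (1, 3)])  # three successive reservation levels
bad = expand([(3, 5), (2, 10), (1, 3)])   # first level only equals the peak 5

print(is_hull(bitrates, good))
print(is_hull(bitrates, bad))
```

The second policy fails because the slide's condition is strict: a reservation equal to the bitrate in some slot is not counted as a hull here.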
34
Constraints on reservation policy
Two aspects taken into account. Reservation consists in configuring network resources
M: bound on the number of successive different reservations r_i
P: minimal time between 2 reconfigurations (i.e. minimal reservation duration)
[Figure: throughputs, reservations r1 to r4 and lost bandwidth versus time]
35
Graph and optimization goal
[Figure: throughputs, reservations r1 to r4 and lost bandwidth versus time; the matching path from start to end in the graph has total cost 301]
How to minimize the total cost?
36
Graph paths
[Figure: throughputs, reservations and lost bandwidth versus time; the path from start to end using r1 to r4 has total cost 301; alternatives: reservation r'1 instead of r1 (cost gain), or a path with total cost 255]
37
Graph building
[Figure: nodes t0 (start), t1, t2, ..., tn+1 (end); edges weighted cost0→1, cost1→2, cost0→2, cost0→i, cost1→i]
One node per time slot
For j > i, an edge t_i → t_j: configuration at time t_i and reconfiguration at t_j
Weights correspond to overcosts
An extra node at the end
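The graph described above can be built as in the following sketch. The slides do not define the overcost precisely, so this example assumes that a configuration held from slot i to slot j reserves the peak bitrate of those slots, and that the edge weight is the bandwidth reserved but not used. Node indices run from 0 to n, with n as the extra end node; these conventions and the sample bitrates are assumptions.

```python
def build_graph(bitrates):
    """Return {(i, j): overcost} for every pair i < j over nodes 0..n."""
    n = len(bitrates)
    edges = {}
    for i in range(n):
        peak = 0   # highest bitrate among slots i..j-1
        total = 0  # data actually sent during slots i..j-1
        for j in range(i + 1, n + 1):
            peak = max(peak, bitrates[j - 1])
            total += bitrates[j - 1]
            # Reserving `peak` for (j - i) slots wastes this much bandwidth.
            edges[(i, j)] = peak * (j - i) - total
    return edges


edges = build_graph([3, 5, 4])  # hypothetical bitrates for three slots
for (i, j), cost in sorted(edges.items()):
    print(f"t{i} -> t{j}: {cost}")
```

Under this cost model every single-slot edge has weight 0, so the problem only becomes interesting once the constraints M and P restrict which paths are allowed.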
38
Best solution computation
Best solution: minimal total overcost
  Shortest path
Constraints on bandwidth allocation
  P: minimal time between 2 reconfigurations (i.e. minimal configuration duration)
  M: bound on the number of successive different configurations c_i
Bellman on DAG
  Remove edges i → j s.t. j - i < P
  M first steps of the Ford-Bellman algorithm
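One way to read these two steps is sketched below: edges shorter than P are filtered out, then M rounds of Bellman-Ford relaxation are run, where round m only uses distances from round m - 1 so that a path never has more than M edges. The function name, the tabular formulation and the small weighted graph are assumptions for illustration; the slides do not give the authors' implementation.

```python
import math


def bounded_shortest_path(n_nodes, edges, M, P):
    """Cheapest path from node 0 to node n_nodes - 1 with at most M edges,
    each edge (i, j) spanning at least P slots. Returns (cost, path)."""
    usable = [(i, j, w) for (i, j), w in edges.items() if j - i >= P]
    end = n_nodes - 1
    # dist[m][v]: cheapest cost to reach v from node 0 with exactly m edges.
    dist = [[math.inf] * n_nodes for _ in range(M + 1)]
    pred = [[None] * n_nodes for _ in range(M + 1)]
    dist[0][0] = 0
    for m in range(1, M + 1):                 # M relaxation rounds
        for i, j, w in usable:
            if dist[m - 1][i] + w < dist[m][j]:
                dist[m][j] = dist[m - 1][i] + w
                pred[m][j] = i
    best_m = min(range(M + 1), key=lambda m: dist[m][end])
    if math.isinf(dist[best_m][end]):
        return math.inf, []                   # no path satisfies M and P
    path, v = [end], end
    for m in range(best_m, 0, -1):            # walk predecessors back to node 0
        v = pred[m][v]
        path.append(v)
    return dist[best_m][end], path[::-1]


# Hypothetical overcosts on a 4-node graph (3 time slots plus the end node).
edges = {(0, 1): 0, (0, 2): 2, (0, 3): 3, (1, 2): 0, (1, 3): 1, (2, 3): 0}

print(bounded_shortest_path(4, edges, M=3, P=1))
print(bounded_shortest_path(4, edges, M=2, P=1))
print(bounded_shortest_path(4, edges, M=3, P=2))
```

Each node on the returned path is a reconfiguration time, so tightening M or raising P trades fewer reconfigurations for more lost bandwidth.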
39
Simulation results
NS2 simulation tool
Hierarchical token bucket (HTB)
VBR: series of bitrates sources
Delay and buffer size measurements
[Figure (a): fixed size HTB; delay (s) and buffer size (kB) versus bandwidth allocation]
[Figure (b): M=149; delay (s) and buffer size (kB) versus average bandwidth allocation]
40
Implementation
Server (Linux Ubuntu system)
  Streaming component (VLC)
  Network interface
  Streaming observer
  Reservation policy
  Hull values: times, configurations
  stream
Client (Beagleboard)
  Streaming client (mplayer)
Raised errors
                 # errors    # errors / # frames
Reference        8           5.6x10^-4
1429 kbits/s     11          7.7x10^-4
640 kbits/s      80          56.0x10^-4
Hull P5          17          12.0x10^-4