© 2008 Altera Corporation
High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz and Ketan Padalia
© 2008 Altera Corporation - Public
Altera, Stratix, Cyclone, MAX, HardCopy, Nios, Quartus, and MegaCore are trademarks of Altera Corporation
FPGA Size vs CPU Performance
[Chart: 1998 to 2007 on the x axis; values relative to 1999 on the y axis (0 to 35). Series: SPEC CINT2000; largest device in Quartus II]
CPUs: 7x faster
FPGAs: 33x bigger
Our Contributions
Parallelized existing high-quality placer
  Routability, timing and power driven
  Deterministic
  Good speedups with identical quality
Present results on multicore PCs
  Identify and quantify bottlenecks
Non-Determinism
Extremely difficult to test for correctness
Extremely difficult to reproduce problems
Very unpopular with customers
  Some outright refuse to use ND algorithms
  All customers value reproducible results
We show that making our algorithms deterministic has a relatively small impact on performance.
Serial Equivalency
Any number of cores returns same result
  Including a single core (hence “serial”)
Easy if algorithm is already deterministic
  Even easier to test than determinism
Serial equivalency has no additional overhead over determinism in our algorithms.
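One way to picture a serial-equivalency test is to run the same job with several worker counts and compare every result against the single-worker run. The sketch below is only an illustration and is not the placer from this talk: `toy_job`, `run`, `is_serially_equivalent` and the per-task seeding scheme are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def toy_job(task_id: int) -> int:
    """Hypothetical unit of work: randomness is seeded per task, not per thread."""
    rng = random.Random(task_id)  # the seed depends only on the task index
    return rng.randrange(1_000_000)


def run(num_workers: int, num_tasks: int = 64) -> list:
    """Run all tasks on num_workers threads and collect results in task order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in submission order, whatever order
        # the worker threads happen to finish in.
        return list(pool.map(toy_job, range(num_tasks)))


def is_serially_equivalent(worker_counts=(1, 2, 4)) -> bool:
    """True if every worker count reproduces the single-worker result."""
    reference = run(1)
    return all(run(n) == reference for n in worker_counts)
```

Because the check compares against the one-core result, it covers determinism as a side effect, which is one reading of why serial equivalency is described as even easier to test.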
Algorithm Runtimes
[Chart of overall runtime, split into: Placer Algorithms; Other Fitter (e.g. route); Other CAD (e.g. map)]
The placer algorithms in this paper are a significant portion of overall runtime, but are not a majority.
Agenda
Part I: Pipelined Moves
Part II: Parallel Moves
Algorithm Pseudo-Code
move = propose(place);          // Proposal
cost = evaluate(place, move);   // Evaluation
if (cost < 0) {
    accept(place, move);
}
Effect of Pipelining Proposals
move = propose(place);          // Proposal: 40% of time
cost = evaluate(place, move);   // Evaluation: 60% of time
if (cost < 0) {
    accept(place, move);
}
Expected speedup: 1/0.6 ≈ 1.7x
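The 1.7x figure follows from the slower stage limiting pipeline throughput. As a worked check, here is a small sketch that assumes an ideal two-stage pipeline with no communication or stall cost; the function name `ideal_pipeline_speedup` is made up for this example.

```python
def ideal_pipeline_speedup(stage_fractions):
    """Upper bound on speedup when each stage runs on its own core.

    stage_fractions are the shares of serial runtime per stage and must sum to 1.
    Throughput is limited by the slowest stage, so speedup = 1 / max(fraction).
    """
    if abs(sum(stage_fractions) - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return 1.0 / max(stage_fractions)


# Proposal takes 40% of the time and evaluation 60%, as on the slide.
speedup = ideal_pipeline_speedup([0.4, 0.6])
print(f"{speedup:.2f}x")  # 1/0.6, printed as 1.67x
```

Under this model, adding more cores to a two-stage pipeline cannot raise the bound, since the 60% evaluation stage still runs on a single core.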
Simplistic Implementation
[Diagram: Core 0 runs Proposal; Core 1 runs Evaluation]
Simplistic Implementation
[Diagram: Proposal (C0) working on Move 1; Evaluation (C1) working on Move 0]
In this example, C1 has just started evaluating a move, while C0 has just started proposing the next one.
Since proposals are faster than evaluations (at least in theory), C0 will finish before C1. It then stalls until C1 is ready to take the move.
When C1 is ready, it grabs the proposed move and starts evaluating it, and C0 can begin proposing the next move.
Simplistic Implementation
[Diagram: Proposal (C0) working on Move 2; Evaluation (C1) working on Move 1]
Naïve Pipelined Problems
1. Proposal/evaluation runtime variability
   If evaluation is faster than proposal, then the stall happens on the critical path
2. Large penalty for stalling
   After C0 stalls, it takes almost as long to wake it up as it does to propose the move in the first place!
Better Implementation
[Diagram: Proposal (C0) and Evaluation (C1)]
Better Implementation
[Diagram: Proposal (C0) places moves into an Evaluation Queue, from which Evaluation (C1) takes them]
Better Implementation
[Diagram: Proposal (C0), an Evaluation Queue holding several moves, and Evaluation (C1)]
The queue buffers proposal/evaluation runtime variability and “hides” the stalls on C0 from the critical path on C1.
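The buffering idea can be sketched with a bounded queue between a proposer thread and an evaluator thread. This is a minimal illustration using Python's standard `queue` and `threading` modules, not the implementation from the talk; `run_pipeline`, `queue_depth` and the integer move ids are assumptions for the example, and real evaluation work is replaced by an append.

```python
import queue
import threading


def run_pipeline(num_moves: int, queue_depth: int = 8) -> list:
    """Two-stage pipeline: a proposer thread feeds an evaluator through a bounded queue."""
    evaluation_queue = queue.Queue(maxsize=queue_depth)
    evaluated = []

    def proposer():
        for move_id in range(num_moves):
            # put() blocks only when the queue is full, so the proposer
            # can run ahead of the evaluator by up to queue_depth moves.
            evaluation_queue.put(move_id)
        evaluation_queue.put(None)  # sentinel: no more moves

    def evaluator():
        while True:
            # get() blocks only when the queue is empty.
            move_id = evaluation_queue.get()
            if move_id is None:
                break
            evaluated.append(move_id)  # stand-in for real evaluation work

    threads = [threading.Thread(target=proposer), threading.Thread(target=evaluator)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return evaluated
```

With a depth of 1 this behaves much like the simplistic hand-off; a deeper queue lets the proposer absorb variation in stage runtimes before either side has to block. The sketch shows the structure only and says nothing about the speedups reported here.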
Proposal State Updates
[Diagram: Proposal (C0), an Evaluation Queue holding several moves, and Evaluation (C1), plus an Accepted Moves Queue]
Proposal Example
[Diagram: Block 1 and Block 2, with Move 1 and Move 5]
In this example, we propose a move for block 1 to an empty location.
Since we don’t know if it will ultimately be accepted by the evaluation stage, we assume (for the time being) that it will be rejected.
Some time later, if we haven’t heard back from the evaluation stage, it might be reasonable to propose a move for another block to the same “empty” location.
Evaluation Example
[Diagram: Block 1 and Block 2, with Move 1 accepted and Move 5 in question]
In the meantime, however, the evaluation stage has accepted Move 1; it just wasn’t able to tell the proposal stage about it in time (race condition!).
But the later move to the no-longer-empty location is already in the pipe. It can no longer be performed as proposed; what should we do about this?
Resolving Collisions
When two moves have collided, we can:
  Abandon the later moves (non-deterministic)
  Attempt to “fix” colliding moves
We fix the colliding move by reproposing it
  In this example, Move 5 becomes a swap
  This gives the same move as in the serial flow
Therefore, the placer is serially equivalent
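The collision fix can be illustrated with a toy placement. The sketch below assumes a placement stored as a dictionary from block name to location, and it leaves out cost evaluation entirely; `repropose`, `apply_move`, the block names and the location labels are hypothetical and chosen only to mirror the Move 1 / Move 5 example.

```python
def repropose(placement, block, target):
    """Rebuild a move against the current placement.

    placement maps block name -> location. Returns a list of
    (block, new_location) assignments: one entry for a move to an empty
    location, two entries for a swap with the current occupant.
    """
    source = placement[block]
    occupant = next((b for b, loc in placement.items() if loc == target), None)
    if occupant is None or occupant == block:
        return [(block, target)]
    return [(block, target), (occupant, source)]


def apply_move(placement, assignments):
    """Commit a list of (block, new_location) assignments."""
    for block, location in assignments:
        placement[block] = location


# Hypothetical scenario from the slides: location "E" starts empty.
placement = {"block1": "A", "block2": "B"}

# Move 1 (block1 -> E) is accepted before the proposer hears about it.
apply_move(placement, repropose(placement, "block1", "E"))

# Move 5 (block2 -> E) was proposed while E still looked empty.
# Reproposing it against the up-to-date placement turns it into a swap.
move5 = repropose(placement, "block2", "E")
apply_move(placement, move5)
```

Since the reproposed Move 5 is built from the same placement a serial run would have seen at that point, it comes out as the same swap, which is the property the serial-equivalence argument relies on.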
Platforms
[Diagram of the three test machines]
Netburst x2 (Pentium 4): C0 and C1 with separate caches ($0, $1); one memory controller; 2 GB. Configuration: nb
Dual-core Opteron x2: C0 to C3 with separate caches ($0 to $3); two memory controllers with 4 GB each. Configurations: opt-dc, opt-dp, opt-mc
Core 2 Duo x2: C0 to C3 with one shared cache per pair ($0/1, $2/3); one memory controller; 16 GB. Configurations: c2-dc, c2-dp, c2-mc
To test a two-core algorithm on a four-core machine, we can either use two cores on the same package (“dc” = “dual core”) or we can use one core on each package (“dp” = “dual processor”). This decision has a large influence on the performance of the algorithm.
Pipelined Results - 11 Circuits
[Bar chart: parallel speedup (y axis, 1 to 1.7) for nb, opt-dc, opt-dp, c2-dc, c2-dp]
The results are far lower than the 1.7x ideal. Note that the best and worst results are both on the same platform (Core 2). Where is the runtime going on c2-dp?
Algorithm Components – c2-dp
[Stacked bar chart: microseconds (y axis, 0 to 60) for serial, serial with timers, pipelined equivalent, pipelined with timers, pipelined. Components: evaluation, infrastructure, reproposals, stall, proposal, all]
Pipelined equivalent: this is the pipelined algorithm, but with both stages taking turns on the same core.
With timers: this uses high-resolution timers to show the runtime of each stage.
For the pipelined algorithm, we ignore the proposal time since it’s “hidden.”
But why has the evaluation time gotten so big?
Explaining the Results
Reproposals, stalls are very fast
Memory is bottleneck on 4/5 platforms
  Exception: c2-dc has large, shared cache
Many, many more details are in the paper
Pipelined Moves Summary
Poor inherent scalability, memory usage
Reasonable speedups for amount of work
  Far less work than fully parallel moves
Agenda
Part I: Pipelined Moves
Part II: Parallel Moves
Stages with Thread-Safe Code
move = propose(place);          // Processing (propose and evaluate): 99% of time
cost = evaluate(place, move);
if (cost < 0) {
    accept(place, move);        // Finalization (resolve collisions and commit): 1% of time
}
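For a rough sense of what the 99% / 1% split allows, Amdahl's law gives an upper bound on speedup. This is an illustrative estimate and not a result from the talk: it assumes finalization is fully serialized, processing scales perfectly, and memory and synchronization costs are zero. The name `amdahl_speedup` is chosen for this sketch.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup = 1 / (serial + parallel / cores)."""
    if not 0.0 <= parallel_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cores >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Processing is 99% of the time and finalization 1%, as on the slide.
for cores in (1, 2, 4, 8, 16):
    print(cores, f"{amdahl_speedup(0.99, cores):.2f}")
```

Under these assumptions the bound stays close to the core count for small machines, in contrast to the fixed 1.7x ceiling of the two-stage pipeline.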
[Diagram: Core 0 to Core 3 each run Processing (propose and evaluate) as Process (C0) to Process (C3); a Queue leads to Finalization (resolve collisions and commit)]
[Diagram: Process (C0) to Process (C3), a Queue of moves (Move0 to Move4), and Finalize (C0)]
All four cores begin processing moves at the same time. Since finalizing moves is so fast, it would be a waste to devote a core to that task. Instead, all cores have the ability to finalize moves at the appropriate time, as this example will show.
If a move finishes out of order, it sits in the priority queue until the earlier moves are finished. Meanwhile, the core that processed it goes on to the next move. It does not stall and wait for any other cores.
The priority queue now has two moves ready to be finalized.
The core that processed this last move now becomes responsible for finalizing all the moves in the queue. Note that it did not have to wait for any other core; it knows that the move it inserted went to the front of the queue.
Supervisor (2)
[Diagram: Queue and Finalize stage with Finalize (C0) active; Process (C1), Process (C2) and Process (C3) still processing; moves Move0 to Move4]
Supervisor (3)
[Diagram: Queue and Finalize stage with Finalize (C2) active; Process (C0) to Process (C3) processing; moves Move2 to Move7]
Once a core has finished finalizing moves, it immediately goes back to processing them. The algorithm continues, with any core being able to finalize moves whenever it’s appropriate.
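The walkthrough above can be sketched as workers that process moves concurrently and push results into a priority queue, with whichever worker inserts the oldest outstanding move finalizing everything that is ready. This is a simplified illustration under several assumptions: moves are plain integers, `process` is a caller-supplied thread-safe function, finalization is just an append to a log, and a single lock guards both the counter and the heap. The names `run_parallel_moves`, `ready` and `finalized` are hypothetical.

```python
import heapq
import threading


def run_parallel_moves(num_moves: int, num_workers: int, process):
    """Process moves on several threads but finalize them strictly in move order."""
    lock = threading.Lock()
    ready = []  # min-heap of (move_id, result) awaiting finalization
    state = {"next_to_claim": 0, "next_to_finalize": 0}
    finalized = []  # commit log, in the order moves were finalized

    def worker():
        while True:
            with lock:
                move_id = state["next_to_claim"]
                if move_id >= num_moves:
                    return
                state["next_to_claim"] += 1
            result = process(move_id)  # thread-safe work, done outside the lock
            with lock:
                heapq.heappush(ready, (move_id, result))
                # A worker that finds the oldest outstanding move at the front of
                # the heap finalizes every consecutive ready move, then resumes.
                while ready and ready[0][0] == state["next_to_finalize"]:
                    finalized.append(heapq.heappop(ready))
                    state["next_to_finalize"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finalized
```

A move that finishes early simply waits in the heap while its worker claims the next move, so no worker blocks on another. The commit log comes out in move order for any worker count, which is the behaviour a serial-equivalence test would check.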
Parallel Results - 11 Circuits
[Bar chart: parallel speedup (y axis, 1.0 to 2.0) for nb, opt-dc, opt-dp, c2-dc, c2-dp, opt-mc, c2-mc. Series: Pipelined; Parallel Moves - 2 Cores; Parallel Moves - 4 Cores]
Algorithm Components – c2-mc
[Stacked bar chart: microseconds (y axis, 0 to 100) for serial, serial + timers, parallel equiv., parallel per move, parallel est., parallel. Components: process, infrastructure, repropose/reevaluate, stall, all]
Parallel Moves Summary
Memory still bottleneck
  Especially at 4 cores
  But less than in pipelined
Much more scalable (N instead of 1.7x)
Conclusions
Significant parallelism in existing placer
  Believe sufficient parallelism for 8-16 cores
  More independent moves could scale further
Determinism has a relatively low cost
Memory is largest parallel bottleneck
  Better hardware will help
  A first-order concern for algorithm developers