FSM implementations, VHDL synthesis basics

Erno Salminen, 2013

TKT-1212 Digitaalijärjestelmien Suunnittelu (Digital Systems Design)

[Figure: Moore machine block diagram with the blocks Input, Current state, Next state, and Output]

Acknowledgements
Prof. Pong P. Chu provided the "official" slides for the book, which is gratefully acknowledged. See also: http://academic.csuohio.edu/chu_p/
Most slides were made by Ari Kulmala and other previous lecturers (Teemu Pitkänen, Konsta Punkka, Mikko Alho…)
M. Perkows, ECE 590: Digital System Design Using Hardware Description Languages, Portland State University, June 2008, http://web.cecs.pdx.edu/~mperkows/CLASS_VHDL_99/June2008/

Erno Salminen, TUT, 2012

Logic synthesis - From algorithm to circuit

[Figure: abstraction levels Algorithm, Architecture, Register level, Gate level. Behavioral synthesis and logic synthesis move the design down the levels; technology dependence grows from 0% through 10% and 20% to 100%. The algorithm-level example is a loop over i = 0..15 computing sum = sum + data[i]; the lower levels show Data[0]..Data[15], MEM, i, sum, and clock.]

Adapted from [M. Perkows, Class VHDL 99, June 2008]

FSM


Finite State Machines

All the previous teachings are still valid; only the description style changes.

[Figure: FSM state diagram. States and outputs: Stop (y=z1), Play (y=z2), Play x 2 (y=z3), Rewplay x 2 (y=z4), Next_track (y=z5), Prev_track (y=z6). Transition labels: in=pl, in=st, in=x2, in=-x2, in=next, in=prev, in=others, always.]

Finite-state machine (FSM)
The application defines the set of correct state machines
Two basic flavors: Mealy and Moore
  In both cases, one must select whether to include the output registers or not
Moreover, you decide the VHDL presentation of the FSM
  Description style: how many processes
  Encoding of states, if not automated in synthesis

[Figure: block diagrams of Mealy and Moore machines (Input, Current state, Next state, Output)]

FSM Implementation in VHDL
General form:
  We define our own type for the state machine states
  ALWAYS use an enumeration type for state machine states
    Synthesis software, e.g. Quartus II, does not recognize the state machine otherwise

architecture rtl of traffic_light is
  type states_type is (red, yellow, green);  -- enumeration type for states
  -- init state explicitly defined in reset, not here
  signal ctrl_r : states_type;               -- ctrl_r is the current state register
  ...
begin -- rtl

At least 5 implementation styles
1. 1 sequential process
2. 2 processes
  a) Seq: curr. state registers and output, Comb: next state logic
  b) Seq: curr. state, Comb: next state, output
  c) Seq: next and curr. state, Comb: output
3. 3 processes (Seq: curr. state, Comb: output, Comb: next state logic separated)

Coding style: 1seg-Moore/Reg-Mealy
The 1-segment style uses just 1 sequential process

sync_all : process (clk, rst_n)
begin -- process sync_all
  if rst_n = '0' then
    <INIT STATE and OUTPUT OF THE FSM>
  elsif clk'event and clk = '1' then
    <define new value of curr state>
    <define output. These outputs become registers!>
  end if;
end process sync_all;
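The 1-segment template can be filled in, for example, as a registered Mealy traffic light. This is only a minimal sketch: the port names (req_in, r_y_g_out), the default of yellow_length_g, and the simplified transition rules (red -> yellow -> green -> red) are assumptions made for illustration and are not taken from the course's traffic_light.vhd.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity traffic_light is
  generic (
    yellow_length_g : integer := 10  -- assumed: cycles to keep yellow on
  );
  port (
    clk       : in  std_logic;
    rst_n     : in  std_logic;
    req_in    : in  std_logic;                    -- assumed request input
    r_y_g_out : out std_logic_vector(2 downto 0)  -- red, yellow, green
  );
end traffic_light;

architecture rtl of traffic_light is
  type states_type is (red, yellow, green);
  signal ctrl_r    : states_type;
  signal counter_r : integer range 0 to yellow_length_g-1;
begin

  -- Single sequential process: state, counter and outputs all become registers
  sync_all : process (clk, rst_n)
  begin
    if rst_n = '0' then
      ctrl_r    <= red;
      counter_r <= 0;
      r_y_g_out <= "100";
    elsif clk'event and clk = '1' then
      case ctrl_r is
        when red =>
          if req_in = '1' then
            ctrl_r    <= yellow;
            counter_r <= 0;
            r_y_g_out <= "010";  -- output changes together with the state
          end if;
        when yellow =>
          if counter_r = yellow_length_g-1 then
            ctrl_r    <= green;
            r_y_g_out <= "001";
          else
            counter_r <= counter_r + 1;
          end if;
        when green =>
          if req_in = '1' then
            ctrl_r    <= red;
            r_y_g_out <= "100";
          end if;
      end case;
    end if;
  end process sync_all;

end rtl;
```

Because the output is assigned in the same branch as the state change, the output register and ctrl_r always update on the same clock edge.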

Coding style: 2seg-Moore
The Moore two-segment coding style uses 1 seq. process and 1 comb. process (or concurrent assignments)

sync_ps : process (clk, rst_n)
begin -- process sync_ps
  if rst_n = '0' then
    <INIT STATE OF THE FSM>
  elsif clk'event and clk = '1' then
    <Synchronous part of the FSM; assign next state to curr state>
  end if;
end process sync_ps;

comb_ns_output : process (ctrl_r, input)
begin -- process comb_ns_output
  <Combinational part; define next state, define output>
end process comb_ns_output;

end rtl;
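A possible concrete instance of the two-segment Moore template is sketched below. The entity name, ports, and transition rules are assumptions chosen for illustration (the same simplified red -> yellow -> green -> red cycle), not the course's reference code. Note that the yellow timer also needs a registered value and a combinational next value in this style.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity traffic_light_moore is
  generic (
    yellow_length_g : integer := 10  -- assumed: cycles to keep yellow on
  );
  port (
    clk       : in  std_logic;
    rst_n     : in  std_logic;
    req_in    : in  std_logic;                    -- assumed request input
    r_y_g_out : out std_logic_vector(2 downto 0)  -- red, yellow, green
  );
end traffic_light_moore;

architecture rtl of traffic_light_moore is
  type states_type is (red, yellow, green);
  signal ctrl_r, next_ctrl       : states_type;
  signal counter_r, next_counter : integer range 0 to yellow_length_g-1;
begin

  -- Sequential segment: only the state and counter registers
  sync_ps : process (clk, rst_n)
  begin
    if rst_n = '0' then
      ctrl_r    <= red;
      counter_r <= 0;
    elsif clk'event and clk = '1' then
      ctrl_r    <= next_ctrl;
      counter_r <= next_counter;
    end if;
  end process sync_ps;

  -- Combinational segment: next state and Moore output (depends on ctrl_r only)
  comb_ns_output : process (ctrl_r, counter_r, req_in)
  begin
    -- Default assignments prevent unintended latches
    next_ctrl    <= ctrl_r;
    next_counter <= 0;
    case ctrl_r is
      when red =>
        r_y_g_out <= "100";
        if req_in = '1' then
          next_ctrl <= yellow;
        end if;
      when yellow =>
        r_y_g_out <= "010";
        if counter_r = yellow_length_g-1 then
          next_ctrl <= green;
        else
          next_counter <= counter_r + 1;
        end if;
      when green =>
        r_y_g_out <= "001";
        if req_in = '1' then
          next_ctrl <= red;
        end if;
    end case;
  end process comb_ns_output;

end rtl;
```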

Coding style: 2seg-Mealy
The Mealy two-segment coding style is like the previous one

sync_ps : process (clk, rst_n)
begin -- process sync_ps
  if rst_n = '0' then
    <INIT STATE OF THE FSM>
  elsif clk'event and clk = '1' then
    <Synchronous part of the FSM; assign next state to curr state>
  end if;
end process sync_ps;

comb_ns_output : process (ctrl_r, input)
begin -- process comb_ns_output
  <Combinational part; define next state, define output
   (Looks the same as Moore, but here the input is also considered for the output)>
end process comb_ns_output;

end rtl;

Coding style: 3seg_Moore/Mealy
The 3-segment style separates next state and output logic into their own comb. processes

sync_ps : process (clk, rst_n)
begin -- process sync_ps
  if rst_n = '0' then
    <INIT STATE OF THE FSM>
  elsif clk'event and clk = '1' then
    <curr state assignment>
  end if;
end process sync_ps;

comb_ns : process (ctrl_r, input)
begin -- process comb_ns
  <Combinational part; define next state>
end process comb_ns;

comb_output : process (ctrl_r, input)
begin -- process comb_output
  <Combinational part; define output>
end process comb_output;

Examples: Traffic light
FSM implemented with various styles
Examples shown as VHDL code at the bottom of the page http://www.tkt.cs.tut.fi/kurssit/1212/K11/luennot.html
They also show the usage of a counter in a state machine
  Acts as a timer for showing the yellow light
Output latency is larger for Moore and registered Mealy
However, all the implementations keep the yellow light on for the same amount of time
  Sometimes the designer must modify timer limit values (e.g. if counter_r = limit-1 instead of counter_r = limit)

Timing of traffic light FSMs
All versions show the yellow light for the same amount of time (r_y_g = "010")
However, there is a 1-cycle difference in when yellow is ON
  Inside the DUV, state and counter values are aligned differently as well
I. Mealy 2a, 2b, 3 react immediately
  No output reg, only comb. delay from req to r_y_g (= 0 ns in RTL simulation)
II. Mealy 1 needs 1 cycle
  Output reg and curr_state_r change simultaneously
  VHDL code assigns them when the state changes
III. Moore 2 needs 1 cycle
  Updating curr_state_r takes 1 cycle
  Output assigned combinatorially from curr_state_r (= 0 ns in RTL simulation)

[Figure: simulation waveforms of Mealy_1 (II), Mealy_2a, Mealy_2b, Mealy_3 (I), and Moore_2 (III)]

Comparison of implementation styles: Coding style
1-segment:
  Just a sync process
  Automatically inferred output registers
  Simple view to the design, everything in one place
  Safe; a registered Mealy machine is easy to implement with this style
  Recommended (as opposed to the course book!)
2-segment, 3-segment:
  Only way to implement unregistered outputs to FSMs
  Modular
  Long ago, synthesis tools did not recognize 1-segment FSMs correctly
    Not the case anymore
  Recommended style in many books, partially because of those limitations of the old tools
  Useful for quite simple control machines that do not include e.g. delay counters
  Complex state machines are cumbersome to read
    The code does not proceed smoothly; you have to jump around the code
    The same conditions may be repeated in many processes

Quartus II design flow after you've simulated and verified the design

[Figure: flow chart; the annotations of the steps are:]
  Generic gate-level representation, just gates and flip-flops
  Places and routes the logic into a device: logic elements, macros and routing cells
  Converts the post-fit netlist into an FPGA programming file (.sof)
  Analyzes and validates the timing performance of all logic in a design
  Run on FPGA

Examples: extracted state diagram, Tool A

[Figure: state diagram extracted by the synthesis tool]

Note the encoding: it's not basic binary nor one-hot.

RTL Synthesis result: Tool A
Registered Mealy machine, traffic light VHD

[Figure: RTL schematic with annotations: register for output bit 2; registers for output bits 1..0; state register; combinatorial output logic; next state logic, incl. counter for showing the yellow light long enough; comb. path from input to output logic]

Technology schematic, Tool A
Logic on the previous slide has been mapped to the FPGA's primitives. Registered Mealy machine, traffic light VHD

[Figure: technology schematic with annotations: single flip-flops; look-up tables, max 6 inputs]

RTL Synthesis result: Tool B
Same VHDL, slightly different result
Registered Mealy machine, traffic light VHD

# Info: [45144]: Extracted FSM in module work.traffic_light(rtl){generic map (n_colors_g => 3 yellow_length_g => 10)}, with state variable = ctrl_r[1:0], async set/reset state(s) = 00 , number of states = 3.
# Info: [45144]: Preserving the original encoding in 3 state FSM
# Info: [45144]: FSM: State encoding table.
# Info: [40000]: FSM: Index Literal Encoding
# Info: [40000]: FSM: 0 00 00
# Info: [40000]: FSM: 1 01 01
# Info: [40000]: FSM: 2 10 10

Note the different state encoding

Technology schematic, Tool B
Registered Mealy machine, traffic light VHD

[Figure: technology schematic with annotations: multi-bit registers; LUTs]

Physical placement on-chip
The traffic_light.vhd placed and routed
Stratix 2S180, 143 000 ALUTs (~LUTs)
Plenty of unused resources...

Implementation area and freq
Note that no strict generalization can be made about the "betterness"
Tool A
  Total ALUTs 15
  ALUTs with register 10
Tool B
  LUTs 16
  Registers 9
The one-register difference is due to the different state encoding
  The state encoding can be explicitly defined or left to the tool to choose (as in this case)

Synthesis of different VHDLs
Functionally equivalent
Timing aspects vary
  Different max frequency
  Only the "Mealy single" has output registers (3 bits)
Coding style has a minor effect here
Readability of the code is just as crucial!

Style                              AREA [LUT]  AREA [reg]  Lines of code
mealy (single)                     16          9           104
mealy (output separated)           13          6           126
mealy_2proc. (out+ns separated)    11          6           125
mealy_3proc                        11          6           150
Moore                              11          6           108

Comparison of implementation styles: Moore and Mealy
The number of processes does not affect HW area and speed deterministically. The differences are mainly in readability
Generally, we want the outputs to be registered
  A traditional Mealy machine is sometimes problematic due to possible long combinational paths or loops
For registered outputs, use a registered Mealy machine
  Outputs are registered, but it has shorter latency than a Moore machine with registered outputs
Otherwise, opt for a Moore machine

Notes on finite state machines
Quite often datapath and control get mixed in the HDL description
Start with a "slow and simple FSM" if the neighbor blocks allow that
  Takes a few extra cycles but has less branching and hence simpler conditions
  Easier to get it working at all
  You can later save a few clock cycles by skipping some state in certain conditions (e.g. adding the red arc wait_ack -> write)
Be careful with the timing of the output register

[Figure: state diagram with states Wait data, Write, Wait ack]

Synthesis observations
Make sure that you are aware of which signals in the shown code have been implemented as registers!
In most cases, use enumeration and let the synthesis tool decide the state encoding
  Not much difference in delay or area in realistic circuits
  One-hot encoding is easiest to debug!
Different tools produce slightly different results even in small designs
  Synthesis tools are heuristic due to the very large design space
  A modest effect (e.g. -10% to +10%) is also achievable by tuning the tool settings
Even a single tool may produce slightly different results on different runs!
  Optimization heuristics utilize randomness
However, no tool can convert a bad design into a good one!

Logic synthesis


Foreword: VHDL and synthesis
The main goal of writing VHDL is to generate a synthesizable description
This lecture presents some practical examples of how to write code that is good for synthesis
The quality of the design is much affected by the coding style
  You must be able to choose structures that synthesize the best

Logic design basics still apply
Modularize the design to components
  Easier to design single components
  Easier to upgrade
  Prefer registered outputs
Use a version control system (e.g. SVN or git)
Thou shalt not mess with clock and reset
  Use only rising-edge clocking. Preferably active-high enable signals
  Asynchronous reset is used only to initialize
    Not part of the functionality
    Hence, you don't force reset from your code
    Use a separate clear signal or similar if needed. That is checked in the edge-sensitive if-branch (lec 12)
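The clock and reset rules above can be illustrated with a small counter. This is a hedged sketch: the entity name and the ports clear_in and enable_in are hypothetical. The asynchronous reset only initializes the register, while the functional clear is checked inside the edge-sensitive branch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity clearable_counter is
  generic (
    width_g : integer := 8
  );
  port (
    clk       : in  std_logic;
    rst_n     : in  std_logic;  -- asynchronous, initialization only
    clear_in  : in  std_logic;  -- synchronous clear, part of the functionality
    enable_in : in  std_logic;  -- active-high enable
    count_out : out std_logic_vector(width_g-1 downto 0)
  );
end clearable_counter;

architecture rtl of clearable_counter is
  signal count_r : unsigned(width_g-1 downto 0);
begin

  count_out <= std_logic_vector(count_r);

  sync : process (clk, rst_n)
  begin
    if rst_n = '0' then
      count_r <= (others => '0');     -- initialization only
    elsif clk'event and clk = '1' then
      if clear_in = '1' then
        count_r <= (others => '0');   -- functional clear on the clock edge
      elsif enable_in = '1' then
        count_r <= count_r + 1;
      end if;
    end if;
  end process sync;

end rtl;
```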

Simplest synthesis example
In this course, we concentrate on RTL synthesis: how HDL is converted into a netlist of basic gates and flip-flops
  Technology mapping, routing and placement are beyond the scope of this course
Example: arith_result <= a + b + c - 1;
  The resulting combinatorial logic is straightforward
  Inclusion of DFF depends on the context (inside a sync process or not)
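To make the "context" point concrete, the sketch below places the same expression both outside and inside a synchronous process. The entity name, port names, and 8-bit widths are assumptions; only the expression a + b + c - 1 comes from the slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity arith_example is
  port (
    clk      : in  std_logic;
    rst_n    : in  std_logic;
    a, b, c  : in  std_logic_vector(7 downto 0);
    comb_out : out std_logic_vector(7 downto 0);  -- no flip-flops
    reg_out  : out std_logic_vector(7 downto 0)   -- 8 flip-flops inferred
  );
end arith_example;

architecture rtl of arith_example is
  signal arith_result : unsigned(7 downto 0);
begin

  -- Concurrent assignment: purely combinatorial adders
  arith_result <= unsigned(a) + unsigned(b) + unsigned(c) - 1;
  comb_out     <= std_logic_vector(arith_result);

  -- Same expression inside a sync process: the result is stored in DFFs
  sync : process (clk, rst_n)
  begin
    if rst_n = '0' then
      reg_out <= (others => '0');
    elsif clk'event and clk = '1' then
      reg_out <= std_logic_vector(unsigned(a) + unsigned(b) + unsigned(c) - 1);
    end if;
  end process sync;

end rtl;
```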

Example: if-else

[Figure: conceptual structure of nested if-clauses in HDL]

[Figure: conceptual hardware realization (when none of the value_expr_x is "ZZ…Z"; high-impedance is covered in Lec 12)]
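Since the figures are not available here, the following sketch shows what a nested if-clause and the corresponding conditional signal assignment could look like. All names (cond1..cond3, a..d, z_if, z_when) are hypothetical. Both forms describe the same priority network: cond1 wins over cond2, which wins over cond3.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity priority_mux is
  port (
    cond1, cond2, cond3 : in  std_logic;
    a, b, c, d          : in  std_logic_vector(7 downto 0);
    z_if                : out std_logic_vector(7 downto 0);
    z_when              : out std_logic_vector(7 downto 0)
  );
end priority_mux;

architecture rtl of priority_mux is
begin

  -- if-elsif chain inside a combinational process
  comb : process (cond1, cond2, cond3, a, b, c, d)
  begin
    if cond1 = '1' then
      z_if <= a;
    elsif cond2 = '1' then
      z_if <= b;
    elsif cond3 = '1' then
      z_if <= c;
    else
      z_if <= d;
    end if;
  end process comb;

  -- Equivalent conditional signal assignment outside a process
  z_when <= a when cond1 = '1' else
            b when cond2 = '1' else
            c when cond3 = '1' else
            d;

end rtl;
```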

Example: selected assignment
Note the similarity to an if-clause inside a process

[Fig 1. Basic form of synthesized logic]

[Fig 2. Full logic]

Similar to if-else
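For reference, a selected signal assignment could be written as in the sketch below. The entity and signal names are assumed for illustration. Because the choices are mutually exclusive values of one selector, the tool can build a plain multiplexer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity mux4_select is
  port (
    sel        : in  std_logic_vector(1 downto 0);
    a, b, c, d : in  std_logic_vector(7 downto 0);
    z          : out std_logic_vector(7 downto 0)
  );
end mux4_select;

architecture rtl of mux4_select is
begin

  -- Selected signal assignment: one choice per value of sel
  with sel select
    z <= a when "00",
         b when "01",
         c when "10",
         d when others;  -- covers "11" and the metalogical values

end rtl;
```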

Logic from case-clause

[Figure: conceptual structure of case-clause in HDL]

[Figure: conceptual HW realization (when tri-state isn't used)]

This example has 2 outputs, but again the logic is similar to the if-clause

More complex selected assignment

[Fig 1. Conceptual hardware realization]

[Fig 2. Full logic]

Example: loop
Bounds must be static, like here (3 downto 0)
The loop is "unrolled" in logic
  Everything happens in parallel!
Hence, the loop is equivalent to the unrolled per-bit assignments
Sidenote: y <= a xor b is even better with std_logic_vectors, but then we would not have an example of a loop
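The slide's code did not survive in this transcript, so the following is a reconstruction under assumptions: a bitwise xor over static bounds 3 downto 0, with hypothetical entity and port names. It shows the loop next to its unrolled equivalent.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity xor_loop is
  port (
    a, b       : in  std_logic_vector(3 downto 0);
    y_loop     : out std_logic_vector(3 downto 0);
    y_unrolled : out std_logic_vector(3 downto 0)
  );
end xor_loop;

architecture rtl of xor_loop is
begin

  -- Loop with static bounds: synthesized as four parallel xor gates
  comb : process (a, b)
  begin
    for i in 3 downto 0 loop
      y_loop(i) <= a(i) xor b(i);
    end loop;
  end process comb;

  -- The unrolled equivalent: same hardware, written out by hand
  y_unrolled(3) <= a(3) xor b(3);
  y_unrolled(2) <= a(2) xor b(2);
  y_unrolled(1) <= a(1) xor b(1);
  y_unrolled(0) <= a(0) xor b(0);

end rtl;
```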

Loops: example2 in SW
With software:

for (i = 0; i < max_c; i++) {
  b[i] = a[i] + i;
}

Iterative calculation for b[i] (simplified):
1. Calculate for-clause
2. Fetch a
3. Add
4. Store b
5. Increment i
6. Go back to 1
Takes a lot of clock cycles (several even with loop-unrolling)

Loops: example2 in HW
Hardware:

add_i : for i in 0 to max_c-1 generate
  b(i) <= a(i) + i;
end generate add_i;

Generates <max_c> parallel computation units
  High area overhead
  Result generated in 1 clock cycle, very fast!
However, in HW we can adjust the area-performance ratio
  Pipeline: e.g. half of the result on the first cycle, the rest on the second
  Fully sequential (the SW case): still faster than SW, needs an FSM
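The generate statement above can be wrapped into a complete unit roughly as follows. This is a sketch under assumptions: the arrays a and b are flattened into single vectors (a_in, b_out) so that no package is needed, and the generics max_c and width_c are given arbitrary defaults.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_index is
  generic (
    max_c   : integer := 4;  -- number of parallel adders (assumed default)
    width_c : integer := 8   -- bits per element (assumed default)
  );
  port (
    a_in  : in  std_logic_vector(max_c*width_c-1 downto 0);
    b_out : out std_logic_vector(max_c*width_c-1 downto 0)
  );
end add_index;

architecture rtl of add_index is
begin

  -- One adder per element; all of them operate in parallel
  add_i : for i in 0 to max_c-1 generate
    b_out((i+1)*width_c-1 downto i*width_c) <=
      std_logic_vector(unsigned(a_in((i+1)*width_c-1 downto i*width_c)) + i);
  end generate add_i;

end rtl;
```

Since i is a constant in each generated copy, a synthesis tool is free to simplify each adder accordingly.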

Sharing
Arithmetic operators
  Large implementation
  Limited optimization by synthesis software
  Data width has a major impact
Area reduction can be achieved by "sharing" in RT-level coding
  Operator sharing
  Functionality sharing
Clever tools may do that automatically, and not-so-smart ones need some guidance (a different coding style)

Sharing (2)
Possible when the "value expressions" in the priority network and multiplexing network are mutually exclusive:
  Only one result is routed to the output
The generic format of conditional signal assignment guarantees this:

sig_name <= value_expr_1 when boolean_expr_1 else
            value_expr_2 when boolean_expr_2 else
            value_expr_3 when boolean_expr_3 else
            . . .
            value_expr_n;

Sharing example 1
Original code:

r <= a+b when boolean_exp else
     a+c;

Revised code (enables sharing):

src0 <= b when boolean_exp else
        c;
r <= a + src0;

NOTE: Coded outside a process

Area: 2 adders, 1 mux, Bool

Delay:

Area: 1 adder, 1 mux, Bool

Delay:

However, no free lunch in general: sharing reduces area A but increases delay T in this case

Sharing example 2
Original code:

process (a,b,c,d,...)
begin
  if boolean_exp then
    r <= a+b;
  else
    r <= a+c;
  end if;
end process;

NOTE: Coded inside a process. Equivalent with the previous example

Revised code:

process (a,b,c,d,...)
begin
  if boolean_exp then
    src0 <= a;
    src1 <= b;
  else
    src0 <= a;
    src1 <= c;
  end if;
end process;
r <= src0 + src1;

Guidelines for synthesizable HDL


Synthesis is an integral part right from the start
Separate non-synthesizable (testbench) code into its own entities
Start trial syntheses early. Non-synthesizable structures will be detected early as well
Automate synthesis runs with scripts
  Often it is good to perform a parameter sweep, e.g. synthesize data widths 8, 16, …, 128 bits
Do not optimize your design before the HW cost (area, delay) has been proven!
You can skip non-synthesizable parts with pragmas. Use with care

-- synthesis translate_off
use std.textio.all;
-- synthesis translate_on

General guidelines and hints
Initial values for signals are not generally synthesizable
  Used only by the simulator (and some synthesis tools)
  You must reset all sequential (control) signals explicitly
  You must NOT reset any combinatorial signals or those coming from component instances' outputs
Use std_logic data types in I/O
Use the numeric_std package for arithmetic
Use only descending ranges in arrays (i.e. downto)
  signal write_r   : std_logic_vector(data_width_g-1 downto 0);  -- descending range, preferred
  signal write_out : std_logic_vector(0 to data_width_g-1);      -- ascending range, avoid
Use parentheses to show the order of evaluation
  a and (x or b)

General guidelines and hints (2)
Assignment delay, such as "a <= b after x ns", is problematic
  The assignment will be synthesized but not the delay
  This example will produce a simple wire
  If you "fix" bugs in code like this, it won't work after synthesis
  The only place to use non-synthesizable code is in testbenches
Variables are synthesizable, but…
  it is harder to figure out the resulting logic than with signals (lec12)
High-impedance state 'Z' is synthesizable, but…
  simulation results and real HW do not always match (lec12)
Synthesis tools are great, but…
  they behave differently. Some structures are not accepted by all tools

Notes on combinational circuit synthesis
Do not instantiate basic gates (AND, NOR …)
Synthesis tools are very good at Boolean logic optimization
  Like Karnaugh map/Quine-McCluskey
Not all intermediate signals are preserved in synthesis
Note that the minimal Boolean equation is not necessarily the fastest/smallest/lowest-power, depending on the technology used
  A simple-looking (a AND b) might turn into (a' NOR b')
  A correct schematic may seem strange to a novice

Notes on combinational circuit synthesis (2)
Always write a complete sensitivity list
In every comb. process invocation, every signal must be assigned a value
  Assigned in every branch or with a default assignment
  Otherwise latches are generated to hold the previous values
  We practically never want latches from RTL comb. processes
  Usual suspect: incomplete if-else or such
Avoid combinational loops!
  The same signal on both sides of an assignment
  E.g. a <= a+1; -- aargh!
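The latch rule can be demonstrated with a small, hypothetical example (entity and signal names are made up for this sketch). The first process forgets the else branch and therefore implies a latch; the second one fixes it with a default assignment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity latch_example is
  port (
    sel     : in  std_logic;
    a       : in  std_logic;
    y_latch : out std_logic;  -- unintentionally latched
    y_ok    : out std_logic   -- purely combinational
  );
end latch_example;

architecture rtl of latch_example is
begin

  -- Incomplete if: when sel = '0' the old value must be held -> latch
  bad : process (sel, a)
  begin
    if sel = '1' then
      y_latch <= a;
    end if;
  end process bad;

  -- Default assignment first: y_ok gets a value on every invocation
  good : process (sel, a)
  begin
    y_ok <= '0';
    if sel = '1' then
      y_ok <= a;
    end if;
  end process good;

end rtl;
```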

Notes on sequential circuit synthesis
Do not instantiate flip-flops
Most synthesis tools do not move (optimize) comb. logic across flip-flops, at least not by default
Possible modifications
  Propagating constant values may save a lot in area and delay
  Removing duplicated logic and registers saves area
  Register duplication can cut long wiring delays
  Register retiming can move registers and comb. logic to balance the delays of a pipeline
These are not always enabled by default, and not always automatically detected by the tools
  Try enabling them first, and then try modifying your RTL

Propagating constants
Generics and other system settings may disable parts of the logic in practice
This may not be noticed at the unit level (first step of synthesis)
  E.g. sel is an input of the ALU
Disabling is noticed when the neighbor units are also synthesized and connected
  The preceding unit constantly sets sel=0 and voilà!
Sometimes you must enable this optimization explicitly (depends on the tool)
Sometimes this optimizes surprisingly long paths of logic away
  Good for area, but might confuse debugging

[Figure: ALU]

Removing duplicate logic
Instantiating modules is recommended but may result in duplicate logic
The example bus uses counters to track the reservation state
In practice, all the counters inside the interfaces tick synchronously
A smart synthesis tool (or designer) detects this and instantiates only one counter
Note that this also requires optimization across the unit boundaries

Register duplication
Wiring has a notable delay
  Tens of percent in FPGA
Hence, both logic and routing delays must be balanced
The example has 3 paths and the critical path delay is 9 ns
The long 5 ns wire is split with an additional register
  A little bit larger area
  Critical path reduced to 6 ns
  The outputs of the highlighted register (orig + duplicate) are identical

[Figure: example circuit with path delays annotated (5 ns, 9 ns)]

Register retiming
Moves combinatorial logic to the other side of a DFF
  The function of the DFF's output naturally changes
  This optimization is not allowed if the original output is needed in two places
  E.g. MUL has a long delay, approx. the same as ADD and CMP together
An extreme case places pipeline regs (e.g. 3) behind a purely combinatorial function in RTL
  Reg retiming splits the comb. logic approx. evenly
  The number of connections between the small clouds decides the number of needed parallel regs
Pipeline balancing is very tedious by hand!

Basic optimizations
Prove and locate the problem first!
Constant operands simplify Boolean equations
  For example, consider a 4-bit comparator
    a) x = y
    b) x = 0
The smallest possible data width is of course desired
One can share complex units via multiplexing
Iterative algorithms trade area for delay
Even the most basic operations have different costs

Multiplexers
Multiplexers form a large portion of the logic utilization, especially in FPGA
  E.g. 30% of the Nios II/f soft-core processor area is muxes
An if-structure generates a priority multiplexer

IF cond1 THEN
  z <= a;
ELSIF cond2 THEN
  z <= b;
ELSIF cond3 THEN
  z <= c;
ELSE
  z <= d;
END IF;

It is preferred to use case

CASE sel IS
  WHEN cond1  => z <= a;
  WHEN cond2  => z <= b;
  WHEN cond3  => z <= c;
  WHEN OTHERS => z <= d;
END CASE;

Creates a balanced multiplexer tree since the conditions are mutually exclusive by definition
Sometimes if-elsif does the same, but it is not guaranteed

Multiplexers #2
Do not let the simplicity of VHDL trick you
Multiplexing four 32-bit words requires
  130 input bits (2 control bits + 128 data bits), 32 output bits
    A lot of routing
  32 x 4-to-1 muxes
    A 4-to-1 mux requires three 2-to-1 muxes
    One 2-to-1 mux is implementable in one basic logic element
    => 3*32 = 96 2-to-1 muxes required, 96 LEs consumed
Check your design and see if you can reduce the number of choices
  Only 2 options instead of 3; some options become fixed at synthesis time

Shifters
Variable-amount shifting is area-hungry
Assume a 32-bit vector that can be shifted an arbitrary amount to the left or right
  Needs a 32-to-1 mux for every result bit!
  32-to-1 mux = 31 2-to-1 muxes = 31 LEs
  32*31 = 992 2-to-1 muxes (= LEs)
Non-constant shifters are generally not supported (automatically) by synthesis tools
An FPGA-specific trick is to use the embedded multipliers to do the dynamic shifting
  Multiplying by 2^n shifts the result to the left by n
  Faster and more area-efficient than doing this with LEs
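The multiplier trick might be coded along the lines of the sketch below. The names and the 16-bit width are assumptions. The shift amount is first decoded into 2^n and then multiplied with the data. Whether the tool actually maps the multiplication onto an embedded multiplier depends on the device and tool settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mul_shifter is
  port (
    data_in   : in  std_logic_vector(15 downto 0);
    amount_in : in  std_logic_vector(3 downto 0);   -- shift left by 0..15
    data_out  : out std_logic_vector(15 downto 0)
  );
end mul_shifter;

architecture rtl of mul_shifter is
  signal one_hot : unsigned(15 downto 0);  -- holds 2**amount
  signal product : unsigned(31 downto 0);
begin

  -- Decode the shift amount n into the constant-like factor 2**n
  decode : process (amount_in)
  begin
    one_hot <= (others => '0');
    one_hot(to_integer(unsigned(amount_in))) <= '1';
  end process decode;

  -- data * 2**n equals data shifted left by n
  product  <= unsigned(data_in) * one_hot;
  data_out <= std_logic_vector(product(15 downto 0));  -- keep the low word

end rtl;
```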

Comparators <, >, ==
Avoid implementing == in general logic cells.
Comparators are implementable using arithmetic operations and fast carry chains.
  Calculate a-b and check whether the result is negative, zero, or positive
  Synthesis tools should be aware of this automatically...
Recall that x = a[6:0] < b[6:0] is the same as x = signed(a[6:0]-b[6:0])[7]
  The last carry [overflow] of the subtraction
Note: in ASICs it may not be feasible to use arithmetic for comparison
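The subtraction-based comparison can be sketched in VHDL as follows, assuming unsigned 7-bit operands; the entity and port names are hypothetical. Both operands are extended by one bit so that bit 7 of the difference carries the borrow, i.e. it is '1' exactly when a < b. The ordinary "<" operator is included for comparison.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lt_compare is
  port (
    a, b   : in  std_logic_vector(6 downto 0);
    lt_sub : out std_logic;  -- a < b via subtraction
    lt_op  : out std_logic   -- a < b via the "<" operator
  );
end lt_compare;

architecture rtl of lt_compare is
  signal diff : unsigned(7 downto 0);
begin

  -- Extend to 8 bits and subtract; the top bit is the borrow out
  diff   <= resize(unsigned(a), 8) - resize(unsigned(b), 8);
  lt_sub <= diff(7);

  -- Reference: let the tool pick the implementation
  lt_op <= '1' when unsigned(a) < unsigned(b) else '0';

end rtl;
```

For example, a = 3 and b = 5 give diff = 254 ("11111110"), so lt_sub = '1'.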


Example counter to be optimized
Make a timer with a run-time adjustable period

…
process (clk, rst_n)
begin
  if rst_n ...
  elsif clk'event and clk = '1' then
    if (count = value1) then  -- check limit
      count <= 0;
    else
      count <= count + 1;     -- increment
    end if;
  end if;
end process;

The comparator here takes 2 variables, value[3:0] and q[3:0], as inputs

Optimized basic counter

process (clk, rst_n)
begin
  if rst_n ...
  elsif clk'event and clk = '1' then
    if (count_r = 0) then
      count_r <= value1;       -- initialize
    else
      count_r <= count_r - 1;  -- decrement
    end if;
  end if;
end process;

Now we count to constant zero
The comparator takes 1 variable, q[3:0], and a constant 0...0
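A complete version of the down-counting timer could look like the sketch below. The entity name, the port names, and the added tick_out pulse are assumptions for illustration. With this structure the period is value_in + 1 clock cycles.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity down_timer is
  port (
    clk      : in  std_logic;
    rst_n    : in  std_logic;
    value_in : in  std_logic_vector(3 downto 0);  -- run-time adjustable period
    tick_out : out std_logic                      -- one-cycle pulse per period
  );
end down_timer;

architecture rtl of down_timer is
  signal count_r : unsigned(3 downto 0);
begin

  sync : process (clk, rst_n)
  begin
    if rst_n = '0' then
      count_r  <= (others => '0');
      tick_out <= '0';
    elsif clk'event and clk = '1' then
      if count_r = 0 then                -- compare against constant zero
        count_r  <= unsigned(value_in);  -- reload
        tick_out <= '1';
      else
        count_r  <= count_r - 1;         -- decrement
        tick_out <= '0';
      end if;
    end if;
  end process sync;

end rtl;
```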


An example 0.55 um standard-cell CMOS implementation
Subscripts: a = area-optimized, d = delay-optimized
Asymptotic cost:
  Nand: area is $O(n)$ and time $O(1)$
  ">": area is $O(n)$ and time $O(n)$

Background: Big-O notation for algorithmic complexity
A way to approximate how the cost increases with the number of inputs $n$
Function $f(n)$ belongs to class $O(g(n))$ if $n_0$ and $c$ can be found to satisfy
  $f(n) < c\,g(n)$ for any $n > n_0$
$g(n)$ is a simple function: $1$, $\log_2 n$, $n$, $n \log_2 n$, $n^2$, $n^3$, $2^n$
Following are $O(n^2)$:

Interpretation of Big-O
Filter out the "interference": constants and less important terms
Algorithms with O(2^n) are intractable, but already O(n^3) is very bad
  Not realistic for a larger n
  Frequently, tractable algorithms for a sub-optimal solution exist
One may develop a heuristic algorithm
  They do not guarantee an optimal solution, but usually provide a rather good one with acceptable cost
  Often utilize pseudo-random choices
  For example, simulated annealing and genetic algorithms

Specific to FPGA (details in lec10)
A lot of registers: use them
  Aggressive pipelining
  Objective is to hide the routing delays as much as possible
  Simple logic stages between registers
Adders
  Generally, it's not beneficial to share adders
  FPGAs (e.g. Altera) often contain special structures for adders
  Sharing an adder may cost as much as the adder itself
Hard macros
  Use whenever appropriate
  Higher performance than building one with the FPGA's native resources
  "They are there anyway"
  Embedded multipliers and small SRAMs are common

FPGA #2
Get to know the properties of the device
E.g. FPGA on-chip memories are typically multiples of 9 bits wide
  The ninth bit can be used for
    Control
    Parity bit
  Otherwise, it is wasted!
Memories are typically dual-ported; take advantage of this

Conclusions
Finite state machines can be coded in a variety of ways
  Prefer simplicity, according to in-house coding rules
Coding style has a profound effect on the quality of the hardware
  Area, max clock frequency
  Loops
  Complex assignment logic creates a sea of multiplexers
    E.g. variable-amount left-right shifter
Synthesis tools create different but functionally equivalent netlists even for small designs
Know your FPGA!
  You might save area and time by using some hard-coded macros
  However, these are tricks that you should only use in the final optimization phase

References
1. James Ball, "Designing Soft-Core Processors for FPGAs", in Processor Design: System-on-Chip Computing for ASICs and FPGAs, ed. Jari Nurmi.

Extra slides


Types
Using own types may significantly clarify the code
Declaration:

TYPE location IS RECORD
  x     : INTEGER range 0 to location_max_c-1;
  y     : INTEGER range 0 to location_max_c-1;
  valid : std_logic;
END RECORD;

TYPE locations_type IS ARRAY (0 to 3) of location;
SIGNAL loc_r : locations_type;

Usage:

for i in 0 to 3 loop
  loc_r(i).x     <= i;
  loc_r(i).y     <= 3-i;
  loc_r(i).valid <= '1';
end loop;

Resulting contents:

          x, y, valid
loc_r(0)  0, 3, '1'
loc_r(1)  1, 2, '1'
loc_r(2)  2, 1, '1'
loc_r(3)  3, 0, '1'

Types #2
Initialization of an array constant:

constant a_bound_c : integer := 2;
type vector_2d is array (0 to a_bound_c-1) of std_logic_vector(1 downto 0);
type vector_3d is array (0 to a_bound_c-1) of vector_2d;
constant initial_values_c : vector_3d := (("00", "01"),
                                          ("10", "11"));

You may split the initialization onto multiple lines to increase readability

initial_values_c (c0 = vertical, c1 = horizontal):

c0\c1   0     1
0       "00"  "01"
1       "10"  "11"

Types #3
Special case: a one-element array must be initialized with named association:

constant ip_types_c : integer := 1;
type ip_vect is array (0 to ip_types_c-1) of integer;
constant ip_amount_c  : ip_vect := (0 => 1);  -- right way
constant ip_amount2_c : ip_vect := (1);       -- does not work!
constant ip_amount2_c : ip_vect := 1;         -- does not work!

** Error: rtm_pkg.vhd(20): Integer literal 1 is not of type ip_vect.

There is only a single value, but it is an array nonetheless

Not supported by synthesis
Signals in packages (global signals)
Signal and variable initialization
  Typically ignored (there are exceptions, e.g. Xilinx FPGA synthesis)
Unconstrained while and for loops
More than one 'event in a process
Multiple wait statements
Physical types, for example time
Access types
File types
Guard expression
(Sensitivity lists, delays and asserts are ignored)

CMOS cell layout example
The result of compilation of one cell (basic gate, DFF, mux etc.)

[Figure: cell layout with legend: metalization, contact, polysilicon, diffusion p, diffusion n; callout: two transistors in series]

[M. Perkows, Class VHDL 99, June 2008, http://web.cecs.pdx.edu/~mperkows/CLASS_VHDL_99/June2008/]

Standard-cell layout example
Logic has been mapped to basic cells, and they have been placed and routed. In std-cell technology, there are specific rows for placing logic and for wiring. All logic cells in a row share the same voltage supply and ground. Note that cells have uniform height but different widths.

Standard-cell layout example
Same as the previous one, but showing multiple silicon layers at once.