TKT-1212 Digitaalijärjestelmien suunnittelu (Digital Systems Design), 2013-01-30
FSM implementations, VHDL synthesis basics
Erno Salminen, 2013
[Figure: Moore machine block diagram (input, current state, next state, output)]

Acknowledgements
- Prof. Pong P. Chu provided the "official" slides for the book, which is gratefully acknowledged. See also: http://academic.csuohio.edu/chu_p/
- Most slides were made by Ari Kulmala and other previous lecturers (Teemu Pitkänen, Konsta Punkka, Mikko Alho…)
- M. Perkows, ECE 590: Digital System Design Using Hardware Description Languages, Portland State University, June 2008, http://web.cecs.pdx.edu/~mperkows/CLASS_VHDL_99/June2008/
Erno Salminen, TUT, 2012

Logic synthesis: from algorithm to circuit
[Figure: abstraction levels Algorithm, Architecture, Register level and Gate level, annotated in that order as 0%, 10%, 20% and 100% technology dependent. Behavioral synthesis and logic synthesis are marked as the steps between the levels. Example algorithm: for i = 0 to 15: sum = sum + data[i], realized with a memory (MEM) holding Data[0]..Data[15], an index i, a sum register and a clock. Adapted from [M. Perkows, Class VHDL 99, June 2008]]

FSM

Finite State Machines
- All the previous teachings are still valid; only the description style changes.
[Figure: state diagram of a player FSM. States and outputs: Stop (y=z1), Play (y=z2), Play x 2 (y=z3), Rewplay x 2 (y=z4), Next_track (y=z5), Prev_track (y=z6). Transition labels: in=pl, in=st, in=x2, in=-x2, in=next, in=prev, in=others, always]

Finite-state machine (FSM)
- The application defines the set of correct state machines
- Two basic flavors: Mealy and Moore
- In both cases, one must select whether to include output registers or not
- Moreover, you decide the VHDL presentation of the FSM
  - Description style: how many processes
  - Encoding of states, if not automated in synthesis
[Figure: block diagrams of the Mealy and Moore machines (input, current state, next state, output)]

FSM implementation in VHDL
- General form: we define a dedicated type for the state machine states
- ALWAYS use an enumeration type for the states; otherwise synthesis software, e.g. Quartus II, does not recognize the FSM

    architecture rtl of traffic_light is
      -- Enumeration type for states
      type states_type is (red, yellow, green);
      -- Signal ctrl_r is the current state register.
      -- Init state is explicitly defined in reset, not here.
      signal ctrl_r : states_type;
      ...
    begin  -- rtl

At least 5 implementation styles
1. 1 sequential process
2. 2 processes
   a) Seq: current state registers and output; Comb: next-state logic
   b) Seq: current state; Comb: next state and output
   c) Seq: next and current state; Comb: output
3. 3 processes (Seq: current state; Comb: output; Comb: next-state logic, all separated)
[Figure: block diagrams of the Mealy and Moore machines (input, current state, next state, output)]

Coding style: 1seg-Moore/Reg-Mealy
- The 1-segment style uses just 1 sequential process

    sync_all : process (clk, rst_n)
    begin  -- process sync_all
      if rst_n = '0' then
        <INIT STATE and OUTPUT OF THE FSM>
      elsif clk'event and clk = '1' then
        <define new value of curr state>
        <define output. These outputs become registers!>
      end if;
    end process sync_all;
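
To make the 1-segment template concrete, here is a minimal sketch of a registered traffic light FSM. The entity name, the req_in port and the transition conditions are illustrative assumptions; the course's actual traffic_light example also contains a yellow-light counter, which is omitted here. The bit order of r_y_g (red, yellow, green) follows the "010" = yellow convention used in the timing slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names and ports are assumptions for illustration.
entity traffic_light_1seg is
  port (
    clk    : in  std_logic;
    rst_n  : in  std_logic;
    req_in : in  std_logic;                      -- assumed request input
    r_y_g  : out std_logic_vector(2 downto 0));  -- red, yellow, green
end entity traffic_light_1seg;

architecture rtl of traffic_light_1seg is
  type states_type is (red, yellow, green);
  signal ctrl_r : states_type;                   -- current state register
begin

  -- Single clocked process: both the state and the output become registers.
  sync_all : process (clk, rst_n)
  begin
    if rst_n = '0' then
      ctrl_r <= red;
      r_y_g  <= "100";
    elsif clk'event and clk = '1' then
      case ctrl_r is
        when red =>
          if req_in = '1' then
            ctrl_r <= yellow;
            r_y_g  <= "010";   -- output assigned together with the state
          end if;
        when yellow =>
          ctrl_r <= green;
          r_y_g  <= "001";
        when green =>
          if req_in = '0' then
            ctrl_r <= red;
            r_y_g  <= "100";
          end if;
      end case;
    end if;
  end process sync_all;

end architecture rtl;
```

Because r_y_g is assigned inside the clocked branch, synthesis should infer three output flip-flops in addition to the state register.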

Coding style: 2seg-Moore
- The Moore two-segment coding style uses 1 sequential process and 1 combinational process (or concurrent assignments)

    sync_ps : process (clk, rst_n)
    begin  -- process sync_ps
      if rst_n = '0' then
        <INIT STATE OF THE FSM>
      elsif clk'event and clk = '1' then
        <Synchronous part of the FSM; assign next state to curr state>
      end if;
    end process sync_ps;

    comb_ns_output : process (ctrl_r, input)
    begin  -- process comb_ns_output
      <Combinational part; define next state, define output>
    end process comb_ns_output;

    end rtl;
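
As a filled-in instance of the two-segment Moore template, the sketch below separates the state register from the combinational next-state and output logic. The entity name, the req_in port, the next_state signal and the transition conditions are assumptions made for illustration only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names and ports are assumptions for illustration.
entity traffic_light_moore is
  port (
    clk    : in  std_logic;
    rst_n  : in  std_logic;
    req_in : in  std_logic;                      -- assumed request input
    r_y_g  : out std_logic_vector(2 downto 0));  -- red, yellow, green
end entity traffic_light_moore;

architecture rtl of traffic_light_moore is
  type states_type is (red, yellow, green);
  signal ctrl_r     : states_type;  -- current state register
  signal next_state : states_type;  -- combinational, not a register
begin

  -- Sequential segment: only the state register.
  sync_ps : process (clk, rst_n)
  begin
    if rst_n = '0' then
      ctrl_r <= red;
    elsif clk'event and clk = '1' then
      ctrl_r <= next_state;
    end if;
  end process sync_ps;

  -- Combinational segment: next state and Moore output.
  -- The output depends on ctrl_r only; req_in affects just next_state.
  comb_ns_output : process (ctrl_r, req_in)
  begin
    next_state <= ctrl_r;             -- default: stay in the current state
    case ctrl_r is
      when red =>
        r_y_g <= "100";
        if req_in = '1' then
          next_state <= yellow;
        end if;
      when yellow =>
        r_y_g      <= "010";
        next_state <= green;
      when green =>
        r_y_g <= "001";
        if req_in = '0' then
          next_state <= red;
        end if;
    end case;
  end process comb_ns_output;

end architecture rtl;
```

Every branch assigns r_y_g and next_state has a default value, so the combinational process should not infer latches.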

Coding style: 2seg-Mealy
- The Mealy two-segment coding style is like the previous one

    sync_ps : process (clk, rst_n)
    begin  -- process sync_ps
      if rst_n = '0' then
        <INIT STATE OF THE FSM>
      elsif clk'event and clk = '1' then
        <Synchronous part of the FSM; assign next state to curr state>
      end if;
    end process sync_ps;

    comb_ns_output : process (ctrl_r, input)
    begin  -- process comb_ns_output
      <Combinational part; define next state, define output
       (Looks the same as Moore, but here the input is also considered for the output)>
    end process comb_ns_output;

    end rtl;

Coding style: 3seg_Moore/Mealy
- The 3-segment style separates next-state and output logic into their own comb. processes

    sync_ps : process (clk, rst_n)
    begin  -- process sync_ps
      if rst_n = '0' then
        <INIT STATE OF THE FSM>
      elsif clk'event and clk = '1' then
        <curr state assignment>
      end if;
    end process sync_ps;

    comb_ns : process (ctrl_r, input)
    begin  -- process comb_ns
      <Combinational part; define next state>
    end process comb_ns;

    comb_output : process (ctrl_r, input)
    begin  -- process comb_output
      <Combinational part; define output>
    end process comb_output;

Examples
- Traffic light FSM implemented with various styles
  - Examples shown as VHDL code
  - Bottom of page http://www.tkt.cs.tut.fi/kurssit/1212/K11/luennot.html
- They also show the usage of a counter in a state machine
  - Acts as a timer for showing the yellow light
- Output latency is larger for Moore and registered Mealy
- However, all the implementations keep the yellow light on for the same amount of time
  - Sometimes the designer must modify timer limit values (e.g. if counter_r = limit-1 instead of counter_r = limit)

Timing of traffic light FSMs
- All versions show the yellow light for the same amount of time (r_y_g = "010")
- However, there is a 1-cycle difference in when yellow is ON
  - Inside the DUV, state and counter values are aligned differently as well
- I. Mealy 2a, 2b, 3 react immediately
  - No output register, only comb. delay from req to r_y_g (= 0 ns in RTL simulation)
- II. Mealy 1 needs 1 cycle
  - Output register and curr_state_r change simultaneously
  - The VHDL code assigns them when the state changes
- III. Moore 2 needs 1 cycle
  - Updating curr_state_r takes 1 cycle
  - Output assigned combinatorially from curr_state_r (= 0 ns in RTL simulation)
[Figure: simulation waveforms of Mealy_1 (case II), Mealy_2a, Mealy_2b, Mealy_3 (case I) and Moore_2 (case III)]

Comparison of implementation styles: Coding style
- 1-segment:
  - Just one sync process
  - Automatically inferred output registers
  - Simple view of the design, everything in one place
  - Safe; a registered Mealy machine is easy to implement with this style
  - Recommended (as opposed to the course book!)
- 2-segment, 3-segment:
  - Only way to implement unregistered FSM outputs
  - Modular
  - Long ago synthesis tools did not recognize 1-segment FSMs correctly
    - Not the case anymore
  - Recommended style in many books, partially because of those limitations of the old tools
  - Useful for quite simple control machines that do not include e.g. delay counters
  - Complex state machines are cumbersome to read
    - The code does not proceed smoothly; one has to jump around the code
    - The same conditions may be repeated in many processes

Quartus II design flow after you've simulated and verified the design
[Figure: flow diagram; the steps are annotated in this order]
- Generic gate-level representation, just gates and flip-flops
- Places and routes the logic into a device: logic elements, macros and routing cells
- Converts the post-fit netlist into an FPGA programming file (.sof)
- Analyzes and validates the timing performance of all logic in a design
- Run on FPGA

Examples: extracted state diagram, Tool A
[Figure: state diagram extracted by Tool A]
- Note the encoding: it is neither basic binary nor one-hot.

RTL synthesis result: Tool A
- Registered Mealy machine, traffic light VHD
[Figure: RTL schematic with callouts: register for output bit 2; registers for output bits 1..0; state register; combinatorial output logic; next-state logic, incl. counter for showing the yellow light long enough; comb. path from input to output logic]

Technology schematic, Tool A
- The logic on the previous slide has been mapped to the FPGA's primitives
- Registered Mealy machine, traffic light VHD
[Figure: technology schematic with callouts: single flip-flops; look-up tables, max 6 inputs]

RTL synthesis result: Tool B
- Same VHDL, slightly different result
- Registered Mealy machine, traffic light VHD

    # Info: [45144]: Extracted FSM in module work.traffic_light(rtl){generic map (n_colors_g => 3 yellow_length_g => 10)}, with state variable = ctrl_r[1:0], async set/reset state(s) = 00 , number of states = 3.
    # Info: [45144]: Preserving the original encoding in 3 state FSM
    # Info: [45144]: FSM: State encoding table.
    # Info: [40000]: FSM: Index Literal Encoding
    # Info: [40000]: FSM:     0      00       00
    # Info: [40000]: FSM:     1      01       01
    # Info: [40000]: FSM:     2      10       10

- Note the different state encoding

Technology schematic, Tool B
- Registered Mealy machine, traffic light VHD
[Figure: technology schematic with callouts: multi-bit registers; LUTs]

Physical placement on-chip
- The traffic_light.vhd placed and routed
- Stratix 2S180, 143 000 ALUTs (~LUTs)
- Quite a lot of unused resources...
[Figure: chip floorplan showing the placed design]

Implementation area and freq
- Note that no strict generalization can be made about the "betterness"
- Tool A: total ALUTs 15, ALUTs with register 10
- Tool B: LUTs 16, registers 9
- The one-register difference is due to the different state encoding
  - The state encoding can be explicitly defined or left to the tool to choose (as in this case)

Synthesis of different VHDLs
- Functionally equivalent
- Timing aspects vary
  - Different max frequency
  - Only the "Mealy single" has output registers (3 bits)
- Coding style has a minor effect here
- Readability of the code is just as crucial!

    Implementation                     AREA [LUT]  AREA [reg]  Lines of code
    mealy (single)                     16          9           104
    mealy (output separated)           13          6           126
    mealy_2proc. (out+ns separated)    11          6           125
    mealy_3proc                        11          6           150
    Moore                              11          6           108

Comparison of implementation styles: Moore and Mealy
- The number of processes does not affect HW area and speed deterministically
  - The differences are mainly in readability
- Generally, we want the outputs to be registered
  - A traditional Mealy machine is sometimes problematic due to possible long combinational paths or loops
- For registered outputs, use a registered Mealy machine
  - Outputs are registered, but the latency is shorter than in a Moore machine with registered outputs
- Otherwise, opt for a Moore machine

Notes on finite state machines
- Quite often datapath and control get mixed in the HDL description
- Start with a "slow and simple FSM" if the neighbor blocks allow that
  - Takes a few extra cycles but has less branching and hence simpler conditions
  - Easier to get it working at all
  - You can later save a few clock cycles by skipping some state in certain conditions (e.g. adding the red arc wait_ack -> write)
- Be careful with the timing of the output register
[Figure: state diagram with states Wait data, write and Wait ack]

Synthesis observations
- Make sure that you are aware of which signals in the shown code have been implemented as registers!
- In most cases, use enumeration and let the synthesis tool decide the state encoding
  - Not much difference in delay or area in realistic circuits
  - One-hot encoding is the easiest to debug!
- Different tools produce slightly different results even in small designs
  - Synthesis tools are heuristic due to the very large design space
  - A modest effect (e.g. -10% to +10%) is also achievable by tuning the tool settings
- Even a single tool may produce slightly different results on different runs!
  - Optimization heuristics utilize randomness
- However, no tool can convert a bad design into a good one!

Foreword: VHDL and synthesis
- The main goal of writing VHDL is to generate a synthesizable description
- This lecture presents some practical examples of how to write code that is good for synthesis
- The quality of the design is much affected by the coding style
  - You must be able to choose the structures that synthesize the best

Logic design basics still apply
- Modularize the design into components
  - Easier to design single components
  - Easier to upgrade
  - Prefer registered outputs
- Use a version control system (e.g. SVN or git)
- Thou shalt not mess with clock and reset
  - Use only rising-edge clocking
  - Preferably active-high enable signals
  - Asynchronous reset is used only to initialize
    - Not part of the functionality
    - Hence, you don't force reset from your code
    - Use a separate clear signal or similar if needed. It is checked in the edge-sensitive if-branch (lec 12)
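
The clock and reset rules above can be sketched as follows. This is a minimal example under assumed names (counter_with_clear, clear_in, enable_in, width_g are all hypothetical): the asynchronous reset only initializes, while the functional clear is an ordinary input tested inside the clocked branch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and ports are assumptions for illustration.
entity counter_with_clear is
  generic (
    width_g : integer := 8);
  port (
    clk       : in  std_logic;
    rst_n     : in  std_logic;   -- asynchronous reset, initialization only
    clear_in  : in  std_logic;   -- synchronous clear, part of the function
    enable_in : in  std_logic;   -- active-high enable
    count_out : out std_logic_vector(width_g-1 downto 0));
end entity counter_with_clear;

architecture rtl of counter_with_clear is
  signal count_r : unsigned(width_g-1 downto 0);
begin

  count_out <= std_logic_vector(count_r);

  sync : process (clk, rst_n)
  begin
    if rst_n = '0' then
      count_r <= (others => '0');
    elsif clk'event and clk = '1' then   -- rising edge only
      if clear_in = '1' then             -- clear is checked on the clock edge
        count_r <= (others => '0');
      elsif enable_in = '1' then
        count_r <= count_r + 1;
      end if;
    end if;
  end process sync;

end architecture rtl;
```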

Simplest synthesis example
- In this course, we concentrate on RTL synthesis: how HDL is converted into a netlist of basic gates and flip-flops
  - Technology mapping, routing and placement are beyond the scope of this course
- Example: arith_result <= a + b + c - 1;
  - The resulting combinatorial logic is straightforward
  - Inclusion of a DFF depends on the context (inside a sync process or not)
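
A short sketch of how the context decides whether a DFF is included. The entity, the 8-bit signed operands and the two output ports are assumptions for illustration; only the expression a + b + c - 1 comes from the slide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and widths are assumptions for illustration.
entity arith_example is
  port (
    clk      : in  std_logic;
    rst_n    : in  std_logic;
    a, b, c  : in  signed(7 downto 0);
    comb_out : out signed(7 downto 0);   -- purely combinational result
    reg_out  : out signed(7 downto 0));  -- same result behind a register
end entity arith_example;

architecture rtl of arith_example is
  signal arith_result : signed(7 downto 0);
begin

  -- Concurrent assignment: adders/subtractor only, no flip-flops.
  -- The result wraps around at 8 bits.
  arith_result <= a + b + c - 1;
  comb_out     <= arith_result;

  -- The same value assigned inside a clocked process infers 8 DFFs.
  sync : process (clk, rst_n)
  begin
    if rst_n = '0' then
      reg_out <= (others => '0');
    elsif clk'event and clk = '1' then
      reg_out <= arith_result;
    end if;
  end process sync;

end architecture rtl;
```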

Example: if-else
[Figure: conceptual structure of nested if-clauses in HDL]
[Figure: conceptual hardware realization (when none of the value_expr_x is "ZZ…Z"; high-impedance is covered in Lec 12)]

Example: selected assignment
- Note the similarity to an if-clause inside a process
[Figure: Fig 1. Basic form of synthesized logic]
[Figure: Fig 2. Full logic]
- Similar to if-else

Logic from case-clause
[Figure: conceptual structure of a case-clause in HDL]
[Figure: conceptual HW realization (when tri-state isn't used)]
- This example has 2 outputs, but again the logic is similar to the if-clause

More complex selected assignment
[Figure: Fig 1. Conceptual hardware realization]
[Figure: Fig 2. Full logic]

Example: loop
[Figure: VHDL for-loop example and the resulting logic; the code is not preserved in the transcript]
- Bounds must be static, like here (3 downto 0)
- The loop is "unrolled" in logic
  - Everything happens in parallel!
- Hence, the loop is equivalent to the unrolled assignments shown in the figure
- Sidenote: y <= a xor b is even better with std_logic_vectors, but then we would not have an example of a loop
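
Since the slide's code did not survive, the following is a plausible reconstruction rather than the original: a bitwise xor written as a loop with static bounds 3 downto 0, as the sidenote suggests. The entity name and port names are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a reconstruction for illustration, not the slide's code.
entity xor_loop is
  port (
    a, b : in  std_logic_vector(3 downto 0);
    y    : out std_logic_vector(3 downto 0));
end entity xor_loop;

architecture rtl of xor_loop is
begin

  comb : process (a, b)
  begin
    -- Static bounds: synthesis unrolls this into four parallel xor gates.
    for i in 3 downto 0 loop
      y(i) <= a(i) xor b(i);
    end loop;
    -- The unrolled form would read:
    --   y(3) <= a(3) xor b(3);
    --   y(2) <= a(2) xor b(2);
    --   y(1) <= a(1) xor b(1);
    --   y(0) <= a(0) xor b(0);
  end process comb;

end architecture rtl;
```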

Loops: example2 in SW
- With software:

    for (i = 0; i < max_c; i++) {
      b[i] = a[i] + i;
    }

- Iterative calculation for b[i] (simplified):
  1. Calculate the for-clause
  2. Fetch a
  3. Add
  4. Store b
  5. Increment i
  6. Go back to 1
- Takes a lot of clock cycles (several even with loop unrolling)

Loops: example2 in HW
- Hardware:

    add_i : for i in 0 to max_c-1 generate
      b(i) <= a(i) + i;
    end generate add_i;

- Generates <max_c> parallel computation units
  - High area overhead
- Result generated in 1 clock cycle, very fast!
- However, in HW we can adjust the area-performance ratio
  - Pipelined: e.g. half of the result on the first cycle, the rest on the second
  - Fully sequential (the SW case): still faster than SW, needs an FSM

Sharing
- Arithmetic operators
  - Large implementation
  - Limited optimization by synthesis software
  - Data width has a major impact
- Area reduction can be achieved by "sharing" in RT-level coding
  - Operator sharing
  - Functionality sharing
- Clever tools may do that automatically; not-so-smart ones need some guidance (a different coding style)

Sharing (2)
- Possible when the "value expressions" in the priority network and multiplexing network are mutually exclusive:
  - Only one result is routed to the output
- The generic format of conditional signal assignment guarantees this:

    sig_name <= value_expr_1 when boolean_expr_1 else
                value_expr_2 when boolean_expr_2 else
                value_expr_3 when boolean_expr_3 else
                . . .
                value_expr_n;

Sharing example 1
- Original code:

    r <= a+b when boolean_exp else
         a+c;

- Revised code (enables sharing):

    src0 <= b when boolean_exp else
            c;
    r <= a + src0;

- NOTE: coded outside a process
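[Figure: circuit of the original code. Area: 2 adders, 1 mux, Bool. Delay: expression not preserved]
[Figure: circuit of the revised code. Area: 1 adder, 1 mux, Bool. Delay: expression not preserved]
- However, no free lunch in general: sharing reduces area A but increases delay T in this case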

Sharing example 2
- Original code:

    process(a,b,c,d,...)
    begin
      if boolean_exp then
        r <= a+b;
      else
        r <= a+c;
      end if;
    end process;

- NOTE: coded inside a process; equivalent to the previous example
- Revised code:

    process(a,b,c,d,...)
    begin
      if boolean_exp then
        src0 <= a;
        src1 <= b;
      else
        src0 <= a;
        src1 <= c;
      end if;
    end process;
    r <= src0 + src1;

Synthesis is an integral part right from the start
- Separate non-synthesizable (testbench) code into its own entities
- Start trial syntheses early. Non-synthesizable structures will be detected early as well
- Automate synthesis runs with scripts
- Often it is good to perform a parameter sweep, e.g. synthesize data widths 8, 16, …, 128 bits
- Do not optimize your design before the HW cost (area, delay) has been proven!
- You can skip non-synthesizable parts with pragmas. Use with care

    -- synthesis translate_off
    use std.textio.all;
    -- synthesis translate_on

General guidelines and hints
- Initial values for signals are not generally synthesizable
  - Used only by the simulator (and some synthesis tools)
  - You must reset all sequential (control) signals explicitly
  - You must NOT reset any combinatorial signals or those coming from component instances' outputs
- Use std_logic data types in I/O
- Use the numeric_std package for arithmetic
- Use only descending ranges in arrays (downto)

    signal write_r   : std_logic_vector(data_width_g-1 downto 0);  -- descending range, preferred
    signal write_out : std_logic_vector(0 to data_width_g-1);      -- ascending range, avoid

- Use parentheses to show the order of evaluation

    a and (x or b)

General guidelines and hints (2)
- Assignment delay, such as "a <= b after x ns", is problematic
  - The assignment will be synthesized but not the delay
  - This example will produce a simple wire
  - If you "fix" bugs in code like this, it won't work after synthesis
  - The only place to use non-synthesizable code is testbenches
- Variables are synthesizable, but…
  - it is harder to figure out the resulting logic than with signals (lec12)
- High-impedance state 'Z' is synthesizable, but…
  - simulation results and real HW do not always match (lec12)
- Synthesis tools are great, but…
  - they behave differently. Some structures are not accepted by all tools

Notes on combinational circuit synthesis
- Do not instantiate basic gates (AND, NOR …)
- Synthesis tools are very good at Boolean logic optimization
  - Like Karnaugh map / Quine-McCluskey
- Not all intermediate signals are preserved in synthesis
- Note that the minimal Boolean equation is not necessarily the fastest/smallest/lowest-power, depending on the technology used
  - A simple-looking (a AND b) might turn into (a' NOR b')
  - A correct schematic may seem strange to a novice

Notes on combinational circuit synthesis (2)
- Always write a complete sensitivity list
- In every invocation of a comb. process, every signal must be assigned a value
  - Assigned in every branch or with a default assignment
  - Otherwise latches are generated to hold the previous values
  - We practically never want latches from RTL comb. processes
  - Usual suspect: an incomplete if-else or such
- Avoid combinational loops!
  - The same signal on both sides of an assignment
  - E.g. a <= a+1; -- aargh!
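
The latch rule can be illustrated with a small sketch. The entity and signal names are assumptions; the point is the default assignment at the top of the process, which guarantees that z gets a value on every invocation even though the if-elsif has no else branch.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: names and ports are assumptions for illustration.
entity no_latch is
  port (
    sel_in : in  std_logic_vector(1 downto 0);
    a, b   : in  std_logic;
    z      : out std_logic);
end entity no_latch;

architecture rtl of no_latch is
begin

  -- Complete sensitivity list: every signal that is read appears here.
  comb : process (sel_in, a, b)
  begin
    z <= '0';                 -- default assignment; without it, the missing
                              -- else branch would make z hold its old value,
                              -- i.e. infer a latch
    if sel_in = "01" then
      z <= a;
    elsif sel_in = "10" then
      z <= b;
    end if;
  end process comb;

end architecture rtl;
```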

Notes on sequential circuit synthesis
- Do not instantiate flip-flops
- Most synthesis tools do not move (optimize) comb. logic across flip-flops, at least not by default
- Possible modifications:
  - Propagating constant values may save a lot in area and delay
  - Removing duplicated logic and registers saves area
  - Register duplication can cut long wiring delays
  - Register retiming can move registers and comb. logic to balance the delays of a pipeline
- These are not always enabled by default, and not always automatically detected by the tools
  - First try enabling them, and then try modifying your RTL

Propagating constants
- Generics and other system settings may disable parts of the logic in practice
- This may not be noticed at the unit level (first step of synthesis)
  - E.g. sel is an input of the ALU
- Disabling is noticed when the neighbor units are also synthesized and connected
  - The preceding unit constantly sets sel=0 and voilà!
- Sometimes you must enable this optimization explicitly (depends on the tool)
- Sometimes this optimizes surprisingly long paths of logic away
  - Good for area, but might confuse debugging
[Figure: ALU with input sel]

Removing duplicate logic
- Instantiating modules is recommended but may result in duplicate logic
- The example bus uses counters to track the reservation state
- In practice, all the counters inside the interfaces tick synchronously
- A smart synthesis tool (or designer) detects this and instantiates only one counter
- Note that this also requires optimization across the unit boundaries

Register duplication
- Wiring has a notable delay
  - Tens of percent in FPGA
- Hence, both logic and routing delays must be balanced
- The example in the figure has 3 paths and the critical path delay is 9 ns
- The long 5 ns wire is split with an additional register
  - Slightly larger area
  - Critical path reduced to 6 ns
  - The outputs of the highlighted registers (orig + duplicate) are identical
[Figure: paths before and after register duplication, annotated with delays of 5 ns and 9 ns]

Register retiming
- Moves combinatorial logic to the other side of a DFF
  - The function of the DFF's output naturally changes
  - This optimization is not allowed if the original output is needed in two places
  - E.g. MUL has a long delay, approx. the same as ADD and CMP together
- The extreme case places pipeline regs (e.g. 3) behind a purely combinatorial function in RTL
  - Reg retiming splits the comb. logic approx. evenly
  - The number of connections between the small clouds decides the number of needed parallel regs
- Pipeline balancing is very tedious by hand!

Basic optimizations
- Prove and locate the problem first!
- Constant operands simplify Boolean equations
  - For example, consider a 4-bit comparator: a) x = y, b) x = 0
- The smallest possible data width is of course desired
- One can share complex units via multiplexing
- Iterative algorithms trade area for delay
- Even the most basic operations have different costs

Multiplexers
- Multiplexers form a large portion of the logic utilization, especially in FPGA
  - E.g. 30% of the Nios II/f soft-core processor area is muxes
- An if-structure generates a priority multiplexer

    IF cond1 THEN
      z <= a;
    ELSIF cond2 THEN
      z <= b;
    ELSIF cond3 THEN
      z <= c;
    ELSE
      z <= d;
    END IF;

- It is preferable to use case

    CASE sel IS
      WHEN cond1  => z <= a;
      WHEN cond2  => z <= b;
      WHEN cond3  => z <= c;
      WHEN OTHERS => z <= d;
    END CASE;

  - Creates a balanced multiplexer tree since the conditions are mutually exclusive by definition
  - Sometimes if-elsif does the same, but it is not guaranteed

Multiplexers #2
- Do not let the simplicity of VHDL trick you
- Multiplexing four 32-bit words requires 130 input bits (2 control bits + 128 data bits) and 32 output bits
  - A lot of routing
  - 32 x 4-to-1 muxes
  - A 4-to-1 mux requires three 2-to-1 muxes
  - One 2-to-1 mux is implementable in one basic logic element
  - => 3*32 = 96 2-to-1 muxes required, 96 LEs consumed
- Check your design and see if you can reduce the number of choices
  - Only 2 options instead of 3; some options become fixed at synthesis time

Shifters
- Variable-amount shifting is area-hungry
- Assume a 32-bit vector that can be shifted an arbitrary amount to the left or right
  - Needs a 32-to-1 mux for every result bit!
  - 32-to-1 mux = 31 2-to-1 muxes = 31 LEs
  - 32*31 = 992 2-to-1 muxes (= LEs)
- Non-constant shifters are generally not supported (automatically) by synthesis tools
- An FPGA-specific trick is to use the embedded multipliers for dynamic shifting
  - Multiplying by 2^n shifts the operand left by n
  - Faster and more area-efficient than doing this with LEs
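
The multiplier trick can be sketched as below, assuming a 16-bit operand and hypothetical names (mult_shifter, amount_in, one_hot). A small decoder produces 2^n as a one-hot vector, and a single multiplication performs the left shift. Whether the multiplication actually lands in an embedded multiplier depends on the tool and device, so treat this as a sketch rather than a guaranteed mapping.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and widths are assumptions for illustration.
entity mult_shifter is
  port (
    data_in   : in  unsigned(15 downto 0);
    amount_in : in  unsigned(3 downto 0);    -- shift amount n = 0..15
    data_out  : out unsigned(15 downto 0));  -- data_in shifted left by n
end entity mult_shifter;

architecture rtl of mult_shifter is
  signal one_hot : unsigned(15 downto 0);    -- holds 2**n
  signal product : unsigned(31 downto 0);    -- 16 x 16 -> 32 bits
begin

  -- Decoder: set only bit n, i.e. one_hot = 2**n.
  decode : process (amount_in)
  begin
    one_hot                        <= (others => '0');
    one_hot(to_integer(amount_in)) <= '1';
  end process decode;

  -- Multiplying by 2**n shifts the operand left by n.
  product  <= data_in * one_hot;
  data_out <= product(15 downto 0);          -- bits shifted out are dropped

end architecture rtl;
```

The upper half of product contains the bits shifted out of the 16-bit window, so the same multiplier could also serve a right shift by selecting a different slice.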

Comparators <, >, ==
- Avoid implementing == in general logic cells
- Comparators are implementable using arithmetic operations and fast carry chains
  - Calculate a-b and check whether the result is negative, zero, or positive
- Synthesis tools should be aware of this automatically...
- Recall that x = a[6:0] < b[6:0] is the same as x = signed(a[6:0]-b[6:0])[7]
  - The last carry [overflow] of the subtraction
- Note: in ASICs it may not be feasible to use arithmetic for comparison
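
The identity above can be written out as a minimal sketch; the entity and signal names are assumptions. Both 7-bit operands are extended by one bit, and the top bit of the 8-bit difference is the comparison result.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names are assumptions for illustration.
entity less_than is
  port (
    a, b : in  unsigned(6 downto 0);
    x    : out std_logic);             -- '1' when a < b
end entity less_than;

architecture rtl of less_than is
  signal diff : unsigned(7 downto 0);  -- one bit wider than the operands
begin

  -- Zero-extend to 8 bits and subtract. If a < b the subtraction wraps
  -- around and bit 7 becomes '1'; otherwise the result is 0..127 and
  -- bit 7 stays '0'.
  diff <= resize(a, 8) - resize(b, 8);
  x    <= diff(7);

end architecture rtl;
```

For example, a = 3 and b = 5 give diff = 254 ("11111110"), so x = '1'; a = 5 and b = 3 give diff = 2, so x = '0'.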

Example counter to be optimized…

    process (clk, rst_n)
    begin
      if rst_n...
      elsif clk'event and clk = '1' then
        if (count = value1) then  -- check limit
          count <= 0;
        else
          count <= count + 1;     -- increment
        end if;
      end if;
    end process;

- Make a timer with a run-time adjustable period
- The comparator here takes 2 variables, value[3:0] and q[3:0], as inputs

Optimized basic counter

    process (clk, rst_n)
    begin
      if rst_n...
      elsif clk'event and clk = '1' then
        if (count = 0) then
          count <= value1;     -- initialize
        else
          count <= count - 1;  -- decrement
        end if;
      end if;
    end process;

- Now we count down to constant zero
- The comparator takes 1 variable, q[3:0], and a constant 0...0

[Figure: cost table of an example 0.55 um standard-cell CMOS implementation. Subscripts: a = area-optimized, d = delay-optimized]
- Asymptotic cost:
  - Nand: area is O(n) and time O(1)
  - ">": area is O(n) and time O(n)

Background: Big-O notation for algorithmic complexity
- A way to approximate how the cost increases with the number of inputs n
- Function f(n) belongs to class O(g(n)) if n0 and c can be found to satisfy: f(n) < c*g(n) for any n, n > n0
- g(n) is a simple function: 1, log2(n), n, n*log2(n), n^2, n^3, 2^n
- The following are O(n^2): [examples not preserved in the transcript]
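
As a worked instance of the Big-O definition (the function f below is an illustrative assumption, not taken from the slides), one can check that a quadratic with lower-order terms is O(n^2) by exhibiting suitable constants c and n0:

```latex
\[
  f(n) = 3n^2 + 5n + 2, \qquad g(n) = n^2, \qquad c = 4
\]
\[
  f(n) < c\,g(n)
  \iff 3n^2 + 5n + 2 < 4n^2
  \iff n^2 - 5n - 2 > 0
\]
\[
  n^2 - 5n - 2 > 0 \ \text{for all } n > n_0 = 5
  \quad\Longrightarrow\quad f(n) \in O(n^2)
\]
```

At n = 6 the left-hand side is 36 - 30 - 2 = 4 > 0, and it keeps growing for larger n, so the constants c = 4 and n0 = 5 satisfy the definition.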

Interpretation of Big-O
- Filter out the "interference": constants and less important terms
- Algorithms with O(2^n) are intractable, but already O(n^3) is very bad
  - Not realistic for a larger n
  - Frequently, tractable algorithms for a sub-optimal solution exist
- One may develop a heuristic algorithm
  - Heuristics do not guarantee an optimal solution, but usually provide a rather good one at acceptable cost
  - They often utilize pseudo-random choices
  - For example, simulated annealing and genetic algorithms

Specific to FPGA (details in lec10)
- A lot of registers: use them
  - Aggressive pipelining
  - The objective is to hide the routing delays as much as possible
  - Simple logic stages between registers
- Adders
  - Generally, it's not beneficial to share adders
  - FPGAs (e.g. Altera) often contain special structures for adders
  - Sharing an adder may cost as much as the adder itself
- Hard macros
  - Use whenever appropriate
  - Higher performance than building one with the FPGA's native resources
  - "They are there anyway"
  - Embedded multipliers and small SRAMs are common

FPGA #2
- Get to know the properties of the device
- E.g. FPGA on-chip memories are typically multiples of 9 bits wide
  - The ninth bit can be used for control or as a parity bit
  - Otherwise, it is wasted!
- Memories are typically dual-ported; take advantage of this

Conclusions
- Finite state machines can be coded in a variety of ways
  - Prefer simplicity, according to in-house coding rules
- Coding style has a profound effect on the quality of the hardware
  - Area, max clock frequency
  - Loops
  - Complex assignment logic creates a sea of multiplexers
    - E.g. a variable-amount left-right shifter
- Synthesis tools create different but functionally equivalent netlists even for small designs
- Know your FPGA!
  - You might save area and time by using some hard-coded macros
  - However, these are tricks that you should only use in the final optimization phase

References
1. James Ball, "Designing Soft-Core Processors for FPGAs". In: Jari Nurmi (ed.), "Processor Design: System-on-Chip Computing for ASICs and FPGAs".

Types
- Using your own types may significantly clarify the code
- Declaration:

    TYPE location IS RECORD
      x     : INTEGER range 0 to location_max_c-1;
      y     : INTEGER range 0 to location_max_c-1;
      valid : std_logic;
    END RECORD;

    TYPE locations_type IS ARRAY (0 to 3) of location;
    SIGNAL loc_r : locations_type;

- Usage:

    for i in 0 to 3 loop
      loc_r(i).x     <= i;
      loc_r(i).y     <= 3-i;
      loc_r(i).valid <= '1';
    end loop;

- Resulting contents:

    Element    x  y  valid
    loc_r(0)   0  3  '1'
    loc_r(1)   1  2  '1'
    loc_r(2)   2  1  '1'
    loc_r(3)   3  0  '1'

Types #2
- Initialization of an array constant:

    constant a_bound_c : integer := 2;
    type vector_2d is array (0 to a_bound_c-1) of std_logic_vector(1 downto 0);
    type vector_3d is array (0 to a_bound_c-1) of vector_2d;
    constant initial_values_c : vector_3d := (("00", "01"),
                                              ("10", "11"));

- You may split the initialization onto multiple lines to increase readability
- Contents of initial_values_c (c0 = vertical index, c1 = horizontal index):

    c0\c1   0     1
    0       "00"  "01"
    1       "10"  "11"

Types #3
- Special case: a single-element array has to be initialized with named association:

    constant ip_types_c : integer := 1;
    type ip_vect is array (0 to ip_types_c-1) of integer;

    constant ip_amount_c  : ip_vect := (0 => 1);  -- right way
    constant ip_amount2_c : ip_vect := (1);       -- does not work!
    constant ip_amount2_c : ip_vect := 1;         -- does not work!

    ** Error: rtm_pkg.vhd(20): Integer literal 1 is not of type ip_vect.

- There is only a single value, but it is an array nonetheless

Not supported by synthesis
- Signals in packages (global signals)
- Signal and variable initialization
  - Typically ignored (there are exceptions, e.g. Xilinx FPGA synthesis)
- Unconstrained while and for loops
- More than one 'event in a process
- Multiple wait statements
- Physical types, for example time
- Access types
- File types
- Guard expression
- (Sensitivity lists, delays and asserts are ignored)

CMOS cell layout example
- The result of compilation of one cell (basic gate, DFF, mux etc.)
[Figure: cell layout with labeled layers: metallization, contact, polysilicon, diffusion p, diffusion n; two transistors in series are marked. Source: [M. Perkows, Class VHDL 99, June 2008, http://web.cecs.pdx.edu/~mperkows/CLASS_VHDL_99/June2008/]]

Standard-cell layout example
- Logic has been mapped to basic cells, which have been placed and routed
- In std-cell technology, there are specific rows for placing logic and for wiring
- All logic cells in a row share the same voltage supply and ground
- Note that the cells have uniform height but different widths