vhdl reference

E&CE 327: Digital Systems Engineering
Course Notes

(with Solutions)

Mark Aagaard
2011t1–Winter

University of Waterloo
Dept of Electrical and Computer Engineering

Contents

I Course Notes

1 VHDL
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
1.1.2 VHDL Origins and History
1.1.3 Semantics
1.1.4 Synthesis of a Simulation-Based Language
1.1.5 Solution to Synthesis Sanity
1.1.6 Standard Logic 1164
1.2 Comparison of VHDL to Other Hardware Description Languages
1.2.1 VHDL Disadvantages
1.2.2 VHDL Advantages
1.2.3 VHDL and Other Languages
1.2.3.1 VHDL vs Verilog
1.2.3.2 VHDL vs System Verilog
1.2.3.3 VHDL vs SystemC
1.2.3.4 Summary of VHDL Evaluation
1.3 Overview of Syntax
1.3.1 Syntactic Categories
1.3.2 Library Units
1.3.3 Entities and Architecture
1.3.4 Concurrent Statements
1.3.5 Component Declaration and Instantiations
1.3.6 Processes
1.3.7 Sequential Statements
1.3.8 A Few More Miscellaneous VHDL Features
1.4 Concurrent vs Sequential Statements
1.4.1 Concurrent Assignment vs Process
1.4.2 Conditional Assignment vs If Statements
1.4.3 Selected Assignment vs Case Statement
1.4.4 Coding Style
1.5 Overview of Processes
1.5.1 Combinational Process vs Clocked Process
1.5.2 Latch Inference
1.5.3 Combinational vs Flopped Signals
1.6 Details of Process Execution
1.6.1 Simple Simulation
1.6.2 Temporal Granularities of Simulation
1.6.3 Intuition Behind Delta-Cycle Simulation
1.6.4 Definitions and Algorithm
1.6.4.1 Process Modes
1.6.4.2 Simulation Algorithm
1.6.4.3 Delta-Cycle Definitions
1.6.5 Example 1: Process Execution (Bamboozle)
1.6.6 Example 2: Process Execution (Flummox)
1.6.7 Example: Need for Provisional Assignments
1.6.8 Delta-Cycle Simulations of Flip-Flops
1.7 Register-Transfer-Level Simulation
1.7.1 Overview
1.7.2 Technique for Register-Transfer Level Simulation
1.7.3 Examples of RTL Simulation
1.7.3.1 RTL Simulation Example 1
1.8 VHDL and Hardware Building Blocks
1.8.1 Basic Building Blocks
1.8.2 Deprecated Building Blocks for RTL
1.8.2.1 An Aside on Flip-Flops and Latches
1.8.2.2 Deprecated Hardware
1.8.3 Hardware and Code for Flops
1.8.3.1 Flops with Waits and Ifs
1.8.3.2 Flops with Synchronous Reset
1.8.3.3 Flops with Chip-Enable
1.8.3.4 Flop with Chip-Enable and Mux on Input
1.8.3.5 Flops with Chip-Enable, Muxes, and Reset
1.8.4 An Example Sequential Circuit
1.9 Arrays and Vectors
1.10 Arithmetic
1.10.1 Arithmetic Packages
1.10.2 Shift and Rotate Operations
1.10.3 Overloading of Arithmetic
1.10.4 Different Widths and Arithmetic
1.10.5 Overloading of Comparisons
1.10.6 Different Widths and Comparisons
1.10.7 Type Conversion
1.11 Synthesizable vs Non-Synthesizable Code
1.11.1 Unsynthesizable Code
1.11.1.1 Initial Values
1.11.1.2 Wait For
1.11.1.3 Different Wait Conditions
1.11.1.4 Multiple “if rising_edge” in Process
1.11.1.5 “if rising_edge” and “wait” in Same Process
1.11.1.6 “if rising_edge” with “else” Clause
1.11.1.7 “if rising_edge” Inside a “for” Loop
1.11.1.8 “wait” Inside of a “for loop”
1.11.2 Synthesizable, but Bad Coding Practices
1.11.2.1 Asynchronous Reset
1.11.2.2 Combinational “if-then” Without “else”
1.11.2.3 Bad Form of Nested Ifs
1.11.2.4 Deeply Nested Ifs
1.11.3 Synthesizable, but Unpredictable Hardware
1.12 Synthesizable VHDL Coding Guidelines
1.12.1 Signal Declarations
1.12.2 Flip-Flops and Latches
1.12.3 Inputs and Outputs
1.12.4 Multiplexors and Tri-State Signals
1.12.5 Processes
1.12.6 State Machines
1.12.7 Reset
1.13 VHDL Problems
P1.1 IEEE 1164
P1.2 VHDL Syntax
P1.3 Flops, Latches, and Combinational Circuitry
P1.4 Counting Clock Cycles
P1.5 Arithmetic Overflow
P1.6 Delta-Cycle Simulation: Pong
P1.7 Delta-Cycle Simulation: Baku
P1.8 Clock-Cycle Simulation
P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl
P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega
P1.11 Waveform — VHDL Behavioural Comparison
P1.12 Hardware — VHDL Comparison
P1.13 8-Bit Register
P1.13.1 Asynchronous Reset
P1.13.2 Discussion
P1.13.3 Testbench for Register
P1.14 Synthesizable VHDL and Hardware
P1.15 Datapath Design
P1.15.1 Correct Implementation?
P1.15.2 Smallest Area
P1.15.3 Shortest Clock Period

2 RTL Design with VHDL
2.1 Prelude to Chapter
2.1.1 A Note on EDA for FPGAs and ASICs
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
2.2.2 Area Estimation
2.2.2.1 Interconnect for Generic FPGA
2.2.2.2 Blocks of Cells for Generic FPGA
2.2.2.3 Clocks for Generic FPGAs
2.2.2.4 Special Circuitry in FPGAs
2.2.3 Generic-FPGA Coding Guidelines
2.3 Design Flow
2.3.1 Generic Design Flow
2.3.2 Implementation Flows
2.3.3 Design Flow: Datapath vs Control vs Storage
2.3.3.1 Classes of Hardware
2.3.3.2 Datapath-Centric Design Flow
2.3.3.3 Control-Centric Design Flow
2.3.3.4 Storage-Centric Design Flow
2.4 Algorithms and High-Level Models
2.4.1 Flow Charts and State Machines
2.4.2 Data-Dependency Graphs
2.4.3 High-Level Models
2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design
2.5.1.1 Mealy vs Moore State Machines
2.5.1.2 Introduction to State Machines and VHDL
2.5.1.3 Explicit vs Implicit State Machines
2.5.2 Implementing a Simple Moore Machine
2.5.2.1 Implicit Moore State Machine
2.5.2.2 Explicit Moore with Flopped Output
2.5.2.3 Explicit Moore with Combinational Outputs
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
2.5.2.5 Explicit-Current+Next Moore with Combinational Process
2.5.3 Implementing a Simple Mealy Machine
2.5.3.1 Implicit Mealy State Machine
2.5.3.2 Explicit Mealy State Machine
2.5.3.3 Explicit-Current+Next Mealy
2.5.4 Reset
2.5.5 State Encoding
2.5.5.1 Constants vs Enumerated Type
2.5.5.2 Encoding Schemes
2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
2.6.2 Dataflow Diagrams, Hardware, and Behaviour
2.6.3 Dataflow Diagram Execution
2.6.4 Performance Estimation
2.6.5 Area Estimation
2.6.6 Design Analysis
2.6.7 Area / Performance Tradeoffs
2.7 Design Example: Massey
2.7.1 Requirements
2.7.2 Algorithm
2.7.3 Initial Dataflow Diagram
2.7.4 Dataflow Diagram Scheduling
2.7.5 Optimize Inputs and Outputs
2.7.6 Input/Output Allocation
2.7.7 Register Allocation
2.7.8 Datapath Allocation
2.7.9 Datapath for DP+Ctrl Model
2.7.10 Peephole Optimizations
2.8 Design Example: Vanier
2.8.1 Requirements
2.8.2 Algorithm
2.8.3 Initial Dataflow Diagram
2.8.4 Reschedule to Meet Requirements
2.8.5 Optimize Resources
2.8.6 Assign Names to Registered Values
2.8.7 Input/Output Allocation
2.8.8 Tangent: Combinational Outputs
2.8.9 Register Allocation
2.8.10 Datapath Allocation
2.8.11 Hardware Block Diagram and State Machine
2.8.11.1 Control for Registers
2.8.11.2 Control for Datapath Components
2.8.11.3 Control for State
2.8.11.4 Complete State Machine Table
2.8.12 VHDL Code with Explicit State Machine
2.8.13 Peephole Optimizations
2.8.14 Notes and Observations
2.9 Pipelining
2.9.1 Introduction to Pipelining
2.9.2 Partially Pipelined
2.9.3 Terminology
2.10 Design Example: Pipelined Massey
2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations
2.11.2 Memory Arrays in VHDL
2.11.2.1 Using a Two-Dimensional Array for Memory
2.11.2.2 Memory Arrays in Hardware
2.11.2.3 VHDL Code for Single-Port Memory Array
2.11.2.4 Using Library Components for Memory
2.11.2.5 Build Memory from Slices
2.11.2.6 Dual-Ported Memory
2.11.3 Data Dependencies
2.11.4 Memory Arrays and Dataflow Diagrams
2.11.5 Example: Memory Array and Dataflow Diagram
2.12 Input / Output Protocols
2.13 Example: Moving Average
2.13.1 Requirements and Environmental Assumptions
2.13.2 Algorithm
2.13.3 Pseudocode and Dataflow Diagrams
2.13.4 Control Tables and State Machine
2.13.5 VHDL Code
2.14 Design Problems
P2.1 Synthesis
P2.1.1 Data Structures
P2.1.2 Own Code vs Libraries
P2.2 Design Guidelines
P2.3 Dataflow Diagram Optimization
P2.3.1 Resource Usage
P2.3.2 Optimization
P2.4 Dataflow Diagram Design
P2.4.1 Maximum Performance
P2.4.2 Minimum area
P2.5 Michener: Design and Optimization
P2.6 Dataflow Diagrams with Memory Arrays
P2.6.1 Algorithm 1
P2.6.2 Algorithm 2
P2.7 2-bit adder
P2.7.1 Generic Gates
P2.7.2 FPGA
P2.8 Sketches of Problems

3 Performance Analysis and Optimization
3.1 Introduction
3.2 Defining Performance
3.3 Comparing Performance
3.3.1 General Equations
3.3.2 Example: Performance of Printers
3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics
3.4.2 Example: CISC vs RISC and CPI
3.4.3 Effect of Instruction Set on Performance
3.4.4 Effect of Time to Market on Relative Performance
3.4.5 Summary of Equations
3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed
3.5.2 Examples of Dataflow Diagrams for Two Instructions
3.5.2.1 Scheduling of Operations for Different Clock Periods
3.5.2.2 Performance Computation for Different Clock Periods
3.5.2.3 Example: Two Instructions Taking Similar Time
3.5.2.4 Example: Same Total Time, Different Order for A
3.5.3 Example: From Algorithm to Optimized Dataflow
3.6 General Optimizations
3.6.1 Strength Reduction
3.6.1.1 Arithmetic Strength Reduction
3.6.1.2 Boolean Strength Reduction
3.6.2 Replication and Sharing
3.6.2.1 Mux-Pushing
3.6.2.2 Common Subexpression Elimination
3.6.2.3 Computation Replication
3.6.3 Arithmetic
3.7 Retiming
3.8 Performance Analysis and Optimization Problems
P3.1 Farmer
P3.2 Network and Router
P3.2.1 Maximum Throughput
P3.2.2 Packet Size and Performance
P3.3 Performance Short Answer
P3.4 Microprocessors
P3.4.1 Average CPI
P3.4.2 Why not you too?
P3.4.3 Analysis
P3.5 Dataflow Diagram Optimization
P3.6 Performance Optimization with Memory Arrays
P3.7 Multiply Instruction
P3.7.1 Highest Performance
P3.7.2 Performance Metrics

4 Functional Verification
4.1 Introduction
4.1.1 Purpose
4.2 Overview
4.2.1 Terminology: Validation / Verification / Testing
4.2.2 The Difficulty of Designing Correct Chips
4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)
4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
4.3 Test Cases and Coverage
4.3.1 Test Terminology
4.3.2 Coverage
4.3.3 Floating Point Divider Example
4.4 Testbenches
4.4.1 Overview of Test Benches
4.4.2 Reference Model Style Testbench
4.4.3 Relational Style Testbench
4.4.4 Coding Structure of a Testbench
4.4.5 Datapath vs Control
4.4.6 Verification Tips
4.5 Functional Verification for Datapath Circuits
4.5.1 A Spec-Less Testbench
4.5.2 Use an Array for Test Vectors
4.5.3 Build Spec into Stimulus
4.5.4 Have Separate Specification Entity
4.5.5 Generate Test Vectors Automatically
4.5.6 Relational Specification
4.6 Functional Verification of Control Circuits
4.6.1 Overview of Queues in Hardware
4.6.2 VHDL Coding
4.6.2.1 Package
4.6.2.2 Other VHDL Coding
4.6.3 Code Structure for Verification
4.6.4 Instrumentation Code
4.6.5 Coverage Monitors
4.6.6 Assertions
4.6.7 VHDL Coding Tips
4.6.8 Queue Specification
4.6.9 Queue Testbench
4.7 Example: Microwave Oven
4.8 Functional Verification Problems
P4.1 Carry Save Adder
P4.2 Traffic Light Controller
P4.2.1 Functionality
P4.2.2 Boundary Conditions
P4.2.3 Assertions
P4.3 State Machines and Verification
P4.3.1 Three Different State Machines
P4.3.2 State Machines in General
P4.4 Test Plan Creation
P4.4.1 Early Tests
P4.4.2 Corner Cases
P4.5 Sketches of Problems

5 Timing Analysis
5.1 Delays and Definitions
5.1.1 Background Definitions
5.1.2 Clock-Related Timing Definitions
5.1.2.1 Clock Skew
5.1.2.2 Clock Latency
5.1.2.3 Clock Jitter
5.1.3 Storage-Related Timing Definitions
5.1.3.1 Flops and Latches
5.1.3.2 Timing Parameters for a Flop
5.1.3.3 Hold Time
5.1.3.4 Clock-to-Q Time
5.1.4 Propagation Delays
5.1.4.1 Load Delays
5.1.4.2 Interconnect Delays
5.1.5 Summary of Delay Factors
5.1.6 Timing Constraints
5.1.6.1 Minimum Clock Period
5.1.6.2 Hold Constraint
5.1.6.3 Example Timing Violations
5.2 Timing Analysis of Latches and Flip Flops
5.2.1 Simple Multiplexer Latch
5.2.1.1 Structure and Behaviour of Multiplexer Latch
5.2.1.2 Strategy for Timing Analysis of Storage Devices
5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
5.2.1.4 Setup Timing of a Multiplexer Latch
5.2.1.5 Hold Time of a Multiplexer Latch
5.2.1.6 Example of a Bad Latch
5.2.2 Timing Analysis of Transmission-Gate Latch
5.2.2.1 Structure and Behaviour of a Transmission Gate
5.2.2.2 Structure and Behaviour of Transmission-Gate Latch
5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch
5.2.2.4 Setup and Hold Times for Transmission-Gate Latch
5.2.3 Falling Edge Flip Flop
5.2.3.1 Structure and Behaviour of Flip-Flop
5.2.3.2 Clock-to-Q of Flip-Flop
5.2.3.3 Setup of Flip-Flop
5.2.3.4 Hold of Flip-Flop . . . . . . . . . . . . . . . . . . . . . . . . . 3325.2.4 Timing Analysis of FPGA Cells . . . . . . . . . . . . . . . . . . . . .. . 332

5.2.4.1 Standard Timing Equations . . . . . . . . . . . . . . . . . . . . 3335.2.4.2 Hierarchical Timing Equations . . . . . . . . . . . . . . . . .. 3335.2.4.3 Actel Act 2 Logic Cell . . . . . . . . . . . . . . . . . . . . . . . 3335.2.4.4 Timing Analysis of Actel Sequential Module . . . . . . .. . . . 335

5.2.5 Exotic Flop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3365.3 Critical Paths and False Paths . . . . . . . . . . . . . . . . . . . . . .. . . . . . . 336

5.3.1 Introduction to Critical and False Paths
5.3.1.1 Example of Critical Path in Full Adder
5.3.1.2 Preliminaries for Critical Paths
5.3.1.3 Longest Path and Critical Path
5.3.1.4 Timing Simulation vs Static Timing Analysis
5.3.2 Longest Path
5.3.3 Detecting a False Path
5.3.3.1 Preliminaries for Detecting a False Path
5.3.3.2 Almost-Correct Algorithm to Detect a False Path
5.3.3.3 Examples of Detecting False Paths
5.3.4 Finding the Next Candidate Path
5.3.4.1 Algorithm to Find Next Candidate Path
5.3.4.2 Examples of Finding Next Candidate Path
5.3.5 Correct Algorithm to Find Critical Path
5.3.5.1 Rules for Late Side Inputs
5.3.5.2 Monotone Speedup
5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
5.3.5.4 Complete Algorithm
5.3.5.5 Complete Examples
5.3.6 Further Extensions to Critical Path Analysis
5.3.7 Increasing the Accuracy of Critical Path Analysis

5.4 Elmore Timing Model
5.4.1 RC-Networks for Timing Analysis
5.4.2 Derivation of Analog Timing Model
5.4.2.1 Example Derivation: Equation for Voltage at Node 3
5.4.2.2 General Derivation
5.4.3 Elmore Timing Model
5.4.4 Examples of Using Elmore Delay
5.4.4.1 Interconnect with Single Fanout
5.4.4.2 Interconnect with Multiple Gates in Fanout

5.5 Practical Usage of Timing Analysis
5.5.1 Speed Binning
5.5.1.1 FPGAs, Interconnect, and Synthesis
5.5.2 Worst Case Timing
5.5.2.1 Fanout delay
5.5.2.2 Derating Factors
5.6 Timing Analysis Problems

P5.1 Terminology
P5.2 Hold Time Violations
P5.2.1 Cause
P5.2.2 Behaviour
P5.2.3 Rectification
P5.3 Latch Analysis
P5.4 Critical Path and False Path
P5.5 Critical Path
P5.5.1 Longest Path
P5.5.2 Delay
P5.5.3 Missing Factors
P5.5.4 Critical Path or False Path?
P5.6 YACP: Yet Another Critical Path
P5.7 Timing Models
P5.8 Short Answer
P5.8.1 Wires in FPGAs
P5.8.2 Age and Time
P5.8.3 Temperature and Delay
P5.9 Worst Case Conditions and Derating Factor
P5.9.1 Worst-Case Commercial
P5.9.2 Worst-Case Industrial
P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

6 Power Analysis and Power-Aware Design
6.1 Overview
6.1.1 Importance of Power and Energy
6.1.2 Industrial Names and Products
6.1.3 Power vs Energy
6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?
6.1.4.2 Battery Life and Efficiency
6.1.4.3 Battery Life and Power
6.2 Power Equations
6.2.1 Switching Power
6.2.2 Short-Circuited Power
6.2.3 Leakage Power
6.2.4 Glossary
6.2.5 Note on Power Equations
6.3 Overview of Power Reduction Techniques
6.4 Voltage Reduction for Power Reduction
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
6.5.2 Example Problem: Sixteen Pulser
6.5.2.1 Problem Statement
6.5.2.2 Additional Information

6.5.2.3 Answer
6.6 Clock Gating
6.6.1 Introduction to Clock Gating
6.6.2 Implementing Clock Gating
6.6.3 Design Process
6.6.4 Effectiveness of Clock Gating
6.6.5 Example: Reduced Activity Factor with Clock Gating
6.6.6 Clock Gating with Valid-Bit Protocol
6.6.6.1 Valid-Bit Protocol
6.6.6.2 How Many Clock Cycles for Module?
6.6.6.3 Adding Clock-Gating Circuitry
6.6.7 Example: Pipelined Circuit with Clock-Gating
6.7 Power Problems

P6.1 Short Answers
P6.1.1 Power and Temperature
P6.1.2 Leakage Power
P6.1.3 Clock Gating
P6.1.4 Gray Coding
P6.2 VLSI Gurus
P6.2.1 Effect on Power
P6.2.2 Critique
P6.3 Advertising Ratios
P6.4 Vary Supply Voltage
P6.5 Clock Speed Increase Without Power Increase
P6.5.1 Supply Voltage
P6.5.2 Supply Voltage
P6.6 Power Reduction Strategies
P6.6.1 Supply Voltage
P6.6.2 Transistor Sizing
P6.6.3 Adding Registers to Inputs
P6.6.4 Gray Coding
P6.7 Power Consumption on New Chip
P6.7.1 Hypothesis
P6.7.2 Experiment
P6.7.3 Reality

7 Fault Testing and Testability
7.1 Faults and Testing
7.1.1 Overview of Faults and Testing
7.1.1.1 Faults
7.1.1.2 Causes of Faults
7.1.1.3 Testing
7.1.1.4 Burn In
7.1.1.5 Bin Sorting
7.1.1.6 Testing Techniques
7.1.1.7 Design for Testability (DFT)
7.1.2 Example Problem: Economics of Testing
7.1.3 Physical Faults
7.1.3.1 Types of Physical Faults
7.1.3.2 Locations of Faults
7.1.3.3 Layout Affects Locations
7.1.3.4 Naming Fault Locations
7.1.4 Detecting a Fault
7.1.4.1 Which Test Vectors will Detect a Fault?
7.1.5 Mathematical Models of Faults
7.1.5.1 Single Stuck-At Fault Model
7.1.6 Generate Test Vector to Find a Mathematical Fault
7.1.6.1 Algorithm
7.1.6.2 Example of Finding a Test Vector
7.1.7 Undetectable Faults
7.1.7.1 Redundant Circuitry
7.1.7.2 Curious Circuitry and Fault Detection

7.2 Test Generation
7.2.1 A Small Example
7.2.2 Choosing Test Vectors
7.2.2.1 Fault Domination
7.2.2.2 Fault Equivalence
7.2.2.3 Gate Collapsing
7.2.2.4 Node Collapsing
7.2.2.5 Fault Collapsing Summary
7.2.3 Fault Coverage
7.2.4 Test Vector Generation and Fault Detection
7.2.5 Generate Test Vectors for 100% Coverage
7.2.5.1 Collapse the Faults
7.2.5.2 Check for Fault Domination
7.2.5.3 Required Test Vectors
7.2.5.4 Faults Not Covered by Required Test Vectors
7.2.5.5 Order to Run Test Vectors
7.2.5.6 Summary of Technique to Find and Order Test Vectors
7.2.5.7 Complete Analysis
7.2.6 One Fault Hiding Another

7.3 Scan Testing in General
7.3.1 Structure and Behaviour of Scan Testing
7.3.2 Scan Chains
7.3.2.1 Circuitry in Normal and Scan Mode
7.3.2.2 Scan in Operation
7.3.2.3 Scan in Operation with Example Circuit
7.3.3 Summary of Scan Testing
7.3.4 Time to Test a Chip
7.3.4.1 Example: Time to Test a Chip
7.4 Boundary Scan and JTAG
7.4.1 Boundary Scan History
7.4.2 JTAG Scan Pins
7.4.3 Scan Registers and Cells
7.4.4 Scan Instructions
7.4.5 TAP Controller
7.4.6 Other descriptions of JTAG/IEEE 1149.1

7.5 Built In Self Test
7.5.1 Block Diagram
7.5.1.1 Components
7.5.1.2 Linear Feedback Shift Register (LFSR)
7.5.1.3 Maximal-Length LFSR
7.5.2 Test Generator
7.5.3 Signature Analyzer
7.5.4 Result Checker
7.5.5 Arithmetic over Binary Fields
7.5.6 Shift Registers and Characteristic Polynomials
7.5.6.1 Circuit Multiplication
7.5.7 Bit Streams and Characteristic Polynomials
7.5.8 Division
7.5.9 Signature Analysis: Math and Circuits
7.5.10 Summary
7.6 Scan vs Self Test
7.7 Problems on Faults, Testing, and Testability

P7.1 Based on Smith q14.9: Testing Cost
P7.2 Testing Cost and Total Cost
P7.3 Minimum Number of Faults
P7.4 Smith q14.10: Fault Collapsing
P7.5 Mathematical Models and Reality
P7.6 Undetectable Faults
P7.7 Test Vector Generation
P7.7.1 Choice of Test Vectors
P7.7.2 Number of Test Vectors
P7.8 Time to do a Scan Test
P7.9 BIST
P7.9.1 Characteristic Polynomials

P7.9.2 Test Generation
P7.9.3 Signature Analyzer
P7.9.4 Probability of Catching a Fault
P7.9.5 Probability of Catching a Fault
P7.9.6 Detecting a Specific Fault
P7.9.7 Time to Run Test
P7.10 Power and BIST
P7.11 Timing Hazards and Testability
P7.12 Testing Short Answer
P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
P7.13 Fault Testing
P7.13.1 Design test generator
P7.13.2 Design signature analyzer
P7.13.3 Determine if a fault is detectable
P7.13.4 Testing time

8 Review
8.1 Overview of the Term
8.2 VHDL
8.2.1 VHDL Topics
8.2.2 VHDL Example Problems
8.3 RTL Design Techniques
8.3.1 Design Topics
8.3.2 Design Example Problems
8.4 Functional Verification
8.4.1 Verification Topics
8.4.2 Verification Example Problems
8.5 Performance Analysis and Optimization
8.5.1 Performance Topics
8.5.2 Performance Example Problems
8.6 Timing Analysis
8.6.1 Timing Topics
8.6.2 Timing Example Problems
8.7 Power
8.7.1 Power Topics
8.7.2 Power Example Problems
8.8 Testing
8.8.1 Testing Topics
8.8.2 Testing Example Problems
8.9 Formulas to be Given on Final Exam

II Solutions to Assignment Problems

1 VHDL Problems
P1.1 IEEE 1164
P1.2 VHDL Syntax
P1.3 Flops, Latches, and Combinational Circuitry
P1.4 Counting Clock Cycles
P1.5 Arithmetic Overflow
P1.6 Delta-Cycle Simulation: Pong
P1.7 Delta-Cycle Simulation: Baku
P1.8 Clock-Cycle Simulation
P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl
P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega
P1.11 Waveform — VHDL Behavioural Comparison
P1.12 Hardware — VHDL Comparison
P1.13 8-Bit Register
P1.13.1 Asynchronous Reset
P1.13.2 Discussion
P1.13.3 Testbench for Register
P1.14 Synthesizable VHDL and Hardware
P1.15 Datapath Design
P1.15.1 Correct Implementation?
P1.15.2 Smallest Area
P1.15.3 Shortest Clock Period

2 Design Problems
P2.1 Synthesis
P2.1.1 Data Structures
P2.1.2 Own Code vs Libraries
P2.2 Design Guidelines
P2.3 Dataflow Diagram Optimization
P2.3.1 Resource Usage
P2.3.2 Optimization
P2.4 Dataflow Diagram Design
P2.4.1 Maximum Performance
P2.4.2 Minimum area
P2.5 Michener: Design and Optimization
P2.6 Dataflow Diagrams with Memory Arrays
P2.6.1 Algorithm 1
P2.6.2 Algorithm 2
P2.7 2-bit adder
P2.7.1 Generic Gates
P2.7.2 FPGA
P2.8 Sketches of Problems

3 Functional Verification Problems
P3.1 Carry Save Adder
P3.2 Traffic Light Controller
P3.2.1 Functionality
P3.2.2 Boundary Conditions
P3.2.3 Assertions
P3.3 State Machines and Verification
P3.3.1 Three Different State Machines
P3.3.1.1 Number of Test Scenarios
P3.3.1.2 Length of Test Scenario
P3.3.1.3 Number of Flip Flops
P3.3.2 State Machines in General
P3.4 Test Plan Creation
P3.4.1 Early Tests
P3.4.2 Corner Cases
P3.5 Sketches of Problems

4 Performance Analysis and Optimization Problems
P4.1 Farmer
P4.2 Network and Router
P4.2.1 Maximum Throughput
P4.2.2 Packet Size and Performance
P4.3 Performance Short Answer
P4.4 Microprocessors
P4.4.1 Average CPI
P4.4.2 Why not you too?
P4.4.3 Analysis
P4.5 Dataflow Diagram Optimization
P4.6 Performance Optimization with Memory Arrays
P4.7 Multiply Instruction
P4.7.1 Highest Performance
P4.7.2 Performance Metrics

5 Timing Analysis Problems
P5.1 Terminology
P5.2 Hold Time Violations
P5.2.1 Cause
P5.2.2 Behaviour
P5.2.3 Rectification
P5.3 Latch Analysis
P5.4 Critical Path and False Path
P5.5 Critical Path
P5.5.1 Longest Path
P5.5.2 Delay
P5.5.3 Missing Factors
P5.5.4 Critical Path or False Path?
P5.6 YACP: Yet Another Critical Path
P5.7 Timing Models
P5.8 Short Answer
P5.8.1 Wires in FPGAs
P5.8.2 Age and Time
P5.8.3 Temperature and Delay
P5.9 Worst Case Conditions and Derating Factor
P5.9.1 Worst-Case Commercial
P5.9.2 Worst-Case Industrial
P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

6 Power Problems
P6.1 Short Answers
P6.1.1 Power and Temperature
P6.1.2 Leakage Power
P6.1.3 Clock Gating
P6.1.4 Gray Coding
P6.2 VLSI Gurus
P6.2.1 Effect on Power
P6.2.2 Critique
P6.3 Advertising Ratios
P6.4 Vary Supply Voltage
P6.5 Clock Speed Increase Without Power Increase
P6.5.1 Supply Voltage
P6.5.2 Supply Voltage
P6.6 Power Reduction Strategies
P6.6.1 Supply Voltage
P6.6.2 Transistor Sizing
P6.6.3 Adding Registers to Inputs
P6.6.4 Gray Coding
P6.7 Power Consumption on New Chip
P6.7.1 Hypothesis
P6.7.2 Experiment
P6.7.3 Reality

7 Problems on Faults, Testing, and Testability
P7.1 Based on Smith q14.9: Testing Cost
P7.2 Testing Cost and Total Cost
P7.3 Minimum Number of Faults
P7.4 Smith q14.10: Fault Collapsing
P7.5 Mathematical Models and Reality
P7.6 Undetectable Faults
P7.7 Test Vector Generation
P7.7.1 Choice of Test Vectors
P7.7.2 Number of Test Vectors
P7.8 Time to do a Scan Test
P7.9 BIST
P7.9.1 Characteristic Polynomials
P7.9.2 Test Generation
P7.9.3 Signature Analyzer
P7.9.4 Probability of Catching a Fault
P7.9.5 Probability of Catching a Fault
P7.9.6 Detecting a Specific Fault
P7.9.7 Time to Run Test
P7.10 Power and BIST
P7.11 Timing Hazards and Testability
P7.12 Testing Short Answer
P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?
P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?
P7.13 Fault Testing
P7.13.1 Design test generator
P7.13.2 Design signature analyzer
P7.13.3 Determine if a fault is detectable
P7.13.4 Testing time

Part I

Course Notes


Chapter 1

VHDL: The Language

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

There are many different levels of abstraction for working with hardware:

• Quantum: Schrödinger’s equations describe the movement of electrons and holes through material.

• Energy band: 2-dimensional diagrams that capture essential features of Schrödinger’s equations. Energy-band diagrams are commonly used in nano-scale engineering.

• Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network. Overall behaviour is defined by differential equations in terms of the resistors and capacitors. Spice is a typical simulation tool.

• Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used, rather than differential equations. A rising edge may be modeled as a linear rise over some range of time, or the time between a definite low value and a definite high value may be modeled as having an undefined or rising value.

• Gate: Transistors are grouped together into gates (e.g. AND, OR, NOT). Voltages are discrete values such as pure Boolean (0 or 1) or IEEE Standard Logic 1164, which has representations for different types of unknown or undefined values. Time may be continuous or may be discrete. If discrete, a common unit is the delay through a single inverter (e.g. a NOT gate has a delay of 1 and an AND gate has a delay of 2).

• Register transfer level: The essential characteristic of the register transfer level is that the behaviour of hardware is modeled as assignments to registers and combinational signals. Equations are written where a register signal is a function of other signals (e.g. c = a and b;). The assignments may be either combinational or registered. Combinational assignments happen instantaneously and registered assignments take exactly one clock cycle. There are variations on the pure register-transfer level. For example, time may be measured in clock phases rather than clock cycles, so as to allow assignments on either the rising or falling edge of a clock. Another variation is to have multiple clocks that run at different speeds: a clock on a bus might run at half the speed of the primary clock for the chip.

• Transaction level: The basic unit of computation is a transaction, such as executing an instruction on a microprocessor, transferring data across a bus, or accessing memory. Time is usually measured as an estimate (e.g. a memory write requires 15 clock cycles, or a bus transfer requires 250 ns). The building blocks of the transaction level are processors, controllers, memory arrays, busses, and intellectual property (IP) blocks (e.g. UARTs). The behaviour of the building blocks is described with software-like models, often written in behavioural VHDL, SystemC, or SystemVerilog. The transaction level has many similarities to a software model of a distributed system.

• Electronic-system level: Looks at an entire electronic system, with both hardware and software.

In this course, we will focus on the register-transfer level. In the second half of the course, we will look at how analog phenomena, such as timing and power, affect the register-transfer level. In these chapters we will occasionally dip down into the transistor, switch, and gate levels.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language

VHSIC = Very High Speed Integrated Circuit

The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware.

Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

• development

• verification

• synthesis

• testing

• hardware designs

• communication

• maintenance

• modification


• procurement

VHDL is a lot more than synthesis of digital hardware

VHDL History

• Developed by the United States Department of Defense as part of the very high speed integrated circuit (VHSIC) program in the early 1980s.

• The Department of Defense intended VHDL to be used for the documentation, simulation and verification of electronic systems.

• Goals:

– improve design process over schematic entry

– standardize design descriptions amongst multiple vendors

– portable and extensible

• Inspired by the ADA programming language

– large: 97 keywords, 94 syntactic rules

– verbose (designed by committee)

– static type checking, overloading

– complicated syntax: parentheses are used for both expression grouping and array indexing

Example:

    a <= b * (3 + c); -- integer
    a <= (3 + c);     -- 1-element array of integers

• Standardized by IEEE in 1987 (IEEE 1076-1987), revised in 1993 and 2000.

• In 1993 the IEEE standard VHDL package for model interoperability, STD_LOGIC_1164 (IEEE Standard 1164-1993), was developed.

– std_logic_1164 defines 9 different values for signals

• In 1997 the IEEE standard packages for arithmetic over std_logic and bit signals were defined (IEEE Standard 1076.3–1997).

– numeric_std defines arithmetic over vectors of std_logic (through its signed and unsigned types) and integers.

Note: This is the package that you should use for arithmetic. Don’t use std_logic_arith: it has less uniform support for mixed integer/signal arithmetic and has a greater tendency for differences between tools.

– numeric_bit defines arithmetic over bit_vectors and integers. We won’t use bit signals in this course, so you don’t need to worry about this package.
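As a minimal sketch of the recommendation to use numeric_std, the entity below adds two 4-bit values and keeps the carry. The entity name add4 and its port names are hypothetical, chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: 4-bit adder with a 5-bit result.
entity add4 is
  port (
    a, b : in  std_logic_vector(3 downto 0);
    sum  : out std_logic_vector(4 downto 0)
  );
end add4;

architecture main of add4 is
begin
  -- Interpret the inputs as unsigned numbers, widen each by one bit
  -- so the carry is kept, add, then convert back to std_logic_vector.
  sum <= std_logic_vector(resize(unsigned(a), 5) + resize(unsigned(b), 5));
end main;
```

The explicit conversions are verbose, but they state whether the bits are to be treated as unsigned or signed, which is the uniformity the note above is asking for.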


1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.

[Figure: simulation takes the code c <= a AND b; and produces waveforms for a, b, and c]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.

Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).

[Figure: synthesis takes the code c <= a AND b; and produces a gate with inputs a and b and output c]

Synthesis is a computer-aided design (CAD) technique that transforms a designer’s concise, high-level description of a circuit into a structural description of a circuit.

CAD Tools

CAD Tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.

NOTE: EDA = Electronic Design Automation. In digital hardware design EDA = CAD.

Synthesis vs Simulation

For synthesis, we want the code we write to define the structure of the hardware that is generated.

[Figure: synthesis takes the code c <= a AND b; and produces a gate with inputs a and b and output c]


The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware. The scenario below complies with the semantics of VHDL, because the two synthesized circuits produce the same behaviour. If the two synthesized circuits had different behaviour, then the scenario would not comply with the VHDL Standard.

[Figure: the code c <= a AND b; is simulated to give waveforms for a, b, and c, and is synthesized into two circuits with different structure; simulating either circuit gives the same waveforms (same behaviour)]

1.1.4 Synthesis of a Simulation-Based Language

• Not all of VHDL is synthesizable

– c <= a AND b; (synthesizable)

– c <= a AND b AFTER 2ns; (NOT synthesizable)

∗ how do you build a circuit with exactly 2ns of delay through an AND gate?

∗ more examples of non-synthesizable code are in section 1.11

– See section 1.11 for more details

• Different synthesis tools support different subsets of VHDL

• Some tools generate erroneous hardware for some code

– behaviour of hardware differs from VHDL semantics

• Some tools generate unpredictable hardware (hardware that has the correct behaviour, but undesirable or weird structure).

• There is an IEEE standard (1076.6) for a synthesizable subset of VHDL, but tool vendors don’t yet conform to it. (Most vendors still don’t have full support for the 1993 extensions to VHDL!) For more info, see http://www.vhdl.org/siwg/.
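The delayed assignment that cannot be synthesized is still legal VHDL and remains useful in simulation-only code. The sketch below is a small self-checking simulation of it; the entity name and_delay_tb and the chosen wait times are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only model: not intended for synthesis.
entity and_delay_tb is
end and_delay_tb;

architecture sim of and_delay_tb is
  signal a, b, c : std_logic := '0';
begin
  -- Delayed assignment: meaningful in simulation, rejected or
  -- ignored by synthesis tools.
  c <= a and b after 2 ns;

  stimulus : process
  begin
    a <= '1';
    b <= '1';
    wait for 1 ns;
    -- Only 1 ns has passed, so c still holds its initial value.
    assert c = '0'
      report "c changed before the 2 ns delay elapsed" severity error;
    wait for 2 ns;
    -- 3 ns have passed, so the delayed update has taken effect.
    assert c = '1'
      report "c did not follow a AND b after 2 ns" severity error;
    wait;  -- suspend forever
  end process;
end sim;
```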

1.1.5 Solution to Synthesis Sanity

• Pick a high-quality synthesis tool and study its documentation thoroughly

• Learn the idioms of the tool

• Different VHDL code with same behaviour can result in very different circuits

• Be careful if you have to port VHDL code from one tool to another


• KISS: Keep It Simple Stupid

– VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.

– Follow the coding guidelines and examples from lecture

– As you write VHDL, think about the hardware you expect to get.

Note: If you can’t predict the hardware, then the hardware probably won’t be very good (small, fast, correct, etc)

1.1.6 Standard Logic 1164

At the core of VHDL is a package named STANDARD that defines a type named bit with values of '0' and '1'. For simulation, it is helpful to have additional values, such as “undefined” and “high impedance”. Many companies created their own (incompatible) definitions of signal types for simulation. To regain compatibility amongst packages from different companies, the IEEE defined std_logic_1164 to be the standard type for signal values in VHDL simulation.

    'U'   uninitialized
    'X'   strong unknown
    '0'   strong 0
    '1'   strong 1
    'Z'   high impedance
    'W'   weak unknown
    'L'   weak 0
    'H'   weak 1
    '-'   don’t care

The most common values are: 'U', 'X', '0', '1'.

If you see 'X' in a simulation, it usually means that there is a mistake in your code.

Every VHDL file that you write should begin with:

    library ieee;
    use ieee.std_logic_1164.all;

Note: std_logic vs boolean. The std_logic values '1' and '0' are not the same as the boolean values true and false. For example, you must write if a = '1' then .... The code if a then ... will not type-check if a is of type std_logic.
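The type rule in the note can be seen in a small complete design. This is only a sketch: the entity sel_demo and its ports are hypothetical, and the remark about type checking assumes the VHDL-87/93 rules used in these notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only to illustrate the type rule.
entity sel_demo is
  port (
    a, x, y : in  std_logic;
    z       : out std_logic
  );
end sel_demo;

architecture main of sel_demo is
begin
  process (a, x, y)
  begin
    -- "a = '1'" is a comparison that yields a boolean,
    -- which is the type an if statement requires.
    if a = '1' then
      z <= x;
    else
      z <= y;
    end if;
    -- Writing "if a then" here would be rejected under VHDL-87/93,
    -- because a is of type std_logic rather than boolean.
  end process;
end main;
```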

From a VLSI perspective, a weak value will come from a smaller gate. One aspect of VHDL that we don’t touch on in E&CE 327 is “resolution”, which describes how to determine the value of a signal if the signal is driven by more than one process. (In E&CE 327, we restrict ourselves to having each signal be driven by (be the target of) exactly one process.) The std_logic_1164 package provides a resolution function to deal with situations where different processes drive the same signal with different values. In this situation, a strong value (e.g. '1') will overpower a weak value (e.g. 'L'). If two processes drive the signal with different strong values (e.g. '1' and '0') the signal resolves to a strong unknown ('X'). If a signal is driven with two different weak values (e.g. 'H' and 'L'), the signal resolves to a weak unknown ('W').
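The three resolution cases described above can be checked with a short simulation. This is a sketch for illustration only: the entity name resolve_demo and the signal names are hypothetical, and driving one signal from several sources goes against the one-driver-per-signal convention of the course.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical simulation-only demonstration of std_logic resolution.
entity resolve_demo is
end resolve_demo;

architecture sim of resolve_demo is
  signal s1, s2, s3 : std_logic;
begin
  -- Each concurrent assignment is a separate driver of its target.
  s1 <= '1';  s1 <= 'L';   -- strong 1 against weak 0
  s2 <= '1';  s2 <= '0';   -- strong 1 against strong 0
  s3 <= 'H';  s3 <= 'L';   -- weak 1 against weak 0

  check : process
  begin
    wait for 1 ns;
    assert s1 = '1' report "strong value should overpower weak value" severity error;
    assert s2 = 'X' report "conflicting strong values should give X" severity error;
    assert s3 = 'W' report "conflicting weak values should give W" severity error;
    wait;  -- suspend forever
  end process;
end sim;
```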

1.2 Comparison of VHDL to Other Hardware Description Languages

1.2.1 VHDL Disadvantages

• Some VHDL programs cannot be synthesized

• Different tools support different subsets of VHDL.

• Different tools generate different circuits for same code

• VHDL is verbose

– Many characters to say something simple

• VHDL is complicated and confusing

– Many different ways of saying the same thing

– Constructs that have similar purpose have very different syntax (case vs. select)

– Constructs that have similar syntax have very different semantics (variables vs signals)

• Hardware that is synthesized is not always obvious (when is a signal a flip-flop vs latch vs combinational)

– The infamous latch inference problem (See section 1.5.2 for more information)

1.2.2 VHDL Advantages

• VHDL supports unsynthesizable constructs that are useful in writing high-level models, testbenches and other non-hardware or non-synthesizable artifacts that we need in hardware design. VHDL can be used throughout a large portion of the design process in different capacities, from specification to implementation to verification.

• VHDL has static typechecking: many errors can be caught before synthesis and/or simulation. (In this respect, it is more similar to Java than to C.)

• VHDL has a rich collection of datatypes

• VHDL is a full-featured language with a good module system (libraries and packages).

• VHDL has a well-defined standard.


1.2.3 VHDL and Other Languages

1.2.3.1 VHDL vs Verilog

• Verilog is a “simpler” language: smaller language, simple circuits are easier to write

• VHDL has more features than Verilog

– richer set of data types and strong type checking

– VHDL offers more flexibility and expressivity for constructing large systems.

• The VHDL Standard is more standard than the Verilog Standard

– VHDL and Verilog have simulation-based semantics

– Simulation vendors generally conform to VHDL standard

– Some Verilog constructs give different behaviours in simulation and synthesis

• VHDL is used more than Verilog in Europe and Japan

• Verilog is used more than VHDL in North America

• VHDL is used more in FPGAs than in ASICs

• South-East Asia, India, South America: ?????

1.2.3.2 VHDL vs System Verilog

• System Verilog is a superset of Verilog. It extends Verilog to make it a full object-oriented hardware modelling language

• Syntax is based on Verilog and C++.

• As of 2007, System Verilog is used almost exclusively for test benches and simulation. Veryfew people are trying to use it to do hardware design.

• System Verilog grew out of Superlog, a proposed language that was based on Verilog and C. The basic core came from Verilog. C-like extensions were included to make the language more expressive and powerful. It was originally developed by the company Co-Design Automation and then standardized by Accellera, an organization aimed at standardizing EDA languages. Co-Design was purchased by Synopsys and now Synopsys is the leading proponent of System Verilog.

1.2.3.3 VHDL vs SystemC

• SystemC looks like C: familiar syntax

• C is often used in algorithmic descriptions of circuits, so why not try to use it for synthesizable code as well?

• If you think VHDL is hard to synthesize, try C....

• SystemC simulation is slower than advertised


1.2.3.4 Summary of VHDL Evaluation

• VHDL is far from perfect and has lots of annoying characteristics

• VHDL is a better language for education than Verilog because the static typechecking enforces good software engineering practices

• The richness of VHDL will be useful in creating concise high-level models and powerful testbenches

1.3 Overview of Syntax

This section is just a brief overview of the syntax of VHDL, focusing on the constructs that are most commonly used. For more information, read a book on VHDL and use online resources. (Look for “VHDL” under the “Documentation” tab in the E&CE 327 web pages.)

1.3.1 Syntactic Categories

There are five major categories of syntactic constructs. (There are many, many minor categories and subcategories of constructs.)

• Library units (section 1.3.2)

– Top-level constructs (packages, entities, architectures)

• Concurrent statements (section 1.3.4)

– Statements executed at the same time (in parallel)

• Sequential statements (section 1.3.7)

– Statements executed in series (one after the other)

• Expressions

– Arithmetic (section 1.10), Boolean, Vectors, etc

• Declarations

– Components, signals, variables, types, functions, ....

1.3.2 Library Units

Library units are the top-level syntactic constructs in VHDL. They are used to define and include libraries, declare and implement interfaces, define packages of declarations and otherwise bind together VHDL code.

• Package body

– define the contents of a library

• Packages

– determine which parts of the library are externally visible


• Use clause

– use a library in an entity/architecture or another package

– technically, use clauses are part of entities and packages, but they precede the entity/package keyword, so we list them as top-level constructs

• Entity (section 1.3.3)

– define interface to circuit

• Architecture (section 1.3.3)

– define internal signals and gates of circuit

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

[Figure: diagram with parts labelled entity and architecture]

Figure 1.1: Entity and Architecture

• Entity: interface

– names, modes (in / out), types of externally visible signals of circuit

• Architecture: internals

– structure and behaviour of module

    library ieee;
    use ieee.std_logic_1164.all;

    entity and_or is
      port (
        a, b, c : in  std_logic;
        z       : out std_logic
      );
    end and_or;

Figure 1.2: Example of an entity


The syntax of VHDL is defined using a variation on Backus-Naur forms (BNF).

    [ { use_clause } ]
    entity ENTITYID is
      [ port (
          { SIGNALID : ( in | out ) TYPEID [ := expr ] ; }
        ); ]
      [ { declaration } ]
    [ begin
      { concurrent_statement } ]
    end [ entity ] ENTITYID ;

Figure 1.3: Simplified grammar of entity

    architecture main of and_or is
      signal x : std_logic;
    begin
      x <= a AND b;
      z <= x OR (a AND c);
    end main;

Figure 1.4: Example of architecture

    [ { use_clause } ]
    architecture ARCHID of ENTITYID is
      [ { declaration } ]
    begin
      [ { concurrent_statement } ]
    end [ architecture ] ARCHID ;

Figure 1.5: Simplified grammar of architecture


1.3.4 Concurrent Statements

• Architectures contain concurrent statements

• Concurrent statements execute in parallel (Figure 1.6)

– Concurrent statements make VHDL fundamentally different from most software languages.

– Hardware (gates) naturally executes in parallel, and VHDL mimics the behaviour of real hardware.

– At each infinitesimally small moment of time, each gate:

1. samples its inputs

2. computes the value of its output

3. drives the output

    architecture main of bowser is
    begin
      x1 <= a AND b;
      x2 <= NOT x1;
      z  <= NOT x2;
    end main;

    architecture main of bowser is
    begin
      z  <= NOT x2;
      x2 <= NOT x1;
      x1 <= a AND b;
    end main;

[Circuit diagram: inputs a and b, internal signals x1 and x2, output z]

Figure 1.6: The order of concurrent statements doesn’t matter


conditional assignment

    . . . <= . . . when . . . else . . .;

  • normal assignment (. . . <= . . .)
  • if-then-else style (uses when)

    c <= a+b when sel='1' else a+c when sel='0' else "0000";

selected assignment

    with . . . select
      . . . <= . . . when . . . | . . .,
               . . . when . . . | . . .,
               . . .
               . . . when . . . | . . .;

  • case/switch style assignment

    with color select d <= "00" when red , "01" when . . .;

component instantiation

    . . .: . . . port map ( . . . => . . ., . . . );

  • use an existing circuit
  • section 1.3.5

    add1 : adder port map( a => f, b => g, s => h, co => i );

for-generate

    . . .: for . . . in . . . generate
      . . .
    end generate;

  • replicate some hardware

    bgen: for i in 1 to 7 generate b(i)<=a(7-i); end generate;

if-generate

    . . .: if . . . generate
      . . .
    end generate;

  • conditionally create some hardware

    okgen : if optgoal /= fast generate
      result <= ((a and b) or (d and not e)) or g;
    end generate;
    fastgen : if optgoal = fast generate
      result <= '1';
    end generate;

process

    process . . . begin
      . . .
    end process;

  • the body of a process is executed sequentially
  • Sections 1.3.6, 1.6

Figure 1.7: The most commonly used concurrent statements


1.3.5 Component Declaration and Instantiations

There are two different syntaxes for component declaration and instantiation. The VHDL-93 syntax is much more concise than the VHDL-87 syntax.

Not all tools support the VHDL-93 syntax. For E&CE 327, some of the tools that we use do not support the VHDL-93 syntax, so we are stuck with the VHDL-87 syntax.
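To make the difference between the two syntaxes concrete, the sketch below instantiates the and_or entity of Figures 1.2 and 1.4 in both styles. The wrapper entity inst_demo, its port names, and the instance labels are hypothetical, and the code assumes that and_or has already been compiled into library work.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper; assumes and_or (Figures 1.2 and 1.4)
-- has been compiled into library work.
entity inst_demo is
  port (
    p, q, r  : in  std_logic;
    y87, y93 : out std_logic
  );
end inst_demo;

architecture main of inst_demo is
  -- VHDL-87 style: the component must be declared before it is used.
  component and_or
    port (
      a, b, c : in  std_logic;
      z       : out std_logic
    );
  end component;
begin
  -- VHDL-87 style: instantiate the declared component.
  u87 : and_or port map (a => p, b => q, c => r, z => y87);

  -- VHDL-93 style: direct entity instantiation, with no
  -- component declaration needed.
  u93 : entity work.and_or(main) port map (a => p, b => q, c => r, z => y93);
end main;
```

Both instances describe the same hardware; the VHDL-93 form simply avoids repeating the port list in a component declaration.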

1.3.6 Processes

• Processes are used to describe complex and potentially unsynthesizable behaviour

• A process is a concurrent statement (Section 1.3.4).

• The body of a process contains sequential statements (Section 1.3.7)

• Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

    process (a, b, c)
    begin
      y <= a AND b;
      if (a = '1') then
        z1 <= b AND c;
        z2 <= NOT c;
      else
        z1 <= b OR c;
        z2 <= c;
      end if;
    end process;

    process
    begin
      y <= a AND b;
      z <= '0';
      wait until rising_edge(clk);
      if (a = '1') then
        z <= '1';
        y <= '0';
        wait until rising_edge(clk);
      else
        y <= a OR b;
      end if;
    end process;

Figure 1.8: Examples of processes

• Processes must have either a sensitivity list or at least one wait statement on each execution path through the process.

• Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List

The sensitivity list contains the signals that are read in the process.

A process is executed when a signal in its sensitivity list changes value.


An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. If you forget some signals, you will either end up with unpredictable hardware and simulation results (different results from different programs) or undesirable hardware (latches where you expected purely combinational hardware). For more on this topic, see sections 1.5.2 and 1.6.

There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
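The guideline and its exception can be illustrated side by side. This is a minimal sketch; the entity sens_demo and its signal names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity illustrating the sensitivity-list guideline.
entity sens_demo is
  port (
    clk, a, b, sel : in  std_logic;
    y_comb, y_reg  : out std_logic
  );
end sens_demo;

architecture main of sens_demo is
begin
  -- Combinational process: every signal that is read
  -- (a, b, sel) appears in the sensitivity list.
  comb : process (a, b, sel)
  begin
    if sel = '1' then
      y_comb <= a;
    else
      y_comb <= b;
    end if;
  end process;

  -- Flip-flop process: the exception to the rule. Only clk is
  -- listed, even though a and b are also read.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      y_reg <= a and b;
    end if;
  end process;
end main;
```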

    [ PROCLAB : ] process ( sensitivity_list )
      [ { declaration } ]
    begin
      { sequential_statement }
    end process [ PROCLAB ] ;

Figure 1.9: Simplified grammar of process

1.3.7 Sequential Statements

Used inside processes and functions.

    wait                wait until . . .;
    signal assignment   . . . <= . . .;
    if-then-else        if . . . then . . . elsif . . . end if;
    case                case . . . is
                          when . . . | . . . => . . .;
                          when . . . => . . .;
                        end case;
    loop                loop . . . end loop;
    while loop          while . . . loop . . . end loop;
    for loop            for . . . in . . . loop . . . end loop;
    next                next . . .;

Figure 1.10: The most commonly used sequential statements


1.3.8 A Few More Miscellaneous VHDL Features

Some constructs that are useful and will be described in later chapters and sections:

report : print a message on stderr while simulating

assert : assertions about behaviour of signals, very useful with report statements.

generics : parameters to an entity that are defined at elaboration time.

attributes : predefined functions for different datatypes. For example: high and low indices of a vector.
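A small sketch that touches each of these four features is shown below. The entity misc_demo, the generic WIDTH, and the message text are hypothetical; Is_X is a function provided by std_logic_1164 that reports whether a value contains anything other than '0', '1', 'L', or 'H'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: WIDTH is a generic, fixed at elaboration time.
entity misc_demo is
  generic (WIDTH : positive := 8);
  port (
    d   : in  std_logic_vector(WIDTH - 1 downto 0);
    msb : out std_logic
  );
end misc_demo;

architecture main of misc_demo is
begin
  -- Attribute 'high gives the highest index of the vector d,
  -- so this line works for any value of WIDTH.
  msb <= d(d'high);

  -- assert with report: during simulation, the message is printed
  -- whenever d changes to a value containing an unknown bit.
  check : process (d)
  begin
    assert not Is_X(d)
      report "misc_demo: d contains an unknown value"
      severity warning;
  end process;
end main;
```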

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

    architecture main of tiny is
    begin
      b <= a;
    end main;

    architecture main of tiny is
    begin
      process (a) begin
        b <= a;
      end process;
    end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

    t <= <val1> when <cond>
         else <val2>;

Sequential Statements

    if <cond> then
      t <= <val1>;
    else
      t <= <val2>;
    end if;
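A concrete instance of this equivalence is a two-input multiplexer written both ways. The entity mux_pair and its port names are hypothetical; the two outputs should always carry the same value.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: the same multiplexer described twice.
entity mux_pair is
  port (
    sel, d0, d1   : in  std_logic;
    t_conc, t_seq : out std_logic
  );
end mux_pair;

architecture main of mux_pair is
begin
  -- Concurrent form: conditional assignment.
  t_conc <= d1 when sel = '1'
            else d0;

  -- Sequential form: if statement inside a process.
  process (sel, d0, d1)
  begin
    if sel = '1' then
      t_seq <= d1;
    else
      t_seq <= d0;
    end if;
  end process;
end main;
```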


1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

Concurrent Statements

    with <expr> select
      t <= <val1> when <choices1>,
           <val2> when <choices2>,
           <val3> when <choices3>;

Sequential Statements

    case <expr> is
      when <choices1> =>
        t <= <val1>;
      when <choices2> =>
        t <= <val2>;
      when <choices3> =>
        t <= <val3>;
    end case;

1.4.4 Coding Style

Code that’s easy to write with sequential statements, but difficult with concurrent:

Sequential Statements

    case <expr> is
      when <choice1> =>
        if <cond> then
          t <= <expr1>;
        else
          t <= <expr2>;
        end if;
      when <choice2> =>
        . . .
    end case;

Concurrent Statements

Overall structure:

    with <expr> select
      t <= ... when <choice1>,
           ... when <choice2>;

Failed attempt:

    with <expr> select
      t <= -- want to write:
           --   <val1> when <cond>
           --   else <val2>
           -- but conditional assignment
           -- is illegal here
           when c1,
           . . .
           when c2;

Concurrent statement with correct behaviour, but messy:

    t <= <expr1> when (<expr> = <choice1> AND <cond>)
         else <expr2> when (<expr> = <choice1> AND NOT <cond>)
         else . . .
         ;
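The coding-style trade-off described in this section can be seen in a small concrete design. The entity style_demo, its ports, and the particular choices and expressions are all hypothetical; the two outputs are intended to behave identically.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: one function written in both styles.
entity style_demo is
  port (
    op            : in  std_logic_vector(1 downto 0);
    en, a, b      : in  std_logic;
    t_seq, t_conc : out std_logic
  );
end style_demo;

architecture main of style_demo is
begin
  -- Sequential style: an if statement nests naturally inside a case.
  process (op, en, a, b)
  begin
    case op is
      when "00" =>
        if en = '1' then
          t_seq <= a;
        else
          t_seq <= b;
        end if;
      when "01" =>
        t_seq <= a and b;
      when others =>
        t_seq <= '0';
    end case;
  end process;

  -- Concurrent style: the nesting has to be flattened into one
  -- conditional assignment, repeating the test on op.
  t_conc <= a         when (op = "00" and en = '1')
       else b         when (op = "00" and en /= '1')
       else (a and b) when (op = "01")
       else '0';
end main;
```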


1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.

• Within a process, statements are executed almost sequentially

• Among processes, execution is done in parallel

• Remember: a process is a concurrent statement!

    entity ENTITYID is
      interface declarations
    end ENTITYID;

    architecture ARCHID of ENTITYID is
    begin
      concurrent statements      ⇐=
      process begin
        sequential statements    ⇐=
      end process;
      concurrent statements      ⇐=
    end ARCHID;

Figure 1.11: Sequential statements in a process

Key concepts in VHDL semantics for processes:

• VHDL mimics hardware

• Hardware (gates) execute in parallel

• Processes execute in parallel with each other

• All possible orders of executing processes must produce the same simulation results (waveforms)

• If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms

It doesn’t matter whether you are running on a single-threaded operating system, on a multithreaded operating system, on a massively parallel supercomputer, or on a special hardware emulator with one FPGA chip per VHDL process: all simulations must be the same.

These concepts are the motivation for the semantics of executing processes in VHDL (Section 1.6) and lead to the phenomenon of latch-inference (Section 1.5.2).


    architecture
      procA: process
        stmtA1;
        stmtA2;
        stmtA3;
      end process;
      procB: process
        stmtB1;
        stmtB2;
      end process;

Execution sequences:

    single threaded, procA before procB:         A1, A2, A3, B1, B2
    single threaded, procB before procA:         B1, B2, A1, A2, A3
    multithreaded, procA and procB in parallel:  A1, A2, A3 alongside B1, B2

Figure 1.12: Different process execution sequences

Figure 1.13: All execution orders must have same behaviour

Sections 1.5.1–1.5.3 discuss the hardware generated by processes.

Sections 1.6–1.6.7 discuss the behaviour and execution of processes.


1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked. Some synthesizable processes that do not conform to our coding guidelines are both combinational and clocked. For example, in a flip-flop with an asynchronous reset, the output is a combinational function of the reset signal and a clocked function of the data input signal. We will deal only with processes that follow our coding conventions, and so we will continue to say that each process is either combinational xor clocked.

Combinational process:

• Executing the process takes part of one clock cycle

• Target signals are outputs of combinational circuitry

• A combinational process must have a sensitivity list

• A combinational process must not have any wait statements

• A combinational process must not have any rising_edges or falling_edges

• The hardware for a combinational process is just combinational circuitry

Clocked process:

• Executing the process takes one (or more) clock cycles

• Target signals are outputs of flops

• Process contains one or more wait or if rising_edge statements

• Hardware contains combinational circuitry and flip flops

Note: Clocked processes are sometimes called “sequential processes”, but this can be easily confused with “sequential statements”, so in E&CE 327 we’ll refer to synthesizable processes as either “combinational” or “clocked”.

Example Processes

Combinational Process

    process (a,b,c)
    begin
      p1 <= a;
      if (b = c) then
        p2 <= b;
      else
        p2 <= a;
      end if;
    end process;


Clocked Processes

    process
    begin
      wait until rising_edge(clk);
      b <= a;
    end process;

    process (clk)
    begin
      if rising_edge(clk) then
        b <= a;
      end if;
    end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

    process (a, b, c)
    begin
      if (a = '1') then
        z1 <= b;
        z2 <= b;
      else
        z1 <= c;
      end if;
    end process;

[Circuit diagram: inputs a, b, and c; outputs z1 and z2]

Figure 1.14: Example of latch inference

When a signal’s value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.

If you want a latch or a flip-flop for the signal, then latch inference is good.

If you want combinational circuitry, then latch inference is bad.
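One common way to keep a process such as the one in Figure 1.14 purely combinational is to give every target signal a default value at the top of the process, so that it is assigned on every pass. The sketch below applies this to z2; the entity name no_latch and the choice of '0' as the default are assumptions made for illustration, not part of the original example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: Figure 1.14 reworked to avoid a latch on z2.
entity no_latch is
  port (
    a, b, c : in  std_logic;
    z1, z2  : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment (assumed value): z2 now receives a value
    -- on every pass, so nothing has to be remembered.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- overrides the default when a = '1'
    else
      z1 <= c;
    end if;
  end process;
end main;
```

An equivalent alternative is to add an explicit assignment to z2 in the else branch.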

Loop, Latch, Flop

[Figure: three circuits, each with inputs a and b and output z. Combinational loop. Latch: a drives the EN input. Flip-flop: b drives D, z is driven by Q, and a is the clock.]

Question: Write VHDL code for each of the above circuits

Answer:

combinational loop

    if a = '1' then
      z <= b;
    else
      z <= z;
    end if;

latch

    if a = '1' then
      z <= b;
    end if;

flop

    if rising_edge(a) then
      z <= b;
    end if;

Causes of Latch Inference

Usually, “latch inference” refers to the unintentional creation of latches.

The most common cause of unintended latch inference is missing assignments to signals in if-then-else and case statements.

Latch inference happens during elaboration. When using the Synopsys tools, look for:

    Inferred memory devices

in the output or log files.


1.5.3 Combinational vs Flopped Signals

Signals assigned to in combinational processes are combinational.

Signals assigned to in clocked processes are outputs of flip-flops.

1.6 Details of Process Execution

In this section we go through the detailed semantics of how processes execute. These semantics form the foundation for the simulation and synthesis of VHDL. The semantics define the simulation behaviour, and the duty of synthesis is to produce hardware that has the same behaviour as the simulation of the original VHDL code.

1.6.1 Simple Simulation

Before diving into the details of processes, we briefly review gate-level simulation with a simple example, which we will then explore in excruciating detail through the semantics of VHDL.

With knowledge of just basic gate-level behaviour, we simulate the circuit below with waveforms for a and b and calculate the behaviour for c, d, and e.

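[Figure: circuit with inputs a and b and signals c, d, and e, with waveforms for a, b, c, d, and e; time marks at 0ns, 10ns, 12ns, and 15ns.]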

Different Programs, Same Behaviour

There are many different VHDL programs that will synthesize to this circuit. Three examples are:


Example 1:

process (a,b)
begin
  c <= a and b;
end process;
process (b,c,d)
begin
  d <= not c;
  e <= b and d;
end process;

Example 2:

process (a,b,c,d)
begin
  c <= a and b;
  d <= not c;
  e <= b and d;
end process;

Example 3:

process (a,b)
begin
  c <= a and b;
end process;
process (c)
begin
  d <= not c;
end process;
process (b,d)
begin
  e <= b and d;
end process;

The goal of the VHDL semantics is that all of these programs will have the same behaviour. The two main challenges to make this happen are: a value change on a signal must propagate instantaneously, and all gates must operate in parallel. We will return to these points in section 1.6.3.

1.6.2 Temporal Granularities of Simulation

There are several different granularities of time to analyze VHDL behaviour. In this course, we will discuss three major granularities: clock cycles, timing simulation, and “delta cycles”.

clock-cycle

• smallest unit of time is a clock cycle

• combinational logic has zero delay

• flip-flops have a delay of one clock cycle

• used for simulation early in the design cycle

• fastest simulation run times

timing simulation

• smallest unit of time is a nanosecond, picosecond, or femtosecond

• combinational logic and wires have delay as computed by timing analysis tools

• flip-flops have setup, hold, and clock-to-Q timing parameters

• used for simulation when fine-tuning the design and confirming that timing constraints are satisfied

• slow simulation times for large circuits

delta cycles

• units of time are artifacts of VHDL semantics and simulation software

• simulation cycles, delta cycles, and simulation steps are infinitesimally small amounts of time

• VHDL semantics are defined in terms of these concepts

In assignments and exams, you will need to be able to simulate VHDL code at each of the three different levels of temporal granularity. In the laboratories and project, you will use simulation programs for both clock-cycle simulation and timing simulation. We don’t have access to a program that will produce delta-cycle waveforms, but if anyone is looking for a challenging co-op job or fourth-year design project....

For the remainder of section 1.6, we’ll look at only the delta-cycle view of the world.

1.6.3 Intuition Behind Delta-Cycle Simulation

Zero-delay simulation might appear to be simpler than simulation with delays through gates (timing simulation), but in reality, zero-delay simulation algorithms are more complicated than algorithms for timing simulation. The reason is that in zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through the combinational circuitry.

Two fundamental rules for zero-delay simulation:

1. events appear to propagate through combinational circuitry instantaneously.

2. all of the gates appear to operate in parallel

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time. In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.

Because software executes in serial, a simulator cannot run/simulate multiple gates in parallel. Instead, the simulator must simulate the gates one at a time, but make the waveforms appear as if all of the gates were simulated in parallel. In each delta cycle, the simulator will simulate any gate whose input changed in the previous delta cycle. To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.

1.6.4 Definitions and Algorithm

1.6.4.1 Process Modes

An architecture contains a set of processes. Each process is in one of the following modes: active, suspended, or postponed.

Note: “postponed” This use of the word “postponed” differs from that in the VHDL Standard. We won’t be using postponed processes as defined in the Standard.

Note: “postponed” “Postponed” in VHDL terminology is a synonym for some operating-systems’ usage of “ready” to describe a process that is ready to execute.


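[State diagram with three modes: postponed, active, and suspended. Transitions: activate (postponed to active), suspend (active to suspended), resume (suspended to postponed).]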

• Suspended

– Nothing to currently execute

– A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement

• Postponed

– Wants to execute, but not currently active

– A process stays postponed until the simulator chooses it from the pool of postponed processes

• Active

– Currently executing

– A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

Figure 1.15: Process modes

1.6.4.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in Section 12.6 of the VHDL Standard. The most significant simplification is that this algorithm does not support delayed assignments. To support delayed assignments, each signal’s provisional value would be generalized to an event wheel, which is a list containing the times and values for multiple provisional assignments in the future.

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm

Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).


1. While there are postponed processes:

(a) Pick one or more postponed processes to execute (become active).

(b) As a process executes, assignments to signals are provisional: new values do not become visible until step 3.

(c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends. At a wait statement, the process will suspend even if the condition is true during the current simulation cycle.

(d) Processes that become suspended stay suspended until there are no more postponed or active processes.

2. Each process looks at signals that changed value (provisional value differs from visible value) and at the simulation time. If a signal in a process’s sensitivity list changed value, or if the wait condition on which a process is suspended became true, then the process resumes (becomes postponed).

3. Each signal that changed value is updated with its provisional value (the provisional value becomes visible).

4. If there are no postponed processes, then increment simulation time to the next scheduled event.

Note: Parallel execution In n-threaded execution, at most n processes are active at a time.
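The effect of provisional assignments (steps 1(b) and 3) can be observed directly in a simulator. The following self-checking sketch swaps two signals inside one process. The entity name swap_tb and the signals x and y are made up for this illustration. The statement wait for 0 ns suspends the process for one delta cycle, which gives the simulator a chance to update the signals.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; not part of the original notes.
entity swap_tb is
end swap_tb;

architecture main of swap_tb is
  signal x : std_logic := '0';
  signal y : std_logic := '1';
begin
  process
  begin
    x <= y;   -- provisional: x is still '0' while the process is active
    y <= x;   -- reads the old (visible) value of x
    assert x = '0' and y = '1'
      report "assignments became visible too early" severity error;

    wait for 0 ns;   -- suspend for one delta cycle; signals are updated

    assert x = '1' and y = '0'
      report "signals were not swapped" severity error;
    wait;            -- suspend forever
  end process;
end main;
```

Neither assertion should fire: both assignments read the values that were visible when the process became active, so the two signals exchange values.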


1.6.4.3 Delta-Cycle Definitions

Definition simulation step: Executing one sequential assignment or process mode change.

Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition delta cycle: A simulation cycle that does not advance simulation time. Equivalently: A simulation cycle with zero-delay assignments where the assignment causes a process to resume.

Definition simulation round: A sequence of simulation cycles that all have the same simulation time. Equivalently: a contiguous sequence of zero or more delta cycles followed by a simulation cycle that increments time (i.e., the simulation cycle is not a delta cycle).

Note: Official and unofficial terminology Simulation cycle and delta cycle are official definitions in the VHDL Standard. Simulation step and simulation round are not standard definitions. They are used in E&CE 327 because we need words to associate with the concepts that they describe.
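To see that several delta cycles can elapse within one simulation round, consider a chain of two inverters. In the sketch below (the entity name delta_chain_tb and the signals a, b, and c are made up for this illustration), a change on a needs more than one delta cycle to reach c, yet the simulation time reported by now does not advance.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench; not part of the original notes.
entity delta_chain_tb is
end delta_chain_tb;

architecture main of delta_chain_tb is
  signal a    : std_logic := '0';
  signal b, c : std_logic;
begin
  -- Each concurrent assignment behaves like a small process.
  b <= not a;
  c <= not b;

  process
  begin
    wait for 5 ns;
    a <= '1';
    wait on c;   -- resume once the change has rippled through to c
    assert now = 5 ns
      report "simulation time advanced during the delta cycles"
      severity error;
    assert b = '0' and c = '1'
      report "unexpected values after propagation" severity error;
    wait;
  end process;
end main;
```

All of the delta cycles between the assignment to a and the event on c belong to the simulation round at 5 ns.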


1.6.5 Example 1: Process Execution (Bamboozle)

This example (Bamboozle) and the next example (Flummox, section 1.6.6) are very similar. The VHDL code for the circuit is slightly different, but the hardware that is generated is the same. The stimulus for signals a and b also differs.

library ieee;
use ieee.std_logic_1164.all;

entity bamboozle is
begin
end bamboozle;

architecture main of bamboozle is
  signal a, b, c, d, e : std_logic;
begin
  procA : process (a, b) begin
    c <= a AND b;
  end process;
  procB : process (b, c, d) begin
    d <= NOT c;
    e <= b AND d;
  end process;
  procC : process begin
    a <= '0';
    b <= '1';
    wait for 10 ns;
    a <= '1';
    wait for 2 ns;
    b <= '0';
    wait for 3 ns;
    a <= '0';
    wait for 20 ns;
  end process;
end main;

Figure 1.16: Example bamboozle circuit for process execution

The step-by-step diagrams for this example are shown in the slides, not in the notes. Each diagram shows the circuit with the current signal values, the code of procA, procB, and procC annotated with the mode of each process (P for postponed, A for active, S for suspended), and the waveforms for a, b, c, d, and e, together with rows marking the simulation rounds, simulation cycles, and delta cycles. The diagrams appear in the following order:

• Initial conditions

• Step 1(a): Activate procA

• Step 1(c): Suspend procA

• Step 1(a): Activate procC

• Step 1(b): Provisional assignment to a

• Step 1(b): Provisional assignment to b

• Step 1(a): Activate procB

• Step 1(b): Provisional assignment to d

• Step 1(b): Provisional assignment to e

• Step 1(c): Suspend procB

• All processes suspended

• Step 3: Update signal values

• Step 4: Simulation time remains at 0 ns (delta cycle)

• Step 1(a): Activate procA

• Step 1(b): Provisional assignment to c

• Step 1(c): Suspend procA

• Step 4: Simulation time remains at 0 ns (delta cycle)

• Compact simulation cycle

• Begin next simulation cycle

• All processes suspended

• Compact simulation cycle

• Begin next simulation cycle

• Step 1(c): Suspend procB

• Step 1: No postponed processes

• Step 1(a): Activate procC

• Step 1(b): Provisional assignment to a

• Step 1(c): Suspend procC

• Step 2: Check sensitivity list; resume processes

1.6.6 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.5.

library ieee;
use ieee.std_logic_1164.all;

entity flummox is
begin
end flummox;

architecture main of flummox is
  signal a, b, c, d, e : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;
  end process;
  proc2 : process (b, d) begin
    e <= b AND d;
  end process;
  proc3 : process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;
end main;

Figure 1.17: Example flummox circuit for process execution

[Delta-cycle timing diagram: waveforms for a, b, c, d, and e; modes of proc1, proc2, and proc3; and rows marking the delta cycles, simulation cycles, and simulation rounds. Time axis: 0ns, +1δ, +2δ, +3δ, 3ns, +1δ, +2δ, +3δ, 102ns.]

To get a more natural view of the behaviour of the signals, we draw just the waveforms and use a timescale of nanoseconds plus delta cycles:

[Waveform diagram for a, b, c, d, and e. Time axis: 0ns, +1δ, +2δ, +3δ, 3ns, +1δ, +2δ, +3δ, 102ns.]

Finally, we draw the behaviour of the signals using the standard time scale of nanoseconds. Notice that the delta cycles within a simulation round all collapse to the left, so the signals change value exactly at the nanosecond boundaries. Also, the glitch on e disappears.

Answer:

[Waveform diagram for a, b, c, d, and e on a nanosecond time scale: 0ns, 1ns, 2ns, 3ns, 4ns, 100ns, 101ns, 102ns.]
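The collapsed view can be checked in a simulator with a self-checking version of the example. The sketch below wraps the flummox processes in a testbench. The entity name flummox_tb and the check and monitor processes are additions made for illustration; they are not part of the original example. The expected values in the assertions were worked out by hand from the simulation algorithm of section 1.6.4.2.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking wrapper around the flummox processes.
entity flummox_tb is
end flummox_tb;

architecture main of flummox_tb is
  signal a, b, c, d, e : std_logic;
begin
  proc1 : process (a, b, c) begin
    c <= a and b;
    d <= not c;
  end process;

  proc2 : process (b, d) begin
    e <= b and d;
  end process;

  proc3 : process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;

  -- Added for illustration: checks the stable values between rounds.
  check : process begin
    wait for 1 ns;   -- after the simulation round at 0 ns
    assert c = '0' and d = '1' and e = '0'
      report "unexpected values at 1 ns" severity error;
    wait for 3 ns;   -- now at 4 ns, after the simulation round at 3 ns
    assert c = '1' and d = '0' and e = '0'
      report "unexpected values at 4 ns" severity error;
    wait;
  end process;

  -- Added for illustration: reports every value that e takes,
  -- including the glitch that lasts only for delta cycles at 3 ns.
  monitor : process (e) begin
    report "e = " & std_logic'image(e) & " at " & time'image(now);
  end process;
end main;
```

Because proc3 never stops, the simulation has to be run for a fixed amount of time, for example 10 ns. The monitor process reports the short-lived '1' on e at 3 ns, even though e is '0' at both 1 ns and 4 ns.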

Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.

Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Answer: simulation step, delta cycle, simulation cycle, simulation round

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

Answer: Same order as listed just above. Note: delta cycles have a finer granularity than simulation cycles, because delta cycles do not advance time, while simulation cycles that are not delta cycles do advance time.

1.6.7 Example: Need for Provisional Assignments

This is an example of processes where updating signals during a simulation cycle leads to different results for different process execution orderings.

architecture main of swindle is
begin
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
end main;

Figure 1.18: Circuit to illustrate need for provisional assignments (inputs a and b; signals c and d)

1. Start with all signals at '0'.

2. Simultaneously change to a = '1' and b = '1'.

If assignments are not visible within the same simulation cycle (correct; i.e., provisional assignments are used):

• If p_c is scheduled before p_d, then d will have a '1' pulse.

• If p_d is scheduled before p_c, then d will have a '1' pulse.

If assignments are visible within the same simulation cycle (incorrect):

• If p_c is scheduled before p_d, then d will stay constant '0'.

• If p_d is scheduled before p_c, then d will have a '1' pulse.

[Each of the four cases is drawn as a timing diagram with waveforms for a, b, c, and d and the modes of p_c and p_d.]

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
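A standard VHDL simulator uses provisional assignments, so the '1' pulse on d should appear regardless of the order in which the simulator happens to run p_c and p_d. The sketch below is one way to check this. The entity name swindle_tb, the pulses counter, and the count and stim processes are assumptions added for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking wrapper around the swindle processes.
entity swindle_tb is
end swindle_tb;

architecture main of swindle_tb is
  signal a, b, c, d : std_logic := '0';  -- step 1: all signals at '0'
  signal pulses     : natural   := 0;    -- number of '1' pulses seen on d
begin
  p_c : process (a, b) begin
    c <= a and b;
  end process;

  p_d : process (a, c) begin
    d <= a xor c;
  end process;

  -- Added for illustration: counts each time d becomes '1'.
  count : process (d) begin
    if d = '1' then
      pulses <= pulses + 1;
    end if;
  end process;

  -- Added for illustration: applies step 2 and checks the result.
  stim : process begin
    wait for 10 ns;
    a <= '1';   -- step 2: a and b change in the same simulation cycle
    b <= '1';
    wait for 10 ns;
    assert c = '1' and d = '0'
      report "unexpected final values of c and d" severity error;
    assert pulses = 1
      report "expected exactly one pulse on d" severity error;
    wait;
  end process;
end main;
```

The pulse lasts for a single delta cycle, so it is invisible on a nanosecond time scale; the counter is what makes it observable here.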


1.6.8 Delta-Cycle Simulations of Flip-Flops

This example illustrates the delta-cycle simulation of a flip-flop. Notice how the delta-cycle simulation captures the expected behaviour of the flip-flop: the signal q changes at the same time (10 ns) as the rising edge on the clock.

p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;

p_clk : process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;

flop : process ( clk ) begin
  if rising_edge( clk ) then
    q <= a;
  end if;
end process;

[Delta-cycle timing diagram: waveforms for a, clk, and q; modes of the flop, p_a, and p_clk processes; and rows marking the simulation rounds, simulation cycles, and delta cycles. Time axis: 0ns, 0ns+1δ, 10ns, 15ns, 20ns, 30ns, 35ns.]

Redraw with Normal Time Scale

To clarify the behaviour, we redraw the same simulation using a normal time scale.

[Waveform diagram for a, clk, and q on a normal time scale, from 0ns to 35ns in steps of 5ns.]

Back-to-Back Flops

In the previous simulation, the input to the flip-flop (a) changed several nanoseconds before the rising edge on the clock. In zero-delay simulation, the output of a flip-flop changes exactly on the rising edge of the clock. This means that the input to the next flip-flop will change at exactly the same time as a rising edge. This example illustrates how delta-cycle simulation handles the situation correctly.

p_a : process begin
  a <= '0';
  wait for 15 ns;
  a <= '1';
  wait for 20 ns;
end process;

p_clk : process begin
  clk <= '0';
  wait for 10 ns;
  clk <= '1';
  wait for 10 ns;
end process;

flops : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
    q2 <= q1;
  end if;
end process;

[Delta-cycle timing diagram: waveforms for a, clk, q1, and q2; modes of the flops, p_a, and p_clk processes; and rows marking the simulation rounds, simulation cycles, and delta cycles. Time axis: 10ns, 15ns, 20ns, 30ns, 35ns.]

To clarify the behaviour, we redraw the same simulation using a normal time scale.

[Waveform diagram for a, clk, q1, and q2 on a normal time scale, from 0ns to 35ns in steps of 5ns.]
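The behaviour of the back-to-back flops can also be confirmed with a few assertions. In the sketch below, the entity name flops_tb and the check process are additions made for illustration. The expected values follow from the rule that both assignments in the flops process read the values that were visible before the clock edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking wrapper around the back-to-back flops.
entity flops_tb is
end flops_tb;

architecture main of flops_tb is
  signal a, clk, q1, q2 : std_logic;
begin
  p_a : process begin
    a <= '0';
    wait for 15 ns;
    a <= '1';
    wait for 20 ns;
  end process;

  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  flops : process (clk) begin
    if rising_edge(clk) then
      q1 <= a;
      q2 <= q1;   -- reads the value q1 had before this clock edge
    end if;
  end process;

  -- Added for illustration: q2 lags q1 by one clock cycle.
  check : process begin
    wait for 11 ns;   -- just after the rising edge at 10 ns
    assert q1 = '0' and q2 = 'U'
      report "unexpected values after the first rising edge"
      severity error;
    wait for 20 ns;   -- now at 31 ns, after the rising edge at 30 ns
    assert q1 = '1' and q2 = '0'
      report "unexpected values after the second rising edge"
      severity error;
    wait;
  end process;
end main;
```

The checks stop at 31 ns on purpose: later in the simulation, a changes at the same time as a rising edge of clk, which is the situation discussed under Testbenches and Clock Phases.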


External Inputs and Flops

In our work so far with delta-cycle simulation, we have worked through the mechanics of simulation. This example applies knowledge of delta-cycle simulation at a conceptual level. We could answer the question by thinking about the semantics of delta-cycle simulation or by mechanically doing the simulation.

Question: Do the signals b1 and b2 have the same behaviour from 20–30 ns?

architecture mathilde of sauvé is
  signal clk, a1, a2, b1, b2 : std_logic;
begin
  process begin
    clk <= '1';
    wait for 10 ns;
    clk <= '0';
    wait for 10 ns;
  end process;
  process begin
    wait for 20 ns;
    a1 <= '1';
  end process;
  process begin
    wait until rising_edge(clk);
    a2 <= '1';
  end process;
  process begin
    wait until rising_edge( clk );
    b1 <= a1;
    b2 <= a2;
  end process;
end architecture;

Answer:

The signals b1 and b2 will have the same behaviour if a1 and a2 have the same behaviour. The difference in the code between a1 and a2 is that a1 is waiting for 20 ns and a2 is waiting until a rising edge of the clock. There is a rising edge of the clock at 20 ns, so we might be tempted to conclude (incorrectly) that both a1 and a2 transition from 'U' to '1' at exactly 20 ns and therefore have exactly the same behaviour.

The difference between the behaviour of a1 and a2 is that in the first simulation cycle for 20 ns, the process for a1 becomes postponed, while the process for a2 becomes postponed only after the rising edge of the clock.

The signal a1 is waiting for 20 ns, so in the first simulation cycle for 20 ns, the process for a1 becomes postponed. In the second simulation cycle for 20 ns, the clock toggles from '0' to '1' and a1 toggles from 'U' to '1'. The rising edge on the clock causes the process for a2 and the process for b1 and b2 to become postponed.

In the third simulation cycle for 20 ns:

• a2 toggles from 'U' to '1'.

• b1 sees the value of '1' for a1, because a1 became '1' in the second simulation cycle.

• b2 sees the old value of 'U' for a2, because the process for a2 did not run in the second simulation cycle.

So b1 and b2 do not have the same behaviour from 20–30 ns: b1 becomes '1', while b2 remains 'U'.

[Delta-cycle timing diagram: waveforms for clk, a1, a2, b1, and b2; modes of proc_clk, proc_a1, proc_a2, and proc_b; and rows marking the simulation rounds, simulation cycles, and delta cycles. Time axis: 0ns, 10ns, 20ns, 20ns+1δ, 20ns+2δ, 30ns.]

Testbenches and Clock Phases

env : process begin
  a <= '1';
  clk <= '0';
  wait for 10 ns;
  a <= '0';
  clk <= '1';
  wait for 10 ns;
end process;

flop : process ( clk ) begin
  if rising_edge( clk ) then
    q1 <= a;
  end if;
end process;

[Delta-cycle timing diagram: waveforms for a, clk, and q1; modes of the env and flop processes; and rows marking the simulation rounds, simulation cycles, and delta cycles. Time axis: 0ns, 0ns+1δ, 10ns, 20ns. The same simulation is then redrawn on a normal time scale from 0ns to 20ns.]


Note: Testbench signals For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

[Three waveform diagrams for a, clk, and q1, each from 0ns to 60ns:]

• a is output of clocked or combinational process

• a is output of timed process (testbench or environment): POOR DESIGN

• a is output of timed process (testbench or environment): GOOD DESIGN
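One way to follow this advice is to change the testbench signals on the falling edge of the clock, half a period away from the rising edge that the flip-flop uses. The sketch below is an illustration under that assumption, not necessarily the arrangement drawn in the diagrams above. The entity name phase_tb and the check process are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench: stimulus changes on the falling edge of clk.
entity phase_tb is
end phase_tb;

architecture main of phase_tb is
  signal a, clk, q1 : std_logic;
begin
  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  -- a changes only at falling edges (20 ns, 40 ns, ...), so it is
  -- stable around every rising edge (10 ns, 30 ns, ...).
  env : process begin
    a <= '1';
    wait until falling_edge(clk);
    a <= '0';
    wait until falling_edge(clk);
  end process;

  flop : process (clk) begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;

  -- Added for illustration: q1 captures the value a had before the edge.
  check : process begin
    wait for 11 ns;   -- just after the rising edge at 10 ns
    assert q1 = '1'
      report "q1 should have captured a = '1'" severity error;
    wait for 20 ns;   -- now at 31 ns, after the rising edge at 30 ns
    assert q1 = '0'
      report "q1 should have captured a = '0'" severity error;
    wait;
  end process;
end main;
```

With this arrangement, the value captured by the flip-flop does not depend on how a particular simulator orders events that occur at the same time as the clock edge.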


1.7 Register-Transfer-Level Simulation

Delta-cycle simulation is very tedious for both humans and computers. For many circuits, the complexity of delta-cycle simulation is not needed and register-transfer-level simulation, which is much simpler, can be used instead.

The major complexities of delta-cycle simulation come from running a process multiple times within a single simulation round and keeping track of the modes of the processes. Register-transfer-level simulation avoids both of these complexities. By evaluating each signal only once per simulation round, an entire simulation round can be reduced to a single column in a timing diagram. The disadvantage of register-transfer-level simulation is that it does not work for all VHDL programs; in particular, it does not support combinational loops.

[Figure: the flummox example simulated in two ways. Delta cycle simulation: waveforms for a, b, c, d, and e and the modes of proc1, proc2, and proc3, with columns for 0ns, 3ns, and 102ns and the delta cycles within each simulation round. RTL simulation: waveforms for a, b, c, d, and e with one column per simulation round on a time axis of 0ns, 1ns, 2ns, 3ns, 102ns.]

1.7.1 Overview

In delta-cycle simulations, we often simulated the same process multiple times within the same simulation round. In looking at the circuit though, we can mentally calculate the output value by evaluating each gate only once per simulation round. For both humans and computers (or the humans waiting for results from computers), it is desirable to avoid the wasted work of simulating a gate when the output will remain at 'U' or will change again later in the same simulation round.

In register-transfer-level simulation, we evaluate each gate only once per simulation round. Register-transfer-level simulation is simpler and faster than delta-cycle simulation, because it avoids delta cycles and provisional assignments.

In delta-cycle simulation, we evaluate a gate multiple times in a single simulation round if the process that drives the gate is active in multiple simulation cycles, which happens when the process is triggered in multiple simulation cycles. To avoid this, we must evaluate a signal only after all of the signals that it depends on have stable values, that is, the signals will not change value later in the simulation round.

A combinational loop is a circuit that contains a cyclic path through the circuit that includes only combinational gates. Combinational loops can cause signals to oscillate, which in delta-cycle simulation with zero-delay assignments, corresponds to an infinite sequence of delta cycles. We immediately see that when doing zero-delay simulation of a combinational loop such as a <= not(a);, the change on a will trigger the process to re-run and re-evaluate a an infinite number of times. Hence, register-transfer-level simulation does not support combinational loops.

To make register-transfer simulation work, we preprocess the VHDL program and transform it so that each process is dependent upon only those processes that appear before it. This dependency ordering is called topological ordering. If a circuit has combinational loops, we cannot sort the processes into a topological order.

The register-transfer level is a coarser level of temporal abstraction than the delta-cycle level. In delta-cycle simulation, many delta cycles can elapse without an increment in real time (e.g., nanoseconds). In register-transfer-level simulation, all of the events that take place in the same moment of real time take place at the same moment in the simulation. In other words, all of the events that take place at the same time are drawn in the same column of the waveform diagram.

Register-transfer-level simulation can be done for legal VHDL code, either synthesizable or unsynthesizable, so long as the code does not contain combinational loops. For any piece of VHDL code without combinational loops, the register-transfer-level simulation and the delta-cycle simulation will have the same value for each signal at the end of each simulation round.

By sorting the processes in topological order, when we execute a process, all of the signals that the process depends on will have already been evaluated, and so we know that we are reading the final, stable values that each signal will have for that moment in time. This is good, because for most processes, we want to read the most recent values of signals. The exceptions are timed processes that are dependent upon other timed processes running at the same moment in time and clocked processes that are dependent upon other clocked processes.

process begin
  a <= '0';
  wait for 10 ns;
  a <= '1';
  ...
end process;

process begin
  b <= '0';
  wait for 10 ns;
  b <= a;
  ...
end process;

Question: In this code, what value should b have at 10 ns?

Answer: Both processes will execute in the same simulation cycle at 10 ns. The statement b <= a will see the value of a from the previous simulation cycle, which is before a <= '1'; is evaluated. The signal b will be '0' at 10 ns.
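This answer can be checked with a complete version of the two processes. In the sketch below, the elided parts of the original processes are replaced with a plain wait statement, which is an assumption. The entity name timed_tb, the process labels, and the check process are also additions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking version of the two timed processes.
entity timed_tb is
end timed_tb;

architecture main of timed_tb is
  signal a, b : std_logic;
begin
  p_a : process begin
    a <= '0';
    wait for 10 ns;
    a <= '1';
    wait;   -- assumed: nothing further happens in this process
  end process;

  p_b : process begin
    b <= '0';
    wait for 10 ns;
    b <= a;   -- runs in the same simulation cycle as a <= '1'
    wait;     -- assumed: nothing further happens in this process
  end process;

  -- Added for illustration: b read the previous value of a.
  check : process begin
    wait for 11 ns;
    assert a = '1' and b = '0'
      report "b should have read the previous value of a"
      severity error;
    wait;
  end process;
end main;
```

The assertion should hold whichever of the two timed processes the simulator runs first, because the assignment to a is provisional until both processes have suspended.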

As the above example illustrates, if a timed process reads the values of signals from processes that resume at the same time, it must read the previous value of those signals. Similarly, if a clocked process reads the values of signals from processes that are sensitive to the same clock, those processes will all resume in the same simulation cycle: the cycle immediately after the rising edge of the clock (assuming that the processes use if rising_edge or wait until rising_edge statements). Because the processes run in the same simulation cycle, they all read the previous values of the signals that they depend on. If this were not the case, then the VHDL code for a pair of back-to-back flip-flops would not operate correctly, because the output of the first flip-flop would appear immediately at the output of the second flip-flop.
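The back-to-back flip-flop case mentioned above can be sketched as follows. This is a minimal illustration, not code from the course: the entity name two_flops and the signal name mid are assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: two back-to-back flip-flops (a two-stage shift register).
entity two_flops is
  port (
    clk : in  std_logic;
    d   : in  std_logic;
    q   : out std_logic
  );
end two_flops;

architecture main of two_flops is
  signal mid : std_logic;  -- output of the first flop (assumed name)
begin
  -- First flop: samples d on the rising edge of clk.
  process (clk) begin
    if rising_edge(clk) then
      mid <= d;
    end if;
  end process;

  -- Second flop: resumes in the same simulation cycle as the first,
  -- so it reads the value that mid had before this clock edge.
  process (clk) begin
    if rising_edge(clk) then
      q <= mid;
    end if;
  end process;
end main;
```

Because both processes read previous values, a change on d needs two rising edges to reach q, whichever process the simulator happens to run first.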

Simulation rounds begin with incrementing time, which triggers timed processes. Therefore, the first processes in the topological order are the timed processes. Timed processes may be run in any order, and they read the previous values of signals that they depend on. This gives the same effect as in delta-cycle simulation, where the timed processes would run in the same simulation cycle and read the values that signals had before the simulation cycle began.

We then sort the clocked and combinational processes based on their dependencies, so that each process appears (is run) after all of the processes on which it depends.

Although a clocked process may read many signals, we say that a clocked process is dependent upon only its clock signal. It is the change in the clock signal that causes the process to resume. So, as long as the process is run after the clock signal is stable, we can be sure that it will not need to be run again at this time step. Clocked processes may be run in any order. They read the current value of their clock signal and the previous value of the other signals that they depend on. As with timed processes, this gives the same effect as in delta-cycle simulation, where the clock edge would trigger the clocked processes to run in the same simulation cycle and the processes would read the values that signals had before the simulation cycle began.

1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing

   (a) Separate processes into combinational and non-combinational (clocked and timed)

   (b) Decompose each combinational process into separate processes with one target signal per process

   (c) Sort processes into topological order based on dependencies

2. For each clock cycle or unit of time:

   (a) Run non-combinational processes in any order. Non-combinational assignments read from the earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.

   (b) Run combinational processes in topological order. Combinational assignments read from the current clock cycle / time step.

Combinational Process Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Original code

    process (a, b, c) begin
      if a = '1' then
        d <= b;
        e <= c;
      else
        d <= not b;
        e <= b and c;
      end if;
    end process;

After decomposition

    process (a, b, c) begin
      if a = '1' then
        d <= b;
      else
        d <= not b;
      end if;
    end process;

    process (a, b, c) begin
      if a = '1' then
        e <= c;
      else
        e <= b and c;
      end if;
    end process;

1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.

1. Original code:

    proc1: process (a, b, c) begin
      d <= NOT c;
      c <= a AND b;
    end process;

    proc2: process (b, d) begin
      e <= b AND d;
    end process;

    proc3: process begin
      a <= '1';
      b <= '0';
      wait for 3 ns;
      b <= '1';
      wait for 99 ns;
    end process;

2. Decompose combinational processes into single-target processes:


Decomposed

    proc1d: process (c) begin
      d <= NOT c;
    end process;

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

Sorted

    proc1c: process (a, b) begin
      c <= a AND b;
    end process;

    proc1d: process (c) begin
      d <= NOT c;
    end process;

3. To sort combinational processes into topological order, move proc1d after proc1c, because d depends on c.

4. Run timed process (proc3) until suspend at wait for 3 ns;.
   • The signal a gets '1' from 0 to 3 ns.
   • The signal b gets '0' from 0 to 3 ns.

5. Run proc1c
   • The signal c gets a AND b (1 AND 0 = '0') from 0 to 3 ns.

6. Run proc1d
   • The signal d gets NOT c (NOT 0 = '1') from 0 to 3 ns.

7. Run proc2
   • The signal e gets b AND d (0 AND 1 = '0') from 0 to 3 ns.

8. Run the timed process until suspend at wait for 99 ns;, which takes us from 3 ns to 102 ns.

9. Run combinational processes in topological order to calculate values on c, d, e from 3 ns to 102 ns.

Waveforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .

    signal   initial   0 to 3 ns   3 to 102 ns
    a        U         1           1
    b        U         0           1
    c        U         0           1
    d        U         1           0
    e        U         0           0

Question: Draw the RTL waveforms that correspond to the delta-cycle waveform below.

[Figure: delta-cycle waveform showing signals a, b, c, d, e and the activity of proc1, proc2, proc3 in each simulation round, at time points 0 ns, 0 ns+1δ, 0 ns+2δ, 3 ns, 3 ns+1δ, 3 ns+2δ, 3 ns+3δ, and 102 ns]

Answer:

    signal   initial   0 to 3 ns   3 to 102 ns
    a        U         1           1
    b        U         0           1
    c        U         0           1
    d        U         1           0
    e        U         0           0


Example: Communicating State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

    huey: process
    begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;

    dewey: process
    begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    louie: process
    begin
      wait until re(clk);
      d <= '1';
      if (a >= 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

[Figure: waveforms for clk, a, and d, with the time axis running from 0 to 120 ns]


A Related Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . .

Small changes to the code can cause significant changes to the behaviour.

    riri: process
    begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;

    fifi: process
    begin
      a <= to_unsigned(0,4);
      wait until re(clk);
      while (a < 4) loop
        a <= a + 1;
        wait until re(clk);
      end loop;
    end process;

    loulou: process
    begin
      wait until re(clk);
      d <= '1';
      if (a < 2) then
        d <= '0';
        wait until re(clk);
      end if;
    end process;

[Figure: waveform axes for clk, a, and d, with the time axis running from 0 to 120 ns; signal values not shown]


1.8 VHDL and Hardware Building Blocks

This section outlines the building blocks for register-transfer-level design and how to write VHDL code for the building blocks.

1.8.1 Basic Building Blocks

    Hardware                                   VHDL
    AND, OR, NAND, NOR, XOR, XNOR              and, or, nand, nor, xor, xnor
    multiplexer (2:1 mux; also n-to-1 muxes)   if-then-else, case statement, selected assignment, conditional assignment
    adder, subtracter, negater                 +, -, -
    shifter, rotater                           sll, srl, sla, sra, rol, ror
    flip-flop                                  wait until, if-then-else, rising_edge
    memory array, register file, queue         2-d array or library component

Figure 1.19: RTL Building Blocks
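As an illustration of the multiplexer row in Figure 1.19, the sketch below writes the same 2:1 mux in each of the four styles. The entity name mux_styles and the port names are assumptions made for this example; all four outputs are intended to describe the same hardware.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: one 2:1 mux written four ways.
entity mux_styles is
  port (
    sel, d0, d1    : in  std_logic;
    z1, z2, z3, z4 : out std_logic
  );
end mux_styles;

architecture main of mux_styles is
begin
  -- 1. if-then-else inside a combinational process
  process (sel, d0, d1) begin
    if sel = '1' then
      z1 <= d1;
    else
      z1 <= d0;
    end if;
  end process;

  -- 2. case statement inside a combinational process
  process (sel, d0, d1) begin
    case sel is
      when '1'    => z2 <= d1;
      when others => z2 <= d0;
    end case;
  end process;

  -- 3. selected assignment (concurrent statement)
  with sel select
    z3 <= d1 when '1',
          d0 when others;

  -- 4. conditional assignment (concurrent statement)
  z4 <= d1 when sel = '1' else d0;
end main;
```

Every branch assigns the output, so none of the four forms implies storage.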


1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.

1.8.2.1 An Aside on Flip-Flops and Latches

flip-flop Edge sensitive: output only changes on rising (or falling) edge of clock

latch Level sensitive: output changes whenever clock is high (or low)

A common implementation of a flip-flop is a pair of latches (Master/Slave flop).

Latches are sometimes called “transparent latches”, because they are transparent (input directly connected to output) when the clock is high.

The clock to a latch is sometimes called the “enable” line.

There is more information in the course notes on timing analysis for storage devices (Section 5.2).

1.8.2.2 Deprecated Hardware

Latches

• Use flops, not latches

• Latch-based designs are susceptible to timing problems

• The transparent phase of a latch can let a signal “leak” through a latch, causing the signal to affect the output one clock cycle too early

• It’s possible for a latch-based circuit to simulate correctly, but not work in real hardware, because the timing delays on the real hardware don’t match those predicted in synthesis

T, JK, SR, etc flip-flops

• Limit yourself to D-type flip-flops

• Some FPGA and ASIC cell libraries include only D-type flip-flops. Others, such as Altera’s APEX FPGAs, can be configured as D, T, JK, or SR flip-flops.

Tri-State Buffers

• Use multiplexers, not tri-state buffers

• Tri-state designs are susceptible to stability and signal integrity problems

• Getting tri-state designs to simulate correctly is difficult; some library components don’t support tri-state signals

• Tri-state designs rely on the code never letting two signals drive the bus at the same time

• It can be difficult to check that bus arbitration will always work correctly

• Manufacturing and environmental variability can make real hardware not work correctly even if it simulates correctly

• Typical industrial practice is to avoid use of tri-state signals on a chip, but allow tri-state signals at the board level
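The advice to use multiplexers rather than tri-state buffers can be sketched as below. This is only an illustration: the entity name mux_bus, the arbitration signal grant_a, and the 8-bit width are assumptions, and the tri-state version appears only in comments for comparison.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: two sources sharing one on-chip bus through a mux.
entity mux_bus is
  port (
    grant_a : in  std_logic;                     -- assumed arbitration signal
    data_a  : in  std_logic_vector(7 downto 0);
    data_b  : in  std_logic_vector(7 downto 0);
    bus_out : out std_logic_vector(7 downto 0)
  );
end mux_bus;

architecture main of mux_bus is
begin
  -- Tri-state style (avoided on chip), shown for comparison only:
  --   bus_out <= data_a when en_a = '1' else (others => 'Z');
  --   bus_out <= data_b when en_b = '1' else (others => 'Z');
  -- Mux style: exactly one driver, so two sources can never collide.
  bus_out <= data_a when grant_a = '1' else data_b;
end main;
```

With the mux there is a single driver for bus_out, so correctness no longer depends on the two enables never being asserted together.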

Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects since 2000 that use muxes on FPGAs will need to pay royalties to PalmChip.

1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

The two code fragments below synthesize to identical hardware (flops).

If

    process (clk)
    begin
      if rising_edge(clk) then
        q <= d;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      q <= d;
    end process;

1.8.3.2 Flops with Synchronous Reset

The two code fragments below synthesize to identical hardware (flops with synchronous reset). Notice that the synchronous reset is really nothing more than an AND gate on the input.

If

    process (clk)
    begin
      if rising_edge(clk) then
        if (reset = '1') then
          q <= '0';
        else
          q <= d;
        end if;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end process;


1.8.3.3 Flops with Chip-Enable

The two code fragments below synthesize to identical hardware (flops with chip-enable lines).

If

    process (clk)
    begin
      if rising_edge(clk) then
        if (ce = '1') then
          q <= d;
        end if;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      if (ce = '1') then
        q <= d;
      end if;
    end process;

1.8.3.4 Flop with Chip-Enable and Mux on Input

The two code fragments below synthesize to identical hardware (flops with chip-enable lines and muxes on inputs).

If

    process (clk)
    begin
      if rising_edge(clk) then
        if (ce = '1') then
          if (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      if (ce = '1') then
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;


1.8.3.5 Flops with Chip-Enable, Muxes, and Reset

The two code fragments below synthesize to identical hardware (flops with chip-enable lines, muxes on inputs, and synchronous reset). Notice that the synchronous reset is really nothing more than a mux, or an AND gate on the input.

Note: The specific combination and order of tests is important to guarantee that the circuit synthesizes to a flop with a chip enable, as opposed to a level-sensitive latch testing the chip enable and/or reset followed by a flop.

Note: The chip-enable pin on the flop is connected to both ce and reset. If the chip-enable pin were not connected to reset, then the flop would ignore reset unless chip-enable was asserted.

If

    process (clk)
    begin
      if rising_edge(clk) then
        if (ce = '1' or reset = '1') then
          if (reset = '1') then
            q <= '0';
          elsif (sel = '1') then
            q <= d1;
          else
            q <= d0;
          end if;
        end if;
      end if;
    end process;

Wait

    process
    begin
      wait until rising_edge(clk);
      if (ce = '1' or reset = '1') then
        if (reset = '1') then
          q <= '0';
        elsif (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end process;

1.8.4 An Example Sequential Circuit

There are many ways to write VHDL code that synthesizes to the schematic in Figure 1.20. The major choices are:

1. Categories of signals

   (a) All signals are outputs of flip-flops or inputs (no combinational signals)

   (b) Signals include both flopped and combinational

2. Number of flopped signals per process

   (a) All flopped signals in a single process

   (b) Some processes with multiple flopped signals

   (c) Each flopped signal in its own process

3. Style of flop code

   (a) Flops use if statements

   (b) Flops use wait statements

Some examples of these different options are shown in Figures 1.21–1.24.

[Figure: schematic of and_not_reg, with inputs sel, reset, and clk, internal signal a, and output c]

    entity and_not_reg is
      port (
        reset, clk, sel : in  std_logic;
        c               : out std_logic
      );
    end;

Schematic and entity for examples of different code organizations in Figures 1.21–1.24

Figure 1.20: Schematic and entity for and_not_reg

One Process, Flops, Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . .

    architecture one_proc of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
        c <= NOT a;
      end process;
    end one_proc;

Figure 1.21: Implementation of Figure 1.20: all signals are flops, all flops in one process, flops use waits


Two Processes, Flops, Wait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . .

    architecture two_proc_wait of and_not_reg is
      signal a : std_logic;
    begin
      process begin
        wait until rising_edge(clk);
        if (reset = '1') then
          a <= '0';
        elsif (sel = '1') then
          a <= NOT a;
        else
          a <= a;
        end if;
      end process;
      process begin
        wait until rising_edge(clk);
        c <= NOT a;
      end process;
    end two_proc_wait;

Figure 1.22: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use waits


Two Processes with If-Then-Else . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . .

    architecture two_proc_if of and_not_reg is
      signal a : std_logic;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          elsif (sel = '1') then
            a <= NOT a;
          else
            a <= a;
          end if;
        end if;
      end process;
      process (clk)
      begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
    end two_proc_if;

Figure 1.23: Implementation of Figure 1.20: all signals are flops, one flop per process, flops use if-then-else


Concurrent Statements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . .

    architecture comb of and_not_reg is
      signal a, b, d : std_logic;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            a <= '0';
          else
            a <= d;
          end if;
        end if;
      end process;
      process (clk) begin
        if rising_edge(clk) then
          c <= NOT a;
        end if;
      end process;
      d <= b when (sel = '1') else a;
      b <= NOT a;
    end comb;

Figure 1.24: Implementation of Figure 1.20: flopped and combinational signals, one flop per process, flops use if-then-else

1.9 Arrays and Vectors

VHDL supports multidimensional arrays over elements of any type. The most common array is an array of std_logic signals, which has a predefined type: std_logic_vector. Throughout the rest of this section, we will discuss only std_logic_vector, but the rules apply to arrays of any type.

VHDL supports reading from and assigning to slices (aka “discrete subranges”) of vectors. The rules for working with slices of vectors are listed below and illustrated in Figure 1.25.

1. The ranges on both sides of the assignment must be the same.

2. The direction (downto or to) of each slice must match the direction of the signal declaration.

3. The direction of the target and expression may be different.


Declarations

    ----------------------------------------------------
    a, b       : in  std_logic_vector(15 downto 0);
    c, d, e    : out std_logic_vector(15 downto 0);
    ----------------------------------------------------
    ax, bx     : in  std_logic_vector(0 to 15);
    cx, dx, ex : out std_logic_vector(0 to 15);
    ----------------------------------------------------
    m, n       : in  unsigned(15 downto 0);
    p, q, r    : out unsigned(15 downto 0);
    ----------------------------------------------------
    w, x       : in  signed(15 downto 0);
    y, z       : out signed(15 downto 0)
    ----------------------------------------------------

Legal code

    c(3 downto 0) <= a(15 downto 12);
    cx(0 to 3)    <= a(15 downto 12);
    (e(3), e(4))  <= bx(12 to 13);
    (e(5), e(6))  <= b(13 downto 12);

Illegal code

    d(0 to 3)     <= a(15 to 12);           -- slice dirs must be same as decl
    e(3) & e(2)   <= b(12 to 13);           -- syntax error on &
    p(3 downto 0) <= (m + n)(3 downto 0);   -- syntax error on )(
    z(3 downto 0) <= m(15 downto 12);       -- types on lhs and rhs must match

Figure 1.25: Illustration of Rules for Slices of Vectors

1.10 Arithmetic

VHDL includes all of the common arithmetic and logical operators.

Use the VHDL arithmetic operators and let the synthesis tool choose the better implementation for you. It is almost impossible for a hand-coded implementation to beat vendor-supplied arithmetic libraries.

To use the operators, you must choose which arithmetic package you wish to use (Section 1.10.1). The arithmetic operators are overloaded, and you can usually use any mixture of constants and signals of different types that you need (Section 1.10.3). However, you might need to convert a signal from one type (e.g. std_logic_vector) to another type (e.g. integer) (Section 1.10.7).


1.10.1 Arithmetic Packages

Rushton Ch-7 covers arithmetic packages. Rushton Appendix A.5 has the code listing for the numeric_std package.

To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.

numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.

1.10.2 Shift and Rotate Operations

Shift and rotate operations are described with three-character acronyms:

〈shift/rotate〉〈left/right〉〈arithmetic/logical〉

The shift right arithmetic (sra) operation preserves the sign of the operand, by copying the most significant bit into lower bit positions.

The shift left arithmetic (sla) does the analogous operation, except that the least significant bit is copied.
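When working with the numeric_std types, shifts and rotates are commonly written with the package functions shift_left, shift_right, rotate_left, and rotate_right. The sketch below is a minimal illustration; the entity name shift_demo and its port names are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity demonstrating numeric_std shift and rotate functions.
entity shift_demo is
  port (
    u     : in  unsigned(7 downto 0);
    s     : in  signed(7 downto 0);
    u_shr : out unsigned(7 downto 0);
    s_shr : out signed(7 downto 0);
    u_rot : out unsigned(7 downto 0)
  );
end shift_demo;

architecture main of shift_demo is
begin
  -- On unsigned, shift_right is logical: zeros enter at the top.
  u_shr <= shift_right(u, 2);
  -- On signed, shift_right is arithmetic: the sign bit is copied
  -- into the vacated positions, so the sign is preserved.
  s_shr <= shift_right(s, 2);
  -- Rotate left by one: the top bit wraps around to the bottom.
  u_rot <= rotate_left(u, 1);
end main;
```

A shift by a constant amount is only a rewiring of bits, so no shifter hardware should be needed for these three assignments.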

1.10.3 Overloading of Arithmetic

The arithmetic operators +, -, and * are overloaded on signed vectors, unsigned vectors, and integers. Tables 1.1–1.4 show the different combinations of target and source types and widths that can be used.

Table 1.1: Overloading of Arithmetic Operations (+, -)

    target     src1/2     src2/1    result
    unsigned   unsigned   integer   OK
    —          unsigned   signed    fails in analysis

In these tables “—” means “don’t care”. Also, src1/2 and src2/1 mean first or second operand, and respectively second or first operand. The first line of the table means that either the first operand is unsigned and the second is an integer, or the second operand is unsigned and the first is an integer. Or, more concisely: one of the operands is unsigned and the other is integer.


1.10.4 Different Widths and Arithmetic

Table 1.2: Different Vector Widths and Arithmetic Operations (+, -)

    target   src1/2   src2/1   result
    narrow   wide     —        fails in elaboration
    wide     narrow   int      fails in elaboration
    wide     wide     —        OK
    narrow   narrow   narrow   OK
    narrow   narrow   int      OK

Example vectors

    wide     unsigned(7 downto 0)
    narrow   unsigned(4 downto 0)
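One way to deal with the failing rows of Table 1.2 is to make the widths match explicitly with the numeric_std function resize. The sketch below is an illustration under assumed names (entity width_demo, ports wide_sum and narrow_sum); it reuses the example widths of 8 and 5 bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: matching operand and target widths with resize.
entity width_demo is
  port (
    narrow     : in  unsigned(4 downto 0);
    wide       : in  unsigned(7 downto 0);
    wide_sum   : out unsigned(7 downto 0);
    narrow_sum : out unsigned(4 downto 0)
  );
end width_demo;

architecture main of width_demo is
begin
  -- wide target, narrow source plus integer:
  -- zero-extend narrow to 8 bits before adding.
  wide_sum <= resize(narrow, wide'length) + 1;

  -- narrow target, wide source: the sum is 8 bits wide;
  -- resize on unsigned keeps the low-order 5 bits.
  narrow_sum <= resize(wide + narrow, narrow'length);
end main;
```

Truncating to the narrow width discards the high-order bits, so the second assignment is only appropriate when overflow is acceptable or known not to occur.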

1.10.5 Overloading of Comparisons

Table 1.3: Overloading of Comparison Operations (=, /=, >=, >, <)

    src1/2     src2/1    result
    unsigned   integer   OK
    signed     integer   OK
    unsigned   signed    fails in analysis

1.10.6 Different Widths and Comparisons

Table 1.4: Different Vector Widths and Comparison Operations (=, /=, >=, >, <)

    src1/2   src2/1   result
    wide     —        OK
    narrow   —        OK


1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std-logic vectors, signed vectors and unsigned vectors.

If you convert between two types of the same width, then no additional hardware will be generated.

The listing below summarizes the types of these functions.

    unsigned( val : std_logic_vector ) return unsigned;
    signed( val : std_logic_vector ) return signed;

    to_integer( val : signed ) return integer;
    to_integer( val : unsigned ) return integer;

    to_unsigned( val : integer; width : natural ) return unsigned;
    to_signed( val : integer; width : natural ) return signed;

The most common need to convert between two types arises when using a signal as an index into an array. To use a signal as an index into an array, you must convert the signal into an integer using the function to_integer (Figure 1.26).

    signal i : unsigned( 3 downto 0);
    signal a : std_logic_vector(15 downto 0);
    ...
    ... a(i) ...                 -- BAD: won't typecheck
    ... a( to_integer(i) ) ...   -- OK

Avoid (or at least take care when) converting a signal into an integer and then performing arithmetic on the signal. The default size for integers is 32 bits, so sometimes when a signal is converted into an integer, the resulting signals will be 32 bits wide.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal uns_sig : unsigned(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer(uns_sig) );
    ...

Figure 1.26: Using an unsigned signal as an index to array


To convert a std_logic_vector signal into an integer, you must first say whether the signal should be interpreted as signed or unsigned. As illustrated in Figure 1.27, this is done by:

1. Convert the std_logic_vector signal to signed or unsigned, using the function signed or unsigned

2. Convert the signed or unsigned signal into an integer, using to_integer

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;
    ...
    signal bit_sig : std_logic;
    signal std_sig : std_logic_vector(7 downto 0);
    signal vec_sig : std_logic_vector(255 downto 0);
    ...
    bit_sig <= vec_sig( to_integer( unsigned( std_sig ) ) );
    ...

Figure 1.27: Using a std_logic_vector as an index to array

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns. It’s important to use idioms that your synthesis tools recognize. If you aren’t careful, you could write code that has the same behaviour as one of the idioms, but which results in inefficient or incorrect hardware. Section 1.8 described common idioms and the resulting hardware.

Most synthesis tools agree on a large set of idioms, and will reliably generate hardware for these idioms. This section is based on the idioms that Synopsys, Xilinx, Altera, and Mentor Graphics are able to synthesize. One exception is that Altera’s Quartus does not support implicit state machines (as of v5.0).

Section 1.11.1 gives rules for unsynthesizable VHDL code. Section 1.11.2 gives rules for code that is synthesizable, but violates the ece327 guidelines for good practices. The ece327 coding guidelines are designed to produce circuits suitable for FPGAs. Bad code for FPGAs produces circuits with the following features:

• latches

• asynchronous resets

• combinational loops

• multiple drivers for a signal

• tri-state buffers

We limit our definition of bad practice to code that produces undesirable hardware. Coding styles that lead to inefficient hardware might be useful in the early stages of the design process, when the focus is on functionality and not optimality. As such, inefficient code is not considered bad practice. Poor coding styles that do not affect the hardware, for example, including extraneous signals in a sensitivity list, should certainly be avoided, but fall into the general realm of programming guidelines and will not be discussed.

1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

    signal bad_signal : std_logic := '0';

Reason: In most implementation technologies, when a circuit powers up, the values on signals are completely random. Some FPGAs are an exception to this. For some FPGAs, when a chip is powered up, all flip-flops will be '0'. For other FPGAs, the initial values can be programmed.

1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment,particularly supply voltage and temperature.

1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

    -- different clock signals
    process
    begin
      wait until rising_edge(clk1);
      x <= a;
      wait until rising_edge(clk2);
      x <= a;
    end process;

    -- different clock edges
    process
    begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: processes with multiple wait statements are turned into finite state machines. The wait statements denote transitions between states. The target signals in the process are outputs of flip-flops. Using different wait conditions would require the flip-flops to use different clock signals at different times. Multiple clock signals for a single flip-flop would be difficult to synthesize, inefficient to build, and fragile to operate.

1.11.1.4 Multiple “if rising edge” in Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      if rising_edge(clk) then
        q1 <= d1;
      end if;
    end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic restrictions to make their jobs simpler.

1.11.1.5 “if rising edge” and “wait” in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.


1.11.1.6 “if rising edge” with “else” Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: Generally, an if-then-else statement synthesizes to a multiplexer. The condition that is tested in the if-then-else becomes the select signal for the multiplexer. In an if rising_edge with else, the select signal would need to detect a rising edge on clk, which isn’t feasible to synthesize.

1.11.1.7 “if rising edge” Inside a “for” Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

    process (clk) begin
      for i in 0 to 7 loop
        if rising_edge(clk) then
          q(i) <= d;
        end if;
      end loop;
    end process;

Reason: just an idiom of the synthesis tool.

Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden. Examples of for loops in E&CE will appear when describing testbenches for functional verification (Chapter 4).


Synthesizable Alternative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . .

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for loop.

    process (clk) begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q(i) <= d;
        end loop;
      end if;
    end process;

1.11.1.8 “wait” Inside of a “for loop”

wait statements in a for loop (UNSYNTHESIZABLE)

    process
    begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. while-loops with the same behaviour are synthesizable.

Note: Combinational for-loops. Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops. Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for testbenches.


Synthesizable Alternative to Wait-Inside-For . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for loop above.

    process
    begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;

1.11.2 Synthesizable, but Bad Coding Practices

Note: For some of the results in this section, the results are highly dependent upon the synthesis tool that you use and the target technology library.

1.11.2.1 Asynchronous Reset

In an asynchronous reset, the test for reset occurs outside of the test for the clock edge.

    process (reset, clk)
    begin
      if (reset = '1') then
        q <= '0';
      elsif rising_edge(clk) then
        q <= d1;
      end if;
    end process;

Asynchronous resets are bad, because if a reset occurs very close to a clock edge, some parts of the circuit might be reset in one clock cycle and some in the subsequent clock cycle. This can lead the circuit to be out of sync as it goes through the reset sequence, potentially causing erroneous internal state and output values.


1.11.2.2 Combinational “if-then” Without “else”

    process (a, b)
    begin
      if (a = '1') then
        c <= b;
      end if;
    end process;

Reason: This code synthesizes c to be a latch, and latches are undesirable.
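A common way to avoid the inferred latch is to make sure the signal is assigned on every path through the process, for example with a default assignment before the if. The sketch below assumes that '0' is an acceptable value for c when a is not '1'; the entity name no_latch is also an assumption.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: combinational process with no inferred latch.
entity no_latch is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end no_latch;

architecture main of no_latch is
begin
  process (a, b) begin
    c <= '0';            -- default value (assumed); covers the missing else
    if (a = '1') then
      c <= b;            -- overrides the default when a = '1'
    end if;
  end process;
end main;
```

Since c is now assigned for every value of a, the process describes purely combinational logic (here equivalent to a AND b).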

1.11.2.3 Bad Form of Nested Ifs

if rising_edge statement inside another if (BAD HARDWARE)

In Synopsys, with some target libraries, this design results in a level-sensitive latch whose input is a flop.

    process (ce, clk)
    begin
      if (ce = '1') then
        if rising_edge(clk) then
          q <= d1;
        end if;
      end if;
    end process;

1.11.2.4 Deeply Nested Ifs

Deeply chained if-then-else statements can lead to long chains of dependent gates, rather than checking different cases in parallel.

Slow (maybe)

    if cond1 then
      stmts1
    elsif cond2 then
      stmts2
    ...
    end if;

Fast (hopefully)

If only one of the conditions can be true at a time, then try using a case statement or some other technique that allows the conditions to be evaluated in parallel.
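As a sketch of the case-statement alternative, the process below selects one of four inputs with mutually exclusive choices instead of a chain of elsif tests. The entity name case_demo and the port names are assumptions for illustration; whether the result is actually faster depends on the synthesis tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: mutually exclusive conditions written as a case statement.
entity case_demo is
  port (
    sel            : in  std_logic_vector(1 downto 0);
    d0, d1, d2, d3 : in  std_logic;
    z              : out std_logic
  );
end case_demo;

architecture main of case_demo is
begin
  process (sel, d0, d1, d2, d3) begin
    case sel is
      when "00"   => z <= d0;
      when "01"   => z <= d1;
      when "10"   => z <= d2;
      when others => z <= d3;  -- also covers non-binary values such as 'U' or 'X'
    end case;
  end process;
end main;
```

The choices in a case statement cannot overlap, which tells the tool that no priority ordering among the conditions has to be preserved.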


1.11.3 Synthesizable, but Unpredictable Hardware

Some coding styles are synthesizable and might produce desirable hardware with a particular synthesis tool, but might be unsynthesizable or produce undesirable hardware with another tool.

• variables

• level-sensitive wait statements

• missing signals in sensitivity list

If you are using a single synthesis tool for an extended period of time, and want to get the full power of the tool, then it can be advantageous to write your code in a way that works for your tool, but might produce undesirable results with other tools.

1.12 Synthesizable VHDL Coding Guidelines

This section gives guidelines for building robust, portable, and synthesizable VHDL code. Portability is both for different simulation and synthesis tools and for different implementation technologies.

Remember, there is a world of difference between getting a design to work in simulation and getting it to work on a real FPGA. And there is also a huge difference between getting a design to work in an FPGA for a few minutes of testing and getting thousands of products to work for months at a time in thousands of different environments around the world.

The coding guidelines here are designed to help you get both your E&CE 327 project and all of your subsequent industrial designs to work.

Finally, note that there are exceptions to every rule. You might find yourself in a circumstance where your particular situation (e.g. choice of tool, target technology, etc) would benefit from bending or breaking a guideline here. Within E&CE 327, of course, there won’t be any such circumstances.

1.12.1 Signal Declarations

• Use signals, do not use variables

reason The intention of the creators of VHDL was for signals to be wires and variables to be just for simulation. Some synthesis tools allow some uses of variables, but when using variables, it is easy to create a design that works in simulation but not in real hardware.

• Use std_logic signals, do not use bit or Boolean

reason std_logic is the most commonly used signal type across synthesis tools, simulation tools, and cell libraries.

• Use in or out, do not use inout

reason inout signals are tri-state.

note If you have an output signal that you also want to read from, you might be tempted to declare the mode of the signal to be inout. A better solution is to create a new, internal signal that you both read from and write to. Then, your output signal can just read from the internal signal.

• Declare the primary inputs and outputs of chips as either std_logic or std_logic_vector. Do not use signed or unsigned for primary inputs or outputs.

reason Both the Altera tool Quartus and the Xilinx tool ngd2vhdl convert signed and unsigned vectors in entities into std_logic_vectors. If you want your same testbench to work for both functional simulation and timing simulation, you must not use signed or unsigned signals in the top-level entity of your chip.

note Signed and unsigned signals are fine inside testbenches, for non-top-level entities, and inside architectures. It is only the top-level entity that should not use signed or unsigned signals.
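A minimal sketch of two of these guidelines together: the output port is std_logic_vector, and an internal unsigned signal is both read and written inside the architecture, so inout is not needed. The entity name counter_top and the signal count_i are hypothetical names chosen for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical top-level entity: ports use only std_logic types
entity counter_top is
  port (
    clk   : in  std_logic;
    reset : in  std_logic;
    count : out std_logic_vector(7 downto 0)
  );
end counter_top;

architecture main of counter_top is
  -- Internal signal that is both read from and written to
  signal count_i : unsigned(7 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    if reset = '1' then
      count_i <= (others => '0');
    else
      count_i <= count_i + 1;  -- reads the internal signal, not the port
    end if;
  end process;

  -- The output port just reads from the internal signal
  count <= std_logic_vector(count_i);
end main;
```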

1.12.2 Flip-Flops and Latches

• Use flops, not latches (see section 1.8.2).

• Use D-flops, not T, JK, etc (see section 1.8.2).

• For every signal in your design, know whether it should be a flip-flop or combinational. Before simulating your design, examine the log file (e.g. LOG/dc_shell.log) to see if the flip-flops in your circuit match your expectations, and to check that you don’t have any latches in your design.

• Do not assign a signal to itself (e.g. a <= a; is bad). If the signal is a flop, use a chip enable to cause the signal to hold its value. If the signal is combinational, then assigning a signal to itself will cause combinational loops, which are bad.
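The chip-enable style mentioned in the last guideline might look like the sketch below. The entity hold_reg and the port name ce are hypothetical. Leaving out the else branch in a clocked process is what lets the flop hold its value, without writing q <= q.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity hold_reg is
  port (clk, ce, d : in std_logic; q : out std_logic);
end hold_reg;

architecture main of hold_reg is
begin
  process (clk) begin
    if rising_edge(clk) then
      if ce = '1' then
        q <= d;
      end if;  -- no else branch: the flop keeps its value when ce = '0'
    end if;
  end process;
end main;
```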

1.12.3 Inputs and Outputs

• Put flip flops on primary inputs and outputs of a chip

reason Creates more robust implementations. Signal delays between chips are unpredictable. Signal integrity can be a problem (remember transmission lines from E&CE 324?). Putting flip-flops on inputs and outputs of a chip provides clean boundaries between circuits.

note This only applies to primary inputs and outputs of a chip (the signals in the top-level entity). Within a chip, you should adopt a standard of putting flip-flops on either inputs or outputs of modules. Within a chip, you do not need to put flip-flops on both inputs and outputs.

1.12.4 Multiplexors and Tri-State Signals

• Use multiplexors, not tri-state buffers (see section 1.8.2).


1.12.5 Processes

• For a combinational process, the sensitivity list should contain all of the signals that are read in the process.

reason Gives consistent results across different tools. Many synthesis tools will implicitly include all signals that a process reads in its sensitivity list. This differs from the VHDL Standard. A tool that adheres to the standard will introduce latches if not all signals that are read from are included in the sensitivity list.

exception In a clocked process using an if rising_edge, it is acceptable to have only the clock in the sensitivity list.

• For a combinational process, every signal that is assigned to must be assigned to in every branch of if-then and case statements.

reason If a signal is not assigned a value in a path through a combinational process, then that signal will be a latch.

note For a clocked process, if a signal is not assigned a value in a clock cycle, then the flip-flop for that signal will have a chip-enable pin. Chip-enable pins are fine; they are available on flip-flops in essentially every cell library.

• Each signal should be assigned to in only one process.

reason Multiple processes driving the same signal is the same as having multiple gates driving the same wire. This can cause contention, short circuits, and other bad things.

exception Multiple drivers are acceptable for tri-state busses or if your implementation technology has wired-ANDs or wired-ORs. FPGAs don’t have wired-ANDs or wired-ORs.

• Separate unrelated signals into different processes

reason Grouping assignments to unrelated signals into a single process can complicate the control circuitry for that process. Each branch in a case statement or if-then-else adds a multiplexor or chip-enable circuitry.

reason Synthesis tools generally optimize each process individually. The larger a process is, the longer it will take the synthesis program to optimize the process. Also, larger processes tend to be more complicated and can cause synthesis programs to miss helpful optimizations that they would notice in smaller processes.
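The first two guidelines can be sketched as follows. The entity comb_demo and its ports are hypothetical. The default assignments at the top of the process are a common idiom (not prescribed by these notes) for making sure that every signal is assigned on every path.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity comb_demo is
  port (sel, a, b : in std_logic; y, z : out std_logic);
end comb_demo;

architecture main of comb_demo is
begin
  -- Sensitivity list contains every signal that the process reads
  process (sel, a, b) begin
    -- Default assignments: y and z get a value on every path,
    -- so neither signal becomes a latch
    y <= '0';
    z <= b;
    if sel = '1' then
      y <= a;  -- overrides the default on this path only
    end if;
  end process;
end main;
```

Removing the default assignment to y would leave y unassigned when sel = '0', which is the situation that produces a latch.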

1.12.6 State Machines

• In a state machine, illegal and unreachable states should transition to the reset state

reason Creates more robust implementations. In the field, your circuit will be subjected to illegal inputs, voltage spikes, temperature fluctuations, clock speed variations, etc. At some point in time, something weird will happen that will cause it to jump into an illegal state. Having a system reset and reboot is much better than having it generate incorrect outputs that aren’t detected.

• If your state machine has fewer than 16 states, use a one-hot encoding.

reason For n states, a one-hot encoding uses n flip-flops, while a binary encoding uses log2 n flip-flops. One-hot signals are simpler to decode, because only one bit must be checked to determine if the circuit is in a particular state. For small values of n, a one-hot signal results in a smaller and faster circuit. For large values of n, the number of signals required for a one-hot design is too great of a penalty to compensate for the simplicity of the decoding circuitry.

note Using an enumerated type for states allows the synthesis tool to choose state encodings that it thinks will work well to balance area and clock speed. Quartus uses a “modified one-hot” encoding, where the bit that denotes the reset state is inverted. That is, when the reset bit is '0', the system is in the reset state and when the reset bit is '1' the system is not in the reset state. The other bits have the normal polarity. The result is that when the system is in the reset state, all bits are '0' and when the system is in a non-reset state, two bits are '1'.

note Using your own encoding allows you to leverage knowledge about your design that the synthesis tool might not be able to deduce.
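A minimal sketch of a hand-encoded one-hot state machine in which illegal states return to the reset state. The entity fsm_demo, the state names, and the ports go and busy are all hypothetical and chosen only for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity fsm_demo is
  port (clk, reset, go : in std_logic; busy : out std_logic);
end fsm_demo;

architecture main of fsm_demo is
  -- Explicit one-hot encoding: one flip-flop per state
  subtype state_ty is std_logic_vector(2 downto 0);
  constant S_IDLE : state_ty := "001";
  constant S_RUN  : state_ty := "010";
  constant S_DONE : state_ty := "100";
  signal state : state_ty;
begin
  process begin
    wait until rising_edge(clk);
    if reset = '1' then
      state <= S_IDLE;
    else
      case state is
        when S_IDLE =>
          if go = '1' then state <= S_RUN; end if;
        when S_RUN  => state <= S_DONE;
        when S_DONE => state <= S_IDLE;
        -- Illegal or unreachable encodings go back to the reset state
        when others => state <= S_IDLE;
      end case;
    end if;
  end process;

  -- One-hot decoding: a single bit tells whether we are in S_RUN
  busy <= state(1);
end main;
```

Some synthesis tools may optimize away the when others branch if they conclude that the illegal encodings are unreachable, so it is worth checking the tool documentation for a "safe state machine" option.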

1.12.7 Reset

• Include a reset signal in all clocked circuits.

reason For most implementation technologies, when you power-up the circuit, you do not know what state it will start in. You need a reset signal to get the circuit into a known state.

reason If something goes wrong while the circuit is running, you need a way to get it into a known state.

• For implicit state machines (section 2.5.1.3), check for reset after every wait statement.

reason Missing the reset check after a wait statement means that your circuit might not notice a reset signal, or different signals could reset in different clock cycles, causing your circuit to get out of synch.

• Connect reset to the important control signals in the design, such as the state signal. Do not reset every flip-flop.

reason Using reset adds area and delay to a circuit. The fewer signals that need reset, the faster and smaller your design will be.

note Connect the reset signal to critical flip-flops, such as the state signal. Datapath signals rarely need to be reset. You do not need to reset every signal.

• Use synchronous, not asynchronous, reset

reason Creates more robust implementations. Signal propagation delays mean that asynchronous resets cause different parts of the circuit to be reset at different times. This can lead to glitches, which then might cause the circuit to move to an illegal state.
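The sketch below illustrates a synchronous reset that is connected only to a control signal, while the datapath register is left without reset. The entity reset_demo and all of its port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity reset_demo is
  port (
    clk, reset, i_valid : in  std_logic;
    i_data              : in  std_logic_vector(7 downto 0);
    o_valid             : out std_logic;
    o_data              : out std_logic_vector(7 downto 0)
  );
end reset_demo;

architecture main of reset_demo is
begin
  -- Control signal: synchronous reset, checked only at the clock edge
  ctrl : process begin
    wait until rising_edge(clk);
    if reset = '1' then
      o_valid <= '0';
    else
      o_valid <= i_valid;
    end if;
  end process;

  -- Datapath signal: no reset; its value is ignored while o_valid = '0'
  data : process begin
    wait until rising_edge(clk);
    o_data <= i_data;
  end process;
end main;
```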

Covering All Cases

When writing case statements or selected assignments that test the value of std_logic signals, you will get an error unless you include a provision for non-'1'/'0' signals.

For example:

  signal t : std_logic;
  ...
  case t is
    when '1' => ...
    when '0' => ...
  end case;

will result in an error message about missing cases. You must provide for t being 'H', 'U', etc. The simplest thing to do is to make the last test when others.
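A complete version of the example with the final when others branch might look like the sketch below. The entity case_demo and the ports a, b, and y are hypothetical. Assigning 'X' in the others branch is one possible choice; assigning a definite value is another.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only
entity case_demo is
  port (t, a, b : in std_logic; y : out std_logic);
end case_demo;

architecture main of case_demo is
begin
  process (t, a, b) begin
    case t is
      when '1'    => y <= a;
      when '0'    => y <= b;
      -- Covers 'U', 'X', 'Z', 'W', 'L', 'H', and '-'
      when others => y <= 'X';
    end case;
  end process;
end main;
```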


1.13 VHDL Problems

P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2–3 word description of the value.

Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.

P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) “...” represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a

architecture main of anchiceratops is
  signal a, b, c : std_logic;
begin
  process begin
    wait until rising_edge(c);
    a <= if (b = '1') then
           ...
         else
           ...
         end if;


q2b

architecture main of tulerpeton is
begin
  lab: for i in 15 downto 0 loop
    ...
  end loop;
end main;

q2c

architecture main of metaxygnathus is
  signal a : std_logic;
begin
  lab: if (a = '1') generate
    ...
  end generate;
end main;

q2d

architecture main of temnospondyl is
  component compa
    port (
      a : in std_logic;
      b : out std_logic
    );
  end component;
  signal p, q : std_logic;
begin
  coma_1 : compa
    port map (a => p, b => q);
  ...
end main;

q2e

architecture main of pachyderm is
  function inv(a : std_logic)
    return std_logic is
  begin
    return(NOT a);
  end inv;
  signal p, b : std_logic;
begin
  p <= inv(b => a);
  ...
end main;

q2f

architecture main of apatosaurus is
  type state_ty is (S0, S1, S2);
  signal st : state_ty;
  signal p : std_logic;
begin
  case st is
    when S0 | S1 => p <= '0';
    when others => p <= '1';
  end case;
end main;

P1.3 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

entity montevido is
  port (
    a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
    l : in std_logic_vector (1 downto 0);
    p, q, r, s, t, u, v, w, x, y, z : out std_logic
  );
end montevido;

architecture main of montevido is
  signal i, j : std_logic;
begin
  i <= c0 XOR c1;
  j <= c0 XOR c1;
  process (a, i, j) begin
    if (a = '1') then
      p <= i AND j;
    else
      p <= NOT i;
    end if;
  end process;
  process (a, b0, b1) begin
    if rising_edge(a) then
      q <= b0 AND b1;
    end if;
  end process;

  process (a, c0, c1, d0, d1, e0, e1)
  begin
    if (a = '1') then
      r <= c0 OR c1;
      s <= d0 AND d1;
    else
      r <= e0 XOR e1;
    end if;
  end process;

  process begin
    wait until rising_edge(a);
    t <= b0 XOR b1;
    u <= NOT t;
    v <= NOT x;
  end process;

  process begin
    case l is
      when "00" =>
        wait until rising_edge(a);
        w <= b0 AND b1;
        x <= '0';
      when "01" =>
        wait until rising_edge(a);
        w <= '-';
        x <= '1';
      when "1-" =>
        wait until rising_edge(a);
        w <= c0 XOR c1;
        x <= '-';
    end case;
  end process;
  y <= c0 XOR c1;
  z <= x XOR w;
end main;


P1.4 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:

1. “...” represents a legal fragment of VHDL code

2. assume all signals are properly declared

3. the VHDL code is intended to be legal, synthesizable code

4. all signals are initially 'U'

entity bigckt is
  port (
    a, b : in std_logic;
    c : out std_logic
  );
end bigckt;

architecture main of bigckt is
begin
  process (a, b)
  begin
    if (a = '0') then
      c <= '0';
    else
      if (b = '1') then
        c <= '1';
      else
        c <= '0';
      end if;
    end if;
  end process;
end main;

entity tinyckt is
  port (
    clk : in std_logic;
    i : in std_logic;
    o : out std_logic
  );
end tinyckt;

architecture main of tinyckt is
  component bigckt ( ... );
  signal ... : std_logic;
begin
  p0 : process begin
    wait until rising_edge(clk);
    p0_a <= i;
    wait until rising_edge(clk);
  end process;
  p1 : process begin
    wait until rising_edge(clk);
    p1_b <= p1_d;
    p1_c <= p1_b;
    p1_d <= s2_k;
  end process;

  p2 : process (p1_c, p3_h, p4_i, clk) begin
    if rising_edge(clk) then
      p2_e <= p3_h;
      p2_f <= p1_c = p4_i;
    end if;
  end process;

  p3 : process (i, s4_m) begin
    p3_g <= i;
    p3_h <= s4_m;
  end process;

  p4 : process (clk, i) begin
    if (clk = '1') then
      p4_i <= i;
    else
      p4_i <= '0';
    end if;
  end process;

  huge : bigckt port map (a => p2_e, b => p1_d, c => h_y);

  s1_j <= s3_l;
  s2_k <= p1_b XOR i;
  s3_l <= p2_f;
  s4_m <= p2_f;
end main;

For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change affects the destination signal?

src     dst     Num clock cycles
i       p0_a
i       p1_b
i       p1_b
i       p1_c
i       p2_e
i       p3_g
i       p4_i
s4_m    h_y
p1_b    p1_d
p2_f    s1_j
p2_f    s2_k

P1.5 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed addition.

An overflow in addition happens when the carry into the most significant bit is different from the carry out of the most significant bit.

When performing addition, for overflow to happen, both operands must have the same sign. Positive overflow occurs when adding two positive operands results in a negative sum. Negative overflow occurs when adding two negative operands results in a positive sum.
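One way to express the sign-based rule above is sketched below. This is only one possible approach, and the entity add_ovf and its port names are hypothetical. Overflow is flagged when the operands have the same sign and the sum has a different sign.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only
entity add_ovf is
  port (
    a, b     : in  std_logic_vector(7 downto 0);
    sum      : out std_logic_vector(7 downto 0);
    overflow : out std_logic
  );
end add_ovf;

architecture main of add_ovf is
  signal sum_i : signed(7 downto 0);
begin
  -- 8-bit two's-complement addition; the result wraps on overflow
  sum_i <= signed(a) + signed(b);
  sum   <= std_logic_vector(sum_i);

  -- Operands have the same sign (xnor) and the sum's sign differs (xor)
  overflow <= (a(7) xnor b(7)) and (a(7) xor sum_i(7));
end main;
```

For example, 100 + 100 has two positive operands but the 8-bit sum has its sign bit set, so overflow is '1'. For 100 + (-100) the operand signs differ, so overflow is '0'.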


P1.6 Delta-Cycle Simulation: Pong

Perform a delta-cycle simulation of the following VHDL code by drawing a waveform diagram.

INSTRUCTIONS:

1. The simulation is to be done at the granularity of simulation-steps.

2. Show all changes to process modes and signal values.

3. Each column of the timing diagram corresponds to a simulation step that changes a signal or process.

4. Clearly show the beginning and end of each simulation cycle, delta cycle, and simulation round by writing in the appropriate row a B at the beginning and an E at the end of the cycle or round.

5. End your simulation just before 20 ns.

architecture main of pong_machine is
  signal ping_i, ping_n, pong_i, pong_n : std_logic;
begin

  reset_proc: process
  begin
    reset <= '1';
    wait for 10 ns;
    reset <= '0';
    wait for 100 ns;
  end process;

  clk_proc: process
  begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  next_proc: process (clk)
  begin
    if rising_edge(clk) then
      ping_n <= ping_i;
      pong_n <= pong_i;
    end if;
  end process;

  comb_proc: process (pong_n, ping_n, reset)
  begin
    if (reset = '1') then
      ping_i <= '1';
      pong_i <= '0';
    else
      ping_i <= pong_n;
      pong_i <= ping_n;
    end if;
  end process;

end main;

P1.7 Delta-Cycle Simulation: Baku


INSTRUCTIONS:


1. The simulation is to be done at the granularity of simulation-steps.


3. Each column of the timing diagram corresponds to a simulation step.


5. Write “t=5ns” and “t=10ns” at the top of columns where time advances to 5 ns and 10 ns.

6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals have completed).

7. End your simulation just before 15 ns.

entity baku is
  port (
    clk, a, b : in std_logic;
    f : out std_logic
  );
end baku;

architecture main of baku is
  signal c, d, e : std_logic;
begin

  proc_clk: process
  begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;
  proc_extern : process
  begin
    a <= '0';
    b <= '0';
    wait for 5 ns;
    a <= '1';
    b <= '1';
    wait for 15 ns;
  end process;

  proc_1 : process (a, b, c)
  begin
    c <= a and b;
    d <= a xor c;
  end process;
  proc_2 : process
  begin
    e <= d;
    wait until rising_edge(clk);
  end process;
  proc_3 : process (c, e) begin
    f <= c xor e;
  end process;

end main;

P1.8 Clock-Cycle Simulation

Given the VHDL code for anapurna and waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

entity anapurna is
  port (
    clk, reset, sel : in std_logic;
    a, b : in unsigned(15 downto 0);
    p : out unsigned(15 downto 0)
  );
end anapurna;

architecture main of anapurna is
  type state_ty is (mango, guava, durian, papaya);
  signal y, z : unsigned(15 downto 0);
  signal state : state_ty;
begin

  proc_herzog: process
  begin
    top_loop: loop
      wait until (rising_edge(clk));
      next top_loop when (reset = '1');
      state <= durian;
      wait until (rising_edge(clk));
      state <= papaya;
      while y < z loop
        wait until (rising_edge(clk));
        if sel = '1' then
          wait until (rising_edge(clk));
          next top_loop when (reset = '1');
          state <= mango;
        end if;
        state <= papaya;
      end loop;
    end loop;
  end process;

  proc_hillary: process (clk)
  begin
    if rising_edge(clk) then
      if (state = durian) then
        z <= a;
      else
        z <= z + 2;
      end if;
    end if;
  end process;
  y <= b;
  p <= y + z;

end main;


P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?

NOTES:
1) For full marks, if the code has different behaviour, you must explain why.
2) Ignore any differences in behaviour in the first few clock cycles that are caused by initialization of flip-flops, latches, and registers.
3) All code fragments in this question are legal, synthesizable VHDL code.

entity teradactyl is
  port (
    a : in std_logic;
    v : out std_logic
  );
end teradactyl;
architecture main of teradactyl is
  signal m : std_logic;
begin
  m <= a;
  v <= m;
end main;

architecture q3a of teradactyl is
  signal b, c, d : std_logic;
begin
  b <= a;
  c <= b;
  d <= c;
  v <= d;
end q3a;

architecture q3b of teradactyl is
  signal m : std_logic;
begin
  process (a, m) begin
    v <= m;
    m <= a;
  end process;
end q3b;

architecture q3c of teradactyl is
  signal m : std_logic;
begin
  process (a) begin
    m <= a;
  end process;
  process (m) begin
    v <= m;
  end process;
end q3c;


P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?




entity ichthyostega is
  port (
    clk : in std_logic;
    b, c : in signed(3 downto 0);
    v : out signed(3 downto 0)
  );
end ichthyostega;

architecture main of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;

    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else
      v <= to_signed(-1, 4);
    end if;


architecture q4a of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin


    if (cx > 0) then
      wait until (rising_edge(clk));
      v <= bx;
    else
      wait until (rising_edge(clk));
      v <= to_signed(-1, 4);
    end if;
  end process;
end q4a;

architecture q4b of ichthyostega is
  signal bx, cx : signed(3 downto 0);
begin
  process begin
    wait until (rising_edge(clk));
    bx <= b;
    cx <= c;
    wait until (rising_edge(clk));
    if (cx > 0) then
      v <= bx;
    else


architecture q4c of ichthyostega is
  signal bx, cx, dx : signed(3 downto 0);
begin
  process begin


    wait until (rising_edge(clk));
    v <= dx;
  end process;
  dx <= bx when (cx > 0)
        else to_signed(-1, 4);
end q4c;


P1.11 Waveform — VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as the timing diagram.

NOTES:
1) “Same behaviour” means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.

[Timing diagram: waveforms for clk, a, b, and c]

q3a

architecture q3a of q3 is
begin
  process begin
    a <= '1';
    loop
      wait until rising_edge(clk);
      a <= NOT a;
    end loop;
  end process;
  b <= NOT a;
  c <= NOT b;
end q3a;

q3b

architecture q3b of q3 is
begin
  process begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= b;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= a;
end q3b;

q3c

architecture q3c of q3 is
begin
  process begin
    a <= '0';
    b <= '1';
    wait until rising_edge(clk);
    b <= a;
    a <= b;
    wait until rising_edge(clk);
  end process;
  c <= NOT b;
end q3c;

q3d

architecture q3d of q3 is
begin
  process (b, clk) begin
    a <= NOT b;
  end process;
  process (a, clk) begin
    b <= NOT a;
  end process;
  c <= NOT b;
end q3d;

q3e

architecture q3e of q3 is
begin
  process
  begin
    b <= '0';
    a <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    wait until rising_edge(clk);
  end process;
  c <= not b;
end q3e;

q3f

architecture q3f of q3 is
begin
  process begin
    a <= '1';
    b <= '0';
    c <= '1';
    wait until rising_edge(clk);
    a <= c;
    b <= a;
    c <= NOT b;
    wait until rising_edge(clk);
  end process;
end q3f;


P1.12 Hardware — VHDL Comparison

For each of the circuits q2a–q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

entity q2 is
  port (
    a, clk, reset : in std_logic;
    d : out std_logic
  );
end q2;
architecture main of q2 is
  signal b, c : std_logic;
begin
  b <= '0' when (reset = '1')
       else a;


      c <= b;
      d <= c;
    end if;
  end process;
end main;

[Circuit diagrams q2a, q2b, q2c, and q2d: each circuit has inputs clk, a, and reset, a constant 0, and output d]


P1.13 8-Bit Register

Implement an 8-bit register that has:

• clock signal clk

• input data vector d

• output data vector q

• synchronous active-high input reset

• synchronous active-high input enable

P1.13.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.

P1.13.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

P1.13.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.

P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a

process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;

q4b

process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;

q4c

process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;

q4d

process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;

q4e

process (a,b,c,d) begin
  if rising_edge(a) then
    e <= c;
  else
    if (b = '1') then
      e <= d;
    end if;
  end if;
end process;

q4f

process (a,b,c) begin
  if (b = '1') then
    e <= '0';
  else
    if rising_edge(a) then
      e <= c;
    end if;
  end if;
end process;

P1.15 Datapath Design

Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):

• read in source and destination addresses from i_src1, i_src2, i_dst

• read operands op1 and op2 from memory

• compute sum of operands sum

• write sum to memory at destination address dst

• write sum to output o_result

[Figure: datapath block with inputs i_src1, i_src2, i_dst, and clk, and output o_result]

P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

NOTES:

1. You may choose the number of clock cycles required to execute the sequence of operations.

2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.

3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.

4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a–q4c.

5. The memory has registered inputs and combinational (unregistered) outputs.

6. All of the VHDL is legal, synthesizable code.

-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b
       : unsigned(7 downto 0);
...
process (clk)
begin
  if rising_edge(clk) then
    src1 <= i_src1;
    src2 <= i_src2;
    dst <= i_dst;
    o_result <= sum;
  end if;
end process;
mem : ram256x16d
  port map (clk => clk,
            i_addr_a => mem_addr_a,
            i_addr_b => mem_addr_b,
            i_we_a => mem_we,
            i_data_a => mem_in_a,
            o_data_a => mem_out_a,
            o_data_b => mem_out_b);

q4a

op1 <= mem_out_a when state = "0010"
       else (others => '0');
op2 <= mem_out_b when state = "0010"
       else (others => '0');
sum <= op1 + op2 when state = "0100"
       else (others => '0');
mem_in_a <= sum when state = "1000"
            else (others => '0');
mem_addr_a <= dst when state = "1000"
              else src1;
mem_we <= '1' when state = "1000"
          else '0';
mem_addr_b <= src2;
process (clk)
begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;

q4b


    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1'
              else src1;
mem_addr_b <= src2;

q4c

process
begin
  wait until rising_edge(clk);
  op1 <= mem_out_a;
  op2 <= mem_out_b;
  sum <= op1 + op2;
  mem_in_a <= sum;
end process;
process (load, dst, src1) begin
  if load = '1' then
    mem_addr_a <= dst;
  else
    mem_addr_a <= src1;
  end if;
end process;
mem_addr_b <= src2;

P1.15.2 Smallest Area

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the smallest area.

If you don’t have sufficient information to predict the relative areas, explain what additional information you would need to predict the area prior to synthesizing the designs.

P1.15.3 Shortest Clock Period

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the shortest clock period.

If you don’t have sufficient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.

Chapter 2

RTL Design with VHDL: From Requirements to Optimized Code

2.1 Prelude to Chapter

2.1.1 A Note on EDA for FPGAs and ASICs

The following is from John Cooley’s column The Industry Gadfly from 2003/04/30. The title of this article is: “The FPGA EDA Slums”.

For 2001, Dataquest reported that the ASIC market was US$16.6 billion while the FPGA market was US$2.6 billion.

What’s more interesting is that the 2001 ASIC EDA market was US$2.2 billion while the FPGA EDA market was US$91.1 million. Nope, that’s not a mistake. It’s ASIC EDA and billion versus FPGA EDA and million. Do the math and you’ll see that for every dollar spent on an ASIC project, roughly 12 cents of it goes to an EDA vendor. For every dollar spent on a FPGA project, roughly 3.4 cents goes to an EDA vendor. Not good.

It’s the old free milk and a cow story according to Gary Smith, the Senior EDA Analyst at Dataquest. “Altera and Xilinx have fowled their own nest. Their free tools spoil the FPGA EDA market,” says Gary. “EDA vendors know that there’s no money to be made in FPGA tools.”

2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware

2.2.1.1 Generic FPGA Cell

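“Cell” = “Logic Element” (LE) in Altera
       = “Configurable Logic Block” (CLB) in Xilinx

[Figure: generic FPGA cell, containing a combinational block (comb) and a flip-flop with pins CE, S, R, D, and Q. Cell inputs: comb_data_in, ctrl_in, carry_in, flop_data_in. Cell outputs: carry_out, comb_data_out, flop_data_out.]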

2.2.2 Area Estimation

To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output.

We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:

1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.

2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.

Allocating gates to FPGA cells is a form of technology mapping: moving from the implementation technology of generic gates to the implementation technology of FPGA cells.

As with almost all other design tasks, allocating gates to cells is an NP-complete problem: the only way to ensure that we get the smallest design possible is to try all possible designs. To deal with NP-complete problems, design tools use heuristics or search techniques to explore efficiently a subset of the options and hopefully produce a design that is close to the absolute smallest. Because different synthesis tools use different heuristics and search algorithms, different tools will give different results.

The circuitry for any flip-flop signal with up to four source flip-flops can be implemented on a single FPGA cell. If a flip-flop signal is dependent upon five source flip-flops, then two FPGA cells are required.

Source flops/inputs    Minimum cells
 1                     1
 2                     1
 3                     1
 4                     1
 5                     2
 6                     2
 7                     2
 8                     3
 9                     3
10                     3
11                     4

For a single target signal, this technique gives a lower bound on the number of cells needed. For example, some functions of seven inputs require more than two cells. As a particular example, a four-to-one multiplexer has six inputs and requires three cells.

When dealing with multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells (common subexpression elimination).
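The table of minimum cells is consistent with a simple closed form. The formula below is an inference from the table, not something stated in the notes: the first cell accepts up to four inputs, and each additional cell contributes three new inputs, because one of its four inputs is taken up by the output of another cell. Here n is the number of source flops or inputs (the symbol n is introduced for this sketch).

```latex
\[
  \mathrm{cells}_{\min}(n) \;=\; \max\!\left(1,\; \left\lceil \frac{n-1}{3} \right\rceil\right)
\]
```

For example, n = 5 gives ⌈4/3⌉ = 2 cells and n = 11 gives ⌈10/3⌉ = 4 cells, which matches the table. As the text notes, this is only a lower bound for a single target signal.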

PLA and Flop for Different Functions

[Figure: generic FPGA cell (comb block and flip-flop with pins CE, S, R, D, Q; ports comb_data_in, ctrl_in, carry_in, carry_out, flop_data_in, flop_data_out, comb_data_out)]

PLA and Flop for Same Function

[Figure: generic FPGA cell (comb block and flip-flop with pins CE, S, R, D, Q; ports comb_data_in, ctrl_in, carry_in, carry_out, flop_data_in, flop_data_out, comb_data_out)]

Estimate Area for Circuit

To have a more accurate estimate of the area of a circuit, we begin with each flip-flop and output, then traverse backward through the fanin gathering as much combinational circuitry as possible into the FPGA cell. Usually, this means that we continue as long as we have four or fewer inputs to the cell. However, when traversing through some circuits, we will temporarily have five or more signals as input, then further back in the fanin, the circuit will collapse back to fewer than five signals.

Once we can no longer include more circuitry into an FPGA cell, we start with a fresh FPGA cell and continue to traverse backward through the fanin.

Many signals have more than one target, so many FPGA cells will be connected to multiple destinations. When choosing whether to include a gate in an FPGA cell, consider whether the gate drives multiple targets. There are two options: include the gate in an FPGA cell that drives both targets, or duplicate the gate and incorporate it into two FPGA cells. The choice of which option will lead to the smaller circuit is dependent on the details of the design.

Question: Map the combinational circuits below onto generic FPGA cells.

[Figures: three combinational circuits (inputs a to i; outputs z; then x, y, z; then w, x, y, z) and their mappings onto generic FPGA cells]


2.2.2.1 Interconnect for Generic FPGA

Note: In these slides, the space between tightly grouped wires sometimes disappears, making a group of wires appear to be a single large wire.

There are two types of wires that connect a cell to the rest of the chip:

• General purpose interconnect (configurable, slow)

• Carry chains and cascade chains (vertically adjacent cells, fast)

2.2.2.2 Blocks of Cells for Generic FPGA

Cells are organized into blocks. There is a great deal of interconnect (wires) between cells within a single block. In large FPGAs, the blocks are organized into larger blocks. These large blocks might themselves be organized into even larger blocks. Think of an FPGA as a bunch of nested for-generate statements that replicate a single component (cell) hundreds of thousands of times.
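The nested for-generate analogy can be sketched in VHDL. This is only an illustration: the entity names cell and fabric, the generics BLOCKS and CELLS, and the contents of the cell are assumptions made for this sketch, not a model of any real FPGA.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-in for one FPGA cell: a small function plus a flop.
entity cell is
  port (
    clk  : in  std_logic;
    a, b : in  std_logic;
    q    : out std_logic
  );
end cell;

architecture main of cell is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      q <= a xor b;
    end if;
  end process;
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical fabric: BLOCKS blocks, each containing CELLS copies of cell.
entity fabric is
  generic (
    BLOCKS : positive := 4;
    CELLS  : positive := 8
  );
  port (
    clk  : in  std_logic;
    a, b : in  std_logic_vector(BLOCKS*CELLS - 1 downto 0);
    q    : out std_logic_vector(BLOCKS*CELLS - 1 downto 0)
  );
end fabric;

architecture main of fabric is
begin
  -- Outer loop replicates blocks; inner loop replicates cells in a block.
  gen_block : for i in 0 to BLOCKS - 1 generate
    gen_cell : for j in 0 to CELLS - 1 generate
      u_cell : entity work.cell
        port map (
          clk => clk,
          a   => a(i*CELLS + j),
          b   => b(i*CELLS + j),
          q   => q(i*CELLS + j)
        );
    end generate gen_cell;
  end generate gen_block;
end main;
```

Adding a third level of generate around gen_block would model blocks of blocks in the same way.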


Cells not used for computation can be used as “wires” to shorten the length of the path between cells.


2.2.2.3 Clocks for Generic FPGAs

Characteristics of clock signals:

• High fanout (drive many gates)

• Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:

• Very few gates that are large (strong) enough to support a high fanout.

• Very few wires that traverse the entire chip and can be connected to every flip-flop.

2.2.2.4 Special Circuitry in FPGAs

Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . .

For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

                         Hard                             Soft
Altera                   Arm 922T with 200 MIPs           Nios with ?? MIPs
Xilinx: Virtex-II Pro    Power PC 405 with 420 D-MIPs     Microblaze with 100 D-MIPs

The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.

Arithmetic Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . .

A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Altera: Mercury          16×16 at 130 MHz
Xilinx: Virtex-II Pro    18×18 at ??? MHz

Using these resources can significantly improve both the area and performance of a design.


Input / Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . .

Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)

2.2.3 Generic-FPGA Coding Guidelines

• Flip-flops are almost free in FPGAs

reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.

• Aim for using 80–90% of the cells on a chip.

reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.

reason: If you use less than 80% of the cells, then probably:

– there are optimizations that will increase performance and still allow the design to fit on the chip;

– or you spent too much human effort on optimizing for low area;

– or you could use a smaller (cheaper!) chip.

exception: In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.

• Use just one clock signal

reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.

• Use only one edge of the clock signal

reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is discouraged by the preceding guideline.
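As a minimal sketch of the last two guidelines, the design below uses one clock signal and only its rising edge for every flip-flop. The entity name two_stage and its ports are hypothetical, chosen only for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-stage delay line: both flops share clk and
-- both are updated only on the rising edge.
entity two_stage is
  port (
    clk : in  std_logic;
    d   : in  std_logic;
    q   : out std_logic
  );
end two_stage;

architecture main of two_stage is
  signal stage1 : std_logic;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      stage1 <= d;       -- first flop
      q      <= stage1;  -- second flop, same clock, same edge
    end if;
  end process;
end main;
```

Because flip-flops are almost free, the extra register costs little area; a version that clocked the second stage on falling_edge(clk) would violate the last guideline.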


2.3 Design Flow

2.3.1 Generic Design Flow

Most people agree on the general terminology and process for a digital hardware design flow. However, each book and course has its own particular way of presenting the ideas. Here we will lay out the consistent set of definitions that we will use in E&CE 327. This might be different from what you have seen in other courses or on a work term. Focus on the ideas and you will be fine both now and in the future.

The design flow presented here focuses on the artifacts that we work with, rather than the operations that are performed on the artifacts. This is because the same operations can be performed at different points in the design flow, while the artifacts each have a unique purpose.

[Figure: design flow through the artifacts Requirements, Algorithm, High-Level Model, DP+Ctrl Code (dp/ctrl specific), Opt. RTL Code, Implementation, and Hardware, with Analyze and Modify steps between artifacts]

Figure 2.1: Generic Design Flow


Table 2.1: Artifacts in the Design Flow

Requirements              Description of what the customer wants.

Algorithm                 Functional description of computation. Probably not synthesizable.
                          Could be a flowchart, software, diagram, mathematical equation, etc.

High-Level Model          HDL code that is not necessarily synthesizable, but divides algorithm
                          into signals and clock cycles. Possibly mixes datapath and control.
                          In VHDL, could be a single process that captures the behaviour of
                          the algorithm. Usually synthesizable; resulting hardware is usually
                          big and slow compared to optimized RTL code.

Dataflow Diagram          A picture that depicts the datapath computation over time,
                          clock-cycle by clock-cycle (Section 2.6).

Hardware Block Diagram    A picture that depicts the structure of the datapath: the components
                          and the connections between the components (e.g., netlist or
                          schematic).

State Machine             A picture that depicts the behaviour of the control circuitry over
                          time (Section 2.5).

DP+Ctrl RTL Code          Synthesizable HDL code that separates the datapath and control into
                          separate processes and assignments.

Optimized RTL Code        HDL code that has been written to meet design goals (high
                          performance, low power, small, etc.).

Implementation Code       A collection of files that include all of the information needed to
                          build the circuit: HDL program targeted for a particular
                          implementation technology (e.g. a specific FPGA chip), constraint
                          files, script files, etc.

Note: Recommendation. Spend the time up front to plan a good design on paper. Use dataflow diagrams and state machines to predict performance and area. The E&CE 327 project might appear to be sufficiently small and simple that you can go straight to RTL code. However, you will probably produce a more optimal design with less effort if you explore high-level optimizations with dataflow diagrams and state machines.

2.3.2 Implementation Flows

Synopsys Design Compiler and FPGA Compiler are general-purpose synthesis programs. They have very few, if any, technology-specific algorithms. Instead, they rely on libraries to describe technology-specific parameters of the primitive building blocks (e.g. the delay and area of individual gates, PLAs, CLBs, flops, memory arrays).

Mentor Graphics’ product Leonardo Spectrum, Cadence’s product BuildGates, and Synplicity’s product Synplify are similar. In comparison, Avant! (now owned by Synopsys) and Cadence sell separate tools that do place-and-route and other low-level (physical design) tasks.

These general-purpose synthesis tools do not (generally) do the final stages of the design, such as place-and-route and timing analysis, which are very specific to a given implementation technology. The implementation-technology-specific tools generally also produce a VHDL file that accurately models the chip. We will refer to this file as the “implementation VHDL code”.

With Synopsys and the Altera tool Quartus, we compile the VHDL code into an EDIF file for the netlist and a TCL file for the commands to Quartus. Quartus then generates a sof (SRAM Object File), which can be downloaded to an Altera SRAM-based FPGA. The extension of the implementation VHDL file is often .vho, for “VHDL output”.

With the Synopsys and Xilinx tools, we compile VHDL code into a Xilinx-specific design file (xnf, for Xilinx netlist file). We then use the Xilinx tools to generate a bit file, which can be downloaded to a Xilinx FPGA. The name of the implementation VHDL file is often suffixed with routed.vhd.

Terminology: “Behavioural” and “Structural” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Note: behavioural and structural models. The phrases “behavioural model” and “structural model” are commonly used for what we’ll call “high-level models” and “synthesizable models”. In most cases, what people call structural code contains both structural and behavioural code. The technically correct definition of a structural model is an HDL program that contains only component instantiations and generate statements. Thus, even a program with c <= a AND b; is, strictly speaking, behavioural.
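To make the strict definition concrete, here is a small sketch. The entities and2 and and3 are hypothetical names invented for this illustration. The first architecture is behavioural in the strict sense because it contains a concurrent assignment; the second is structural because it contains only instantiations.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (a, b : in std_logic; c : out std_logic);
end and2;

-- Strictly behavioural: the assignment describes what c is,
-- not which components are connected.
architecture behav of and2 is
begin
  c <= a and b;
end behav;

library ieee;
use ieee.std_logic_1164.all;

entity and3 is
  port (x, y, z : in std_logic; r : out std_logic);
end and3;

-- Strictly structural: only instantiations and the wires between them.
architecture struct of and3 is
  signal xy : std_logic;
begin
  u1 : entity work.and2 port map (a => x,  b => y, c => xy);
  u2 : entity work.and2 port map (a => xy, b => z, c => r);
end struct;
```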

2.3.3 Design Flow: Datapath vs Control vs Storage

2.3.3.1 Classes of Hardware

Each circuit tends to be dominated by either its datapath, control (state machine) or storage (memory).

• Datapath

– Purpose: compute output data based on input data

– Each “parcel” of input produces one “parcel” of output

– Examples: arithmetic, decoders


• Storage

– Purpose: hold data for future use

– Data is not modified while stored

– Examples: register files, FIFO queues

• Control

– Purpose: modify internal state based on inputs, compute outputs from state and inputs

– Mostly individual signals, few data (vectors)

– Examples: bus arbiters, memory-controllers

All three classes of circuits (datapath, control, and storage) follow the same generic design flow (Figure 2.1) and use dataflow diagrams, hardware block diagrams, and state machines. The differences in the design flows appear in the relative amount of effort spent on each type of description and the order in which the different descriptions are used. The differences are most pronounced in the transition from the high-level model to the model that separates the datapath and control circuitry.

2.3.3.2 Datapath-Centric Design Flow

[Figure: flow with the artifacts High-Level Model, Dataflow, Block Diagram, State Machine, and DP+Ctrl RTL Code, with Analyze and Modify steps]

Figure 2.2: Datapath-Centric Design Flow


2.3.3.3 Control-Centric Design Flow

[Figure: flow with the artifacts High-Level Model, State Machine, Dataflow Diagram, Block Diagram, and DP+Ctrl RTL Code, with Analyze and Modify steps]

Figure 2.3: Control-Centric Design Flow

2.3.3.4 Storage-Centric Design Flow

In E&CE 327, we won’t be discussing storage-centric design. Storage-centric design differs from datapath- and control-centric design in that storage-centric design focusses on building many replicated copies of small cells.

Storage-centric designs include a wide range of circuits, from simple memory arrays to complicated circuits such as register files, translation lookaside buffers, and caches. The complicated circuits can contain large and very intricate state machines, which would benefit from some of the techniques for control-centric circuits.

2.4 Algorithms and High-Level Models

For designs with significant control flow, algorithms can be described in software languages, flowcharts, abstract state machines, algorithmic state machines, etc.

For designs with trivial control flow (e.g. every parcel of input data undergoes the same computation), data-dependency graphs (Section 2.4.2) are a good way to describe the algorithm.

For designs with a small amount of control flow (e.g. a microprocessor, where a single decision is made based upon the opcode) a set of data-dependency graphs is often a good choice.


Software executes in series; hardware executes in parallel.

When creating an algorithmic description of your hardware design, think about how you can represent parallelism in the algorithmic notation that you are using, and how you can exploit parallelism to improve the performance of your design.

2.4.1 Flow Charts and State Machines

Flow charts and various flavours of state machines are covered well in many courses. Generally, everything that you’ve learned about these forms of description is also applicable in hardware design.

In addition, you can exploit parallelism in state machine design to create communicating finite state machines. A single complex state machine can be factored into multiple simple state machines that operate in parallel and communicate with each other.

2.4.2 Data-Dependency Graphs

In software, the expression (((((a + b) + c) + d) + e) + f) takes the same amount of time to execute as (a + b) + (c + d) + (e + f).

But, remember: hardware runs in parallel. In algorithmic descriptions, parentheses can guide parallel vs serial execution.

Data-dependency graphs capture algorithms of datapath-centric designs.

Datapath-centric designs have few, if any, control decisions: every parcel of input data undergoes the same computation.

                          Serial                     Parallel
Expression                (((((a+b)+c)+d)+e)+f)      (a+b)+(c+d)+(e+f)
Adders on longest path    5 (slower)                 3 (faster)
Adders used               5 (equal area)             5 (equal area)

[Figure: data-dependency graphs over the inputs a, b, c, d, e, f for the serial and parallel forms]
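The two groupings can be written directly in VHDL. This is a sketch under assumptions: the entity name add6, the 8-bit width, and the use of ieee.numeric_std are choices made for this illustration, and a synthesis tool is free to restructure either expression.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: the same sum computed with two groupings.
entity add6 is
  port (
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z_serial         : out unsigned(7 downto 0);
    z_parallel       : out unsigned(7 downto 0)
  );
end add6;

architecture main of add6 is
begin
  -- Serial: each adder waits for the previous one (5 adders on the longest path).
  z_serial   <= ((((a + b) + c) + d) + e) + f;

  -- Parallel: three independent additions feed two more adders
  -- (3 adders on the longest path, still 5 adders in total).
  z_parallel <= ((a + b) + (c + d)) + (e + f);
end main;
```

Both outputs carry the same value (the sum modulo 256 at this width); only the depth of the adder chain differs.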


2.4.3 High-Level Models

There are many different types of high-level models, depending upon the purpose of the model and the characteristics of the design that the model describes. Some models may capture power consumption, others performance, others data functionality.

High-level models are used to estimate the most important design metrics very early in the design cycle. If power consumption is more important than performance, then you might write high-level models that can predict the power consumption of different design choices, but which have no information about the number of clock cycles that a computation takes, or which predict the latency inaccurately. Conversely, if performance is important, you might write clock-cycle accurate high-level models that do not contain any information about power consumption.

Conventionally, performance has been the primary design metric. Hence, high-level models that predict performance are more prevalent and better understood than other types of high-level models. There are many research and entrepreneurial opportunities for people who can develop tools and/or languages for high-level models for estimating power, area, maximum clock speed, etc.

In E&CE 327 we will limit ourselves to the well-understood area of high-level models for performance prediction.


2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . .

• Outputs are dependent upon only the state

• No combinational paths from inputs to outputs

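[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 (state/output); the transitions out of s0 are labelled a and !a]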

Mealy Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

• Outputs are dependent upon both the state and the inputs

• Combinational paths from inputs to outputs

[Figure: Mealy state diagram with states s0, s1, s2, s3; transitions labelled input/output: a/1, !a/0, /0, /0]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.


Design Decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . .

• Moore vs Mealy (Sections 2.5.2 and 2.5.3)

• Implicit vs Explicit (Section 2.5.1.3)

• State values in explicit state machines: Enumerated type vs constants (Section 2.5.5.1)

• State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)

VHDL Constructs for State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The following VHDL control constructs are useful to steer the transition from state to state:

• if ... then ... else

• case

• for ... loop

• while ... loop

• loop

• next

• exit

2.5.1.3 Explicit vs Implicit State Machines

There are two broad styles of writing state machines in VHDL: explicit and implicit. “Explicit” and “implicit” refer to whether there is an explicit state signal in the VHDL code. Explicit state machines have a state signal in the VHDL code. Implicit state machines do not contain a state signal. Instead, they use VHDL processes with multiple wait statements to control the execution.

In the explicit style of writing state machines, each process has at most one wait statement. For the explicit style of writing state machines, there are two sub-categories: “current state” and “current+next state”.

In the explicit-current style of writing state machines, the state signal represents the current state of the machine and the signal is assigned its next value in a clocked process.

In the explicit-current+next style, there is a signal for the current state and another signal for the next state. The next-state signal is assigned its value in a combinational process or concurrent statement and is dependent upon the current state and the inputs. The current-state signal is assigned its value in a clocked process and is just a flopped copy of the next-state signal.

For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg.

In Mentor Graphics tools, the state signal is named STATEVAR.

We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_next).


As with all topics in E&CE 327, there are tradeoffs between these different styles of writing state machines. Most books teach only the explicit-current+next style. This style is the closest to the hardware, which means that it is more amenable to optimization through human intervention, rather than relying on a synthesis tool for optimization. The advantage of the implicit style is that it is concise and readable for control flows consisting of nested loops and branches (e.g. the type of control flow that appears in software). For control flows that have less structure, it can be difficult to write an implicit state machine. Very few books or synthesis manuals describe multiple-wait-statement processes, but they are relatively well supported among synthesis tools.

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0 and four transitions labelled a, !a, !a, a]

Note: The terminology of “explicit” and “implicit” is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having “implicit state machines”. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

2.5.2 Implementing a Simple Moore Machine

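[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0; the transitions out of s0 are labelled a and !a]

library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;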


2.5.2.1 Implicit Moore State Machine

architecture moore_implicit_v1a of simple is
begin
  process
  begin
    z <= '0';                      -- output in s0
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';                    -- output in s1
    else
      z <= '0';                    -- output in s2
    end if;
    wait until rising_edge(clk);
    z <= '0';                      -- output in s3
    wait until rising_edge(clk);
  end process;
end moore_implicit_v1a;

Flops   3
Gates   2
Delay   1 gate


2.5.2.2 Explicit Moore with Flopped Output

architecture moore_explicit_v1 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- z is flopped, so it is assigned the output of the next state
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
            z     <= '1';
          else
            state <= s2;
            z     <= '0';
          end if;
        when s1 | s2 =>
          state <= s3;
          z     <= '0';
        when s3 =>
          state <= s0;
          z     <= '0';  -- the output in s0 is 0
      end case;
    end if;
  end process;
end moore_explicit_v1;

Flops   3
Gates   10
Delay   3 gates


2.5.2.3 Explicit Moore with Combinational Outputs





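architecture moore_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;
  -- combinational output: depends only on the current state
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v2;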



2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

architecture moore_explicit_v3 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;  -- current state is a flopped copy of next state
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v3;

Flops   2
Gates   7
Delay   4

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.


2.5.2.5 Explicit-Current+Next Moore with Combinational Process

architecture moore_explicit_v4 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  process (state, a)
  begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;
  z <= '1' when (state = s1)
       else '0';
end moore_explicit_v4;

For this architecture, we change the concurrent assignment to state_nxt into a combinational process using a case statement.

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2 and moore_explicit_v3.


2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This is the same entity as for the simple Moore state machine. The behaviour of the Mealy machine is the same as the Moore machine, except for the timing relationship between the output (z) and the input (a).

[Figure: Mealy state diagram with states s0, s1, s2, s3; transitions labelled input/output: a/1, !a/0, /0, /0]

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;


2.5.3.1 Implicit Mealy State Machine

Note: An implicit Mealy state machine is nonsensical.

In an implicit state machine, we do not have a state signal. But, as the example below illustrates, to create a Mealy state machine we must have a state signal.

An implicit style is a nonsensical choice for Mealy state machines. Because the output is dependent upon the input in the current clock cycle, the output cannot be a flop. For the output to be combinational and dependent upon both the current state and the current input, we must create a state signal that we can read in the assignment to the output. Creating a state signal obviates the advantages of using an implicit style of state machine.

architecture mealy_implicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process
  begin
    state <= s0;
    wait until rising_edge(clk);
    if (a = '1') then
      state <= s1;
    else
      state <= s2;
    end if;
    wait until rising_edge(clk);
    state <= s3;
    wait until rising_edge(clk);
  end process;
  -- combinational output: depends on the current state and the current input
  z <= '1' when (state = s0) and a = '1'
       else '0';
end mealy_implicit;


2.5.3.2 Explicit Mealy State Machine

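architecture mealy_explicit of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when others =>
          state <= s0;
      end case;
    end if;
  end process;
  z <= '1' when (state = s0) and a = '1'
       else '0';
end mealy_explicit;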



2.5.3.3 Explicit-Current+Next Mealy

architecture mealy_explicit_v2 of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      state <= state_nxt;
    end if;
  end process;
  state_nxt <= s1 when (state = s0) and a = '1'
          else s2 when (state = s0) and a = '0'
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s0) and a = '1'
       else '0';
end mealy_explicit_v2;


For the Mealy machine, the explicit-current+next style is smaller than the explicit-current style. In contrast, for the Moore machine, the two styles produce exactly the same hardware.


2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines.

It is important that reset is tested on every clock cycle; otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . .

With an implicit state machine, we need to insert a loop in the process and test for reset after each wait statement.

Here is the implicit Moore machine from Section 2.5.2.1 with reset code added (marked by comments).

-- reset is assumed to be an additional input port of the entity
architecture moore_implicit of simple is
begin
  process
  begin
    init : loop                              -- outermost loop
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');          -- test for reset
      if (a = '1') then
        z <= '1';
      else
        z <= '0';
      end if;
      wait until rising_edge(clk);
      next init when (reset = '1');          -- test for reset
      z <= '0';
      wait until rising_edge(clk);
      next init when (reset = '1');          -- test for reset
    end loop init;
  end process;
end moore_implicit;


Reset with Explicit State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . .

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state.

The pattern for an explicit-current style of machine is:


process (clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      if ... then
        state <= ...;
      elsif ... then
        ... -- more tests and assignments to state
      end if;
    end if;
  end if;
end process;

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:



process (clk)
begin
  if rising_edge(clk) then
    if (reset = '1') then
      state <= s0;
    else
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end if;
end process;
z <= '1' when (state = s1)
     else '0';


The pattern for an explicit-current+next style is:


process (clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      state_cur <= <reset state>;
    else
      state_cur <= state_nxt;
    end if;
  end if;
end process;

2.5.5 State Encoding

When working with explicit state machines, we must address the issue of state encoding: what bit-vector value to associate with each state?

With implicit state machines, we do not need to worry about state encoding. The synthesis program determines the number of states and the encoding for each state.

2.5.5.1 Constants vs Enumerated Type

Using an enumerated type, the synthesis tool chooses the encoding:

type state_ty is (s0, s1, s2, s3);
signal state : state_ty;

Using constants, we choose the encoding:

subtype state_ty is std_logic_vector(1 downto 0);
constant s0 : state_ty := "11";
constant s1 : state_ty := "10";
constant s2 : state_ty := "00";
constant s3 : state_ty := "01";
signal state : state_ty;

Providing Encodings for Enumerated Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Many synthesizers allow the user to provide hints on how to encode the states, or allow the user to provide explicitly the desired encoding. These hints are given either through VHDL attributes or special comments in the code.
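As one hedged illustration of the attribute approach: the IEEE 1076.6 synthesis standard describes an ENUM_ENCODING attribute, which a number of synthesizers accept. Whether a particular tool honours it, and which of its options override it, is tool-specific, so check the tool manual. The package name state_pkg is hypothetical.

```vhdl
-- Hypothetical package: asks the synthesizer to use the same codes
-- as the constants example (s0="11", s1="10", s2="00", s3="01").
package state_pkg is
  type state_ty is (s0, s1, s2, s3);

  -- The attribute is declared by the user; support for it is tool-specific.
  attribute enum_encoding : string;
  attribute enum_encoding of state_ty : type is "11 10 00 01";
end state_pkg;
```

The encoding string lists one code per enumeration literal, in declaration order.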


Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .

When doing functional simulation with enumerated types, simulators often display waveforms with “pretty-printed” values rather than bits (e.g. s0 and s1 rather than 11 and 10). However, when simulating a design that has been mapped to gates, the enumerated type disappears and you are left with just bits. If you don’t know the encoding that the synthesis tool chose, it can be very difficult to debug the design.

However, relying on a when others branch opens you up to potential bugs if the enumerated type you are testing grows to include more values, which then end up unintentionally executing your when others branch, rather than having a special branch of their own in the case statement.

Unused Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . .

If the number of values you have in your datatype is not a power of two, then you will have some unused values that are representable.

For example:

subtype state_ty is std_logic_vector(2 downto 0);
constant s0 : state_ty := "010";
constant s1 : state_ty := "000";
constant s2 : state_ty := "001";
constant s3 : state_ty := "011";
constant s4 : state_ty := "101";
signal state : state_ty;

This type only needs five unique values, but can represent eight different values. What should we do with the three representable values that we don’t need? The safest thing to do is to code your design so that if an illegal value is encountered, the machine resets or enters an error state.
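A minimal sketch of the "reset on illegal value" approach follows. The entity safe_fsm, its ports, the s0 to s4 cycle, and the particular encoding (chosen so that all five codes are distinct) are assumptions made for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical five-state machine that cycles s0 -> s1 -> ... -> s4 -> s0.
entity safe_fsm is
  port (
    clk, reset : in  std_logic;
    state_out  : out std_logic_vector(2 downto 0)
  );
end safe_fsm;

architecture main of safe_fsm is
  subtype state_ty is std_logic_vector(2 downto 0);
  constant s0 : state_ty := "010";
  constant s1 : state_ty := "000";
  constant s2 : state_ty := "001";
  constant s3 : state_ty := "011";
  constant s4 : state_ty := "101";
  signal state : state_ty;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;
      else
        case state is
          when s0     => state <= s1;
          when s1     => state <= s2;
          when s2     => state <= s3;
          when s3     => state <= s4;
          when s4     => state <= s0;
          -- "100", "110", "111" (and metavalues in simulation):
          -- illegal, so return to the reset state
          when others => state <= s0;
        end case;
      end if;
    end if;
  end process;
  state_out <= state;
end main;
```

Replacing s0 in the others branch with a dedicated error state would implement the second option mentioned above, at the cost of a sixth code.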

2.5.5.2 Encoding Schemes

• Binary: Conventional binary counter.

• One-hot: Exactly one bit is asserted at any time.

• Modified one-hot: Altera’s Quartus synthesizer generates an almost-one-hot encoding where the bit representing the reset state is inverted. This means that the reset state is all ’0’s and all other states have two ’1’s: one for the reset state and one for the current state.

• Gray: Transition between adjacent values requires exactly one bit flip.

• Custom: Choose encoding to simplify combinational logic for specific task.
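For a four-state machine that steps through s0, s1, s2, s3 in order, the first three schemes could be written as constants like this. The package and constant names are hypothetical, and the one-hot codes are the plain scheme, not the modified Quartus variant.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package comparing encodings for the same four states.
package encodings_pkg is
  -- Binary: two flops, codes count upward.
  subtype bin_ty is std_logic_vector(1 downto 0);
  constant bin_s0 : bin_ty := "00";
  constant bin_s1 : bin_ty := "01";
  constant bin_s2 : bin_ty := "10";
  constant bin_s3 : bin_ty := "11";

  -- Gray: two flops, one bit flips per step (including s3 back to s0).
  subtype gray_ty is std_logic_vector(1 downto 0);
  constant gray_s0 : gray_ty := "00";
  constant gray_s1 : gray_ty := "01";
  constant gray_s2 : gray_ty := "11";
  constant gray_s3 : gray_ty := "10";

  -- One-hot: one flop per state, exactly one bit asserted.
  subtype hot_ty is std_logic_vector(3 downto 0);
  constant hot_s0 : hot_ty := "0001";
  constant hot_s1 : hot_ty := "0010";
  constant hot_s2 : hot_ty := "0100";
  constant hot_s3 : hot_ty := "1000";
end encodings_pkg;
```

With one-hot, testing for a state reduces to reading a single bit, which is one reason its combinational logic tends to be small.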


Tradeoffs in Encoding Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .

• Gray is good for low-power applications where consecutive data values typically differ by 1 (e.g. no random jumps).

• One-hot usually has less combinational logic and runs faster than binary for machines with up to a dozen or so states. With more than a dozen states, the extra flip-flops required by one-hot encoding become too expensive.

• Custom is great if you have lots of time and are incredibly intelligent, or have deep insight into the guts of your design.

Note: Don’t care values. When we don’t care what the value of a signal is, we assign the signal ’-’, which is “don’t care” in VHDL. This should allow the synthesis tool to use whatever value is most helpful in simplifying the Boolean equations for the signal (e.g. Karnaugh maps). In the past, some groups in E&CE 327 have used ’-’ quite successfully to decrease the area of their design. However, a few groups found that using ’-’ increased the size of their design, when they were expecting it to decrease the size. So, if you are tweaking your design to squeeze out the last few unneeded FPGA cells, pay close attention as to whether using ’-’ hurts or helps.
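A small sketch of a don't-care assignment, assuming (for this illustration only) that the input value "11" never occurs. The entity name dc_decode and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical decoder: sel = "11" is assumed never to occur,
-- so the output for that case is left as don't care.
entity dc_decode is
  port (
    sel : in  std_logic_vector(1 downto 0);
    y   : out std_logic_vector(1 downto 0)
  );
end dc_decode;

architecture main of dc_decode is
begin
  process (sel)
  begin
    case sel is
      when "00"   => y <= "01";
      when "01"   => y <= "10";
      when "10"   => y <= "11";
      when others => y <= "--";  -- don't care: the tool may pick any value
    end case;
  end process;
end main;
```

In simulation the output shows '-' for the unused case; after synthesis it is whatever value the tool found convenient, which is why the area can go either way.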

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

• Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

• Purpose:

– Provide a disciplined approach for designing datapath-centric circuits

– Guide the design from algorithm, through high-level models, and finally to register-transfer-level code for the datapath and control circuitry.

– Estimate area and performance

– Make tradeoffs between different design options

• Background

– Based on techniques from high-level synthesis tools

– Some similarity between high-level synthesis and software compilation

– Each dataflow diagram corresponds to a basic block in software compiler terminology.


[Figure: data-dependency graph for z = a + b + c + d + e + f (inputs a to f, five additions, intermediate values x1 to x4, output z)]

[Figure: dataflow diagram for z = a + b + c + d + e + f (the same graph divided into clock cycles)]

[Figure: the same dataflow diagram, annotated: horizontal lines mark clock cycle boundaries]

The use of memory arrays in dataflow diagrams is described in section 2.11.4.


2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, and x) for a primary input]

Register Input

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i, and x) for a register input]

Register Signal

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, and x) for a registered signal x computed from i1 and i2 by an adder]

Combinational-Component Output

[Figure: dataflow diagram, hardware, and behaviour (waveforms for clk, i1, i2, and x) for a combinational output x computed from i1 and i2 by an adder]

2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs

[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycles 0 to 6, and waveforms for clk, a, x1, x2, x3, x4, x5, and z]

Execution Without Output Registers

[Figure: dataflow diagram for z = a + b + c + d + e + f with clock cycles 0 to 5, and waveforms for clk, a, x1, x2, x3, x4, x5, and z]


2.6.4 Performance Estimation

Performance Equations

$$\text{Performance} \propto \frac{1}{\text{TimeExec}}$$

$$\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$$

Definition Latency: Number of clock cycles from inputs to outputs. A combinational circuit has latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.

There is much more information on performance in chapter 3, which is devoted to performance.

Performance of Dataflow Diagrams

• Latency: count horizontal lines in diagram

• Min clock period (max clock speed) limited by longest path in a clock cycle

2.6.5 Area Estimation

• Maximum number of blocks in a clock cycle is the total number of that component that are needed

• Maximum number of signals that cross a cycle boundary is the total number of registers that are needed

• Maximum number of unconnected signal tails in a clock cycle is the total number of inputs that are needed

• Maximum number of unconnected signal heads in a clock cycle is the total number of outputs that are needed

The information above is only for estimating the number of components that are needed. In fact, these estimates give lower bounds. There might be constraints on your design that will force you to use more components (e.g., you might need to read all of your inputs at the same time).

Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

Of particular relevance to FPGAs:

• With some FPGA chips, a 2:1 multiplexer has the same area as an adder.

• With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.

• In FPGAs, registers are usually “free”, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

In comparison, with ASICs and custom VLSI, 2:1 multiplexers are much smaller than adders, and registers are quite expensive in area.

2.6.6 Design Analysis

[Figure: dataflow diagram for z = a + b + c + d + e + f with one addition per clock cycle (intermediate values x1 to x4)]

num inputs         6
num outputs        1
num registers      6
num adders         1
min clock period   delay through flop and one adder
latency            6 clock cycles

2.6.7 Area / Performance Tradeoffs

[Figure: dataflow diagrams for the two designs. One add per clock cycle: clock cycles 0 to 6. Two adds per clock cycle: clock cycles 0 to 4.]

Note: In the “Two-add” design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle

[Figure: dataflow diagram (clock cycles 0 to 4) and waveforms for clk, a, x1, x2, x3, x4, x5, and z for the two-add design]

Design Comparison

               One add per clock cycle   Two adds per clock cycle
inputs         6                         6
outputs        1                         1
registers      6                         6
adders         1                         2
clock period   flop + 1 add              flop + 2 add
latency        6                         4

Question: Under what circumstances would each design option be fastest?

Answer: time = latency × clock period

Compare the execution times for both options, where $T_f$ is the delay through a flop and $T_a$ is the delay through an adder:

$$T_1 = 6 \times (T_f + T_a)$$

$$T_2 = 4 \times (T_f + 2 T_a)$$

One-add will be faster when $T_1 < T_2$:

$$6 (T_f + T_a) < 4 (T_f + 2 T_a)$$

$$6 T_f + 6 T_a < 4 T_f + 8 T_a$$

$$2 T_f < 2 T_a$$

$$T_f < T_a$$

Sanity check: If add is slower than flop, then we want to minimize the number of adds. One-add has fewer adds, so one-add will be faster when add is slower than flop.
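As a quick numeric check of this result, the following plugs two hypothetical sets of delays into the execution-time equations. The nanosecond values are assumptions chosen for illustration, not measurements of any real technology.

```latex
% Case 1 (assumed): T_f = 1 ns, T_a = 4 ns, so T_f < T_a
T_1 = 6 \times (1 + 4) = 30\ \text{ns}, \qquad
T_2 = 4 \times (1 + 2 \times 4) = 36\ \text{ns}
% One-add is faster, as predicted by T_f < T_a.

% Case 2 (assumed): T_f = 4 ns, T_a = 1 ns, so T_f > T_a
T_1 = 6 \times (4 + 1) = 30\ \text{ns}, \qquad
T_2 = 4 \times (4 + 2 \times 1) = 24\ \text{ns}
% Two-add is faster once the flop delay dominates.
```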


2.7 Design Example: Massey

We’ll go through the following artifacts:

1. requirements

2. algorithm

3. dataflow diagram

4. high-level models

5. hardware block diagram

6. RTL code for datapath

7. state machine

8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)

2. I/O allocation

3. First high-level model

4. Register allocation

5. Datapath allocation

6. Connect datapath components, insert muxes where needed

7. Design implicit state machine

8. Optimize

9. Design explicit-current state machine

10. Optimize

2.7.1 Requirements

Functional requirements:

• Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f

• Use registers on both inputs and outputs

Performance requirements:

• Maximum clock period: unlimited

• Maximum latency: four

Cost requirements:

• Maximum of two adders


• Small miscellaneous hardware (e.g. muxes) is unlimited

• Maximum of three inputs and one output

• Design effort is unlimited

Note: In reality multiplexers are not free. In FPGAs, a 2:1 mux is more expensive than a full-adder. A 2:1 mux has three inputs while an adder has only two inputs (the carry-in and carry-out signals usually use the special “vertical” connections on the FPGA cell). In FPGAs, sharing an adder between two signals can be more expensive than having two adders. In a “generic-gate” technology, a multiplexer contains three two-input gates, while a full-adder contains fourteen two-input gates.

2.7.2 Algorithm

We’ll use parentheses to group operations so as to maximize our opportunities to perform the work in parallel:

z = (a + b) + (c + d) + (e + f)

This results in the following data-dependency graph:

[Figure: data-dependency graph for z = (a + b) + (c + d) + (e + f) (inputs a to f, five additions)]

2.7.3 Initial Dataflow Diagram

[Figure: initial dataflow diagram (inputs a, b, c, d, e, f; five additions; output z)]

This dataflow diagram violates the requirement to use at most three inputs.


2.7.4 Dataflow Diagram Scheduling

We can potentially optimize the inputs, outputs, area, and performance of a dataflow diagram by rescheduling the operations, that is, allocating the operations to different clock cycles.

Parallel algorithms have higher performance and greater scheduling flexibility than serial algorithms.

Serial algorithms tend to have less area than parallel algorithms.

Serial:   (((((a+b)+c)+d)+e)+f)
Parallel: (a+b)+(c+d)+(e+f)

[Figure: dataflow diagrams for the serial and the parallel algorithm (inputs a to f, five additions each)]

Scheduling to Optimize Area

[Figure: the original parallel dataflow diagram and the parallel dataflow diagram after scheduling]


Scheduling to Optimize Inputs

Rescheduling the dataflow diagram from the parallel algorithm reduced the area from three adders to two. However, it still violates the restriction of a maximum of three inputs. We can reschedule the operations to keep the same area, but reduce the number of inputs.

The tradeoff is that reducing the number of inputs causes an increase in the latency from four to five.

[Figure: rescheduled parallel dataflow diagram with inputs read two at a time (a and b, then c and d, then e and f)]

A latency of five violates the design requirement of a maximum latency of four clock cycles. In comparing the dataflow diagram above with the design requirements, we notice that the requirements allow a clock cycle that includes two additions and three inputs.

It appears that the parallel algorithm will not lead us to a design that satisfies the requirements.

We revisit the algorithm and try a serial algorithm:

z = ((((a + b) + c) + d) + e) + f

The corresponding dataflow diagram is shown below.

[Figure: serial dataflow diagram with inputs a, b, c in the first clock cycle, d and e in the second, and f in the third; intermediate values x1 to x4; output z]

2.7.5 Optimize Inputs and Outputs

When we rescheduled the parallel algorithm, we rescheduled the input values. This requires renegotiating the schedule of input values with our environment. Sometimes the environment of our circuit will be willing to reschedule the inputs, but in other situations the environment will impose a non-negotiable schedule upon us.

If you are currently storing all inputs and can change the environment’s behaviour to delay sending some inputs, then you can reduce the number of inputs and registers.

We will illustrate this on both the one-add and the two-add designs.

[Figure: one-add dataflow diagram before I/O optimization (inputs a to f all read in the first clock cycle) and after I/O optimization (a and b read first, then c, d, e, and f in successive clock cycles)]

         One-add before I/O opt   One-add after I/O opt
inputs   6                        2
regs     6                        2

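[Figure: two-add dataflow diagram before I/O optimization (inputs a to f all read in the first clock cycle) and after I/O optimization (a, b, c read first, then d and e, then f)]

         Two-add before I/O opt   Two-add after I/O opt
inputs   6                        3
regs     6                        3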

Design Comparison Between One and Two Add

[Figure: one-add dataflow diagram after I/O optimization beside two-add dataflow diagram after I/O optimization]



Hardware Recipe for Two-Add

We return now to the two-add design, with the dataflow diagram:

[Figure: two-add dataflow diagram with inputs a, b, c in the first clock cycle, d and e in the second, and f in the third; intermediate values x1 to x4; output z]

Based on the dataflow diagram, we can determine the hardware resources required for the datapath.

Table 2.2: Hardware Recipe for Two-Add

inputs                                3
adders                                2
registers                             3
output                                1
registered inputs                     YES
registered outputs                    YES
clock cycles from inputs to outputs   4

2.7.6 Input/Output Allocation

Our first step after settling on a hardware recipe is I/O allocation, because that determines the interface between our circuit and the outside world.

From the hardware recipe, we know that we need only three inputs and one output. However, we have six different input values. We need to allocate these input values to input signals before we can write a high-level model that performs the computation of our design.

Based on the input and output information in the hardware recipe, we can define our entity:

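    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;   -- provides the unsigned type

    entity massey is
      port (
        clk        : in  std_logic;
        i1, i2, i3 : in  unsigned(7 downto 0);
        o1         : out unsigned(7 downto 0)
      );
    end massey;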

[Figure 2.4: Dataflow diagram and hardware block diagram with I/O port allocation. Inputs a, b, c arrive on i1, i2, i3; d and e arrive on i2 and i3; f arrives on i2; z leaves on o1.]

Based upon the dataflow diagram after I/O allocation, we can write our first high-level model (hlm_v1).

In the high-level model the entire circuit will be implemented in a single process. For larger circuits it may be beneficial to have separate processes for different groups of signals.

In the high-level model, the code between wait statements describes the work that is done in a clock cycle.

The hlm_v1 architecture uses an implicit state machine.

Because the process is clocked, all of the signals that are assigned to in the process are registers. Combinational signals would need to be done using concurrent assignments or combinational processes.

    architecture hlm_v1 of massey is
      -- ... internal signal decls ...
    begin
      process begin
        wait until rising_edge(clk);
        a <= i1;
        b <= i2;
        c <= i3;
        wait until rising_edge(clk);
        x2 <= (a + b) + c;
        d  <= i2;
        e  <= i3;
        wait until rising_edge(clk);
        x4 <= (x2 + d) + e;
        f  <= i2;
        wait until rising_edge(clk);
        z  <= (x4 + f);
      end process;
      o1 <= z;
    end hlm_v1;
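To see the input schedule in action, a testbench along the following lines could drive the model. This is only a sketch: the testbench name, the 10 ns clock, and the operand values 1 to 6 are assumptions, and it presumes that the elided internal signals of hlm_v1 are declared as unsigned(7 downto 0).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity massey_tb is
end massey_tb;

architecture sim of massey_tb is
  signal clk        : std_logic := '0';
  signal i1, i2, i3 : unsigned(7 downto 0) := (others => '0');
  signal o1         : unsigned(7 downto 0);
begin
  clk <= not clk after 5 ns;  -- rising edges at 5, 15, 25, 35, ... ns

  dut : entity work.massey(hlm_v1)
    port map (clk => clk, i1 => i1, i2 => i2, i3 => i3, o1 => o1);

  stimulus : process begin
    -- first clock cycle: a, b, c on i1, i2, i3
    i1 <= to_unsigned(1, 8);
    i2 <= to_unsigned(2, 8);
    i3 <= to_unsigned(3, 8);
    wait until rising_edge(clk);
    -- second clock cycle: d on i2, e on i3
    i2 <= to_unsigned(4, 8);
    i3 <= to_unsigned(5, 8);
    wait until rising_edge(clk);
    -- third clock cycle: f on i2
    i2 <= to_unsigned(6, 8);
    wait until rising_edge(clk);
    -- fourth rising edge: the model registers z
    wait until rising_edge(clk);
    wait for 1 ns;
    -- 1 + 2 + 3 + 4 + 5 + 6 = 21
    assert o1 = to_unsigned(21, 8)
      report "unexpected sum on o1" severity error;
    wait;
  end process;
end sim;
```

Because the clock generator never stops, the simulation should be run for a bounded time (for example 100 ns). The model has no reset, so the testbench relies on the process starting at its first wait statement at time zero.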


2.7.7 Register Allocation

The next step after I/O allocation could be either register allocation or datapath allocation. The benefit of doing register allocation first is that it is possible to write VHDL code after register allocation is done but before datapath allocation is done, while the inverse (datapath done but register allocation not done) does not make sense if written in a hardware description language. In this example, we will do register allocation before datapath allocation, and show the resulting VHDL code.

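[Figure: dataflow diagram annotated with I/O ports (i1, i2, i3, o1) and registers (r1, r2, r3), beside the hardware block diagram with inputs i1, i2, i3, registers r1, r2, r3, two adders, and output o1]

I/O Allocation
i1   a
i2   b, d, f
i3   c, e
o1   z

Register Allocation
r1   a, x2, x4
r2   b, d, f
r3   c, e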

    architecture hlm_v2 of massey is
      -- ... internal signal decls ...
    begin
      process begin
        wait until rising_edge(clk);
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= (r1 + r2) + r3;
        r2 <= i2;
        wait until rising_edge(clk);
        r3 <= (r1 + r2);
      end process;
      o1 <= r3;
    end hlm_v2;

Figure 2.5: Block diagram after I/O and register allocation

2.7.8 Datapath Allocation

In datapath allocation, we allocate each of the data operations in the dataflow diagram to one of the datapath components in the hardware block diagram.

[Figure: dataflow diagram annotated with I/O ports (i1, i2, i3, o1), registers (r1, r2, r3), and adders (a1, a2), beside the hardware block diagram with inputs i1, i2, i3, registers r1, r2, r3, adders a1 and a2, and output o1]

I/O Allocation
i1   a
i2   b, d, f
i3   c, e
o1   z

Register Allocation
r1   a, x2, x4
r2   b, d, f
r3   c, e

Datapath Allocation
a1   x1, x3, z
a2   x2, x4

    architecture hlm_dp of massey is
      -- ... internal signal decls ...
    begin
      process begin
        wait until rising_edge(clk);
        r1 <= i1;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= a2;
        r2 <= i2;
        r3 <= i3;
        wait until rising_edge(clk);
        r1 <= a2;
        r2 <= i2;
        wait until rising_edge(clk);
        r3 <= a1;
      end process;
      -- combinational datapath
      a1 <= r1 + r2;
      a2 <= a1 + r3;
      o1 <= r3;
    end hlm_dp;

Figure 2.6: Block diagram after I/O, register, and datapath allocation


2.7.9 Datapath for DP+Ctrl Model

We will now evolve from an implicit state machine to an explicit state machine. The first step is to label the states in the dataflow diagram and then construct tables to find the values for chip-enable and mux-select signals.

[Figure: dataflow diagram with I/O, register, and datapath allocation, with the clock cycles labelled S0, S1, S2, S3, and then S0 again]

Datapath for DP+Ctrl Model

      r1             r2             r3
S0    ce=1, d=i1     ce=1, d=i2     ce=1, d=i3
S1    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3
S2    ce=1, d=a2     ce=1, d=i2     ce=–, d=–
S3    ce=–, d=–      ce=–, d=–      ce=1, d=a1

      a1                   a2
S0    src1=–, src2=–       src1=–, src2=–
S1    src1=r1, src2=r2     src1=a1, src2=r3
S2    src1=r1, src2=r2     src1=a1, src2=r3
S3    src1=r1, src2=r2     src1=–, src2=–

Choose Don’t-Care Values

      r1             r2             r3
S0    ce=1, d=i1     ce=1, d=i2     ce=1, d=i3
S1    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3
S2    ce=1, d=a2     ce=1, d=i2     ce=1, d=i3
S3    ce=1, d=a2     ce=1, d=i2     ce=1, d=a1

      a1                   a2
S0    src1=r1, src2=r2     src1=a1, src2=r3
S1    src1=r1, src2=r2     src1=a1, src2=r3
S2    src1=r1, src2=r2     src1=a1, src2=r3
S3    src1=r1, src2=r2     src1=a1, src2=r3

Simplify

      r1       r2 = i2    r3
S0    d=i1                d=i3
S1    d=a2                d=i3
S2    d=a2                d=i3
S3    d=a2                d=a1

a1    src1=r1, src2=r2
a2    src1=a1, src2=r3

VHDL Code

    architecture explicit_v1 of massey is
      -- One-hot state encoding
      subtype state_ty is std_logic_vector(3 downto 0);
      constant s0 : state_ty := "0001";
      constant s1 : state_ty := "0010";
      constant s2 : state_ty := "0100";
      constant s3 : state_ty := "1000";
      signal state : state_ty;
      -- (declarations of r_1, r_2, r_3, a_1, a_2 are not shown;
      --  reset is assumed to be an additional input port)
    begin

      ------------------------ r_1
      process (clk) begin
        if rising_edge(clk) then
          if state = S0 then
            r_1 <= i_1;
          else
            r_1 <= a_2;
          end if;
        end if;
      end process;

      ------------------------ r_2
      process (clk) begin
        if rising_edge(clk) then
          r_2 <= i_2;
        end if;
      end process;

      ------------------------ r_3
      process (clk) begin
        if rising_edge(clk) then
          if state = S3 then
            r_3 <= a_1;
          else
            r_3 <= i_3;
          end if;
        end if;
      end process;

      ------------------------ combinational datapath
      a_1 <= r_1 + r_2;
      a_2 <= a_1 + r_3;
      o_1 <= r_3;

      ------------------------ state machine
      process (clk) begin
        if rising_edge(clk) then
          if reset = '1' then
            state <= S0;
          else
            case state is
              when S0     => state <= S1;
              when S1     => state <= S2;
              when S2     => state <= S3;
              when S3     => state <= S0;
              when others => state <= S0;  -- illegal codes return to S0
            end case;
          end if;
        end if;
      end process;

    end explicit_v1;

2.7.10 Peephole Optimizations

Peephole optimizations are localized optimizations to code, in that they affect only a few lines of code. In hardware design, peephole optimizations are usually done to decrease the clock period, although some optimizations might also decrease area. There are many different types of optimizations, and many optimizations that designers do by hand are things that you might expect a synthesis tool to do automatically.

In a comparison such as state = S0, when we use a one-hot state encoding, we need to compare only one of the bits of the state. The comparison can be simplified to state(0) = '1'. Without this optimization, many synthesis tools will produce hardware that tests all of the bits of the state signal. This increases the area, because more bits are required as inputs to the comparison, and increases the clock period because the wider comparison leads to a tree-like structure of combinational logic, or an increased number of FPGA cells.

In this example, we will take advantage of our state encoding to optimize the code for r_1, r_3, and the state machine.

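    -- r_1
    process (clk) begin
      if rising_edge(clk) then
        if state = S0 then
          r_1 <= i_1;
        else
          r_1 <= a_2;
        end if;
      end if;
    end process;

    -- r_1 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if state(0) = '1' then
          r_1 <= i_1;
        else
          r_1 <= a_2;
        end if;
      end if;
    end process;

The code for r_2 remains unchanged.

    -- r_3
    process (clk) begin
      if rising_edge(clk) then
        if state = S3 then
          r_3 <= a_1;
        else
          r_3 <= i_3;
        end if;
      end if;
    end process;

    -- r_3 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if state(3) = '1' then
          r_3 <= a_1;
        else
          r_3 <= i_3;
        end if;
      end if;
    end process;

    -- state machine
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          case state is
            when S0     => state <= S1;
            when S1     => state <= S2;
            when S2     => state <= S3;
            when S3     => state <= S0;
            when others => state <= S0;
          end case;
        end if;
      end if;
    end process;

    -- state machine (optimized)
    -- NOTE: "st" = "state"
    -- The one-hot bit rotates one position per clock cycle.
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          st <= S0;
        else
          for i in 0 to 3 loop
            st((i+1) mod 4) <= st(i);
          end loop;
        end if;
      end if;
    end process;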

The hardware-block diagram that corresponds to the tables and VHDL code is:

[Figure: hardware block diagram with inputs i1, i2, i3; registers r1, r2, r3; adders a1 and a2; output o1; and a one-hot state register State(0) to State(3) with reset]

2.8 Design Example: Vanier

We’ll go through the following artifacts:

1. requirements

2. algorithm

3. dataflow diagram

4. high-level models

5. hardware block diagram

6. RTL code for datapath

7. state machine

8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)

2. I/O allocation

3. First high-level model


4. Register allocation

5. Datapath allocation

6. Connect datapath components, insert muxes where needed

7. Design implicit state machine

8. Optimize

9. Design explicit-current state machine

10. Optimize

2.8.1 Requirements

• Functional requirements: compute the following formula: output = (a × d) + c + (d × b) + b

• Performance requirement:

– Max clock period: flop plus (2 adds or 1 multiply)

– Max latency: 4

• Cost requirements

– Maximum of two adders

– Maximum of two multipliers

– Unlimited registers

– Maximum of three inputs and one output

– Maximum of 5000 student-minutes of design effort

• Registered inputs and outputs

2.8.2 Algorithm

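Create a data-dependency graph for the algorithm.

NOTE: If you draw the data-dependency graph with the inputs in alphabetical order, it’s ugly. The lesson is to think about the layout, and possibly redo the layout to make it simple and easy to understand, before proceeding.

[Figure: data-dependency graph for z = (a × d) + c + (d × b) + b, with the inputs laid out in the order a, d, b, c]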


2.8.3 Initial Dataflow Diagram

Schedule operations into clock cycles. Use an “as soon as possible” schedule, obeying the performance requirement of a maximum clock period of one multiply or two additions. In this initial diagram, we ignore the resource requirements. This allows us to establish a lower bound on the latency, which gives us the maximum performance that we can hope to achieve.

[Figure: initial “as soon as possible” dataflow diagram with inputs a, d, b, c and output z]

2.8.4 Reschedule to Meet Requirements

We have four inputs, but the requirements allow a maximum of three. We need to move one input into the second clock cycle. We want to choose an input that can be delayed by one clock cycle without violating a requirement and with minimal degradation of performance (clock period and latency).

If delaying an input by a clock cycle causes a requirement to be violated, we can often reschedule the operations to remove the violation. So, we sometimes create an intermediate dataflow diagram that violates a requirement, then reschedule the operations to bring the dataflow diagram back into compliance.

The critical path is from d and b, through a multiplier, the middle adder, the final adder, and then out through z. Because the inputs d and b are on the critical path, it would be preferable to choose another input (either a or c) as the input to move into the second clock cycle.

If we move c, we will move the first addition into the second clock cycle, which will force us to use three adders, which violates our resource requirement of a maximum of two adders.

By process of elimination, we have settled on a as our input to be delayed. This causes one of the multiply operations to be moved into the second clock cycle, which is good because it reduces our resources from two multipliers to just one.

[Figure: dataflow diagram with input a moved into the second clock cycle]

Moving a into the second clock cycle has caused a clock period violation, because our clock period is now a register, a multiply, and an add. This forces us to add an additional clock cycle, which gives us a latency of four.

[Figure: dataflow diagram with the additional clock cycle (latency of four)]

2.8.5 Optimize Resources

We can exploit the additional clock cycle to reschedule our operations to reduce the number of inputs from three to two. The disadvantage is that we have increased the number of registers from four to five.

[Figure: dataflow diagram rescheduled to use two inputs per clock cycle]

Two side comments:

• Moving the second addition from the third clock cycle to the second will not improve the performance or the area. The number of adders will remain at two, the number of registers will remain at five, and the clock period will remain at the maximum of a multiply or two additions.

• In hindsight, if we had chosen originally to move c, rather than a, into the second clock cycle, we would likely have produced this same dataflow diagram. After moving c, we would see the resource violation of three adders in the second clock cycle. This violation would cause us to add a third clock cycle, and give us an opportunity to move a into the second clock cycle. The lesson is that there are usually several different ways to approach a design problem, and it is infeasible to predict which approach will result in the best design. At best, we have many heuristics, or “rules of thumb”, that give us guidelines for techniques that usually work well.

Having finalized our input/output scheduling, we can write our entity. Note: we will add a reset signal later, when we design the state machine to control the datapath.


    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;   -- unsigned is used inside the architectures

    entity vanier is
      port (
        clk      : in  std_logic;
        i_1, i_2 : in  std_logic_vector(15 downto 0);
        o_1      : out std_logic_vector(15 downto 0)
      );
    end vanier;

2.8.6 Assign Names to Registered Values

We must assign a name to each registered value. Optionally, we may also assign names to combinational values. Registers require names, because in VHDL each register (except implicit state registers) is associated with a named signal. Combinational signals do not require names, because VHDL allows anonymous (unnamed) combinational signals. For example, in the expression (a+b)+c we do not need to provide a name for the sum of a and b.
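The difference between an anonymous and a named combinational value can be sketched as follows. The entity sum3 and the signal ab are hypothetical names used only for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example: the same sum written with and without
-- a name for the intermediate combinational value a + b.
entity sum3 is
  port (
    a, b, c         : in  unsigned(7 downto 0);
    z_anon, z_named : out unsigned(7 downto 0)
  );
end sum3;

architecture sketch of sum3 is
  signal ab : unsigned(7 downto 0);  -- named combinational value
begin
  -- anonymous: the sum a + b has no signal name
  z_anon  <= (a + b) + c;

  -- named: the same sum is given the name ab
  ab      <= a + b;
  z_named <= ab + c;
end sketch;
```

Both outputs describe the same combinational function; naming ab is mainly useful when the value has to be observed in simulation or shared between several expressions.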

If a single value spans multiple clock cycles, it only needs to be named once. In our example x_1, x_2, and x_4 each cross two boundaries.

[Figure: dataflow diagram with the registered values named x1 to x8]


2.8.7 Input/Output Allocation

Now that we have names for all of our registered signals, we can allocate input and output ports to signals.

After the input and output ports have been allocated to signals, we can write our first model. We use an implicit state machine and define only the registered values. In each state, we define the values of the registered values that are computed in that state.

[Figure: dataflow diagram with I/O port allocation (i1, i2, o1) and registered values x1 to x8]

    architecture hlm_v1 of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6,
             x_7, x_8 : unsigned(15 downto 0);
    begin
      process begin
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_8 <= x_6 + (x_4 + x_7);
      end process;
      o_1 <= std_logic_vector(x_8);
    end hlm_v1;
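A testbench sketch for this model is given below. The operand-to-port mapping is inferred from the code rather than stated in the text: x_1 takes part in both multiplications, so it is taken to be d; x_2 is multiplied by d and also added, so it is taken to be b; that leaves a for x_3 and c for x_5. The testbench name, clock period, and operand values are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity vanier_tb is
end vanier_tb;

architecture sim of vanier_tb is
  signal clk      : std_logic := '0';
  signal i_1, i_2 : std_logic_vector(15 downto 0) := (others => '0');
  signal o_1      : std_logic_vector(15 downto 0);
begin
  clk <= not clk after 5 ns;  -- rising edges at 5, 15, 25, 35, ... ns

  dut : entity work.vanier(hlm_v1)
    port map (clk => clk, i_1 => i_1, i_2 => i_2, o_1 => o_1);

  stimulus : process begin
    -- first clock cycle: d = 3 on i_1, b = 5 on i_2 (inferred mapping)
    i_1 <= std_logic_vector(to_unsigned(3, 16));
    i_2 <= std_logic_vector(to_unsigned(5, 16));
    wait until rising_edge(clk);
    -- second clock cycle: a = 2 on i_1, c = 4 on i_2 (inferred mapping)
    i_1 <= std_logic_vector(to_unsigned(2, 16));
    i_2 <= std_logic_vector(to_unsigned(4, 16));
    wait until rising_edge(clk);
    wait until rising_edge(clk);
    -- fourth rising edge: the model registers x_8
    wait until rising_edge(clk);
    wait for 1 ns;
    -- (a * d) + c + (d * b) + b = 6 + 4 + 15 + 5 = 30
    assert unsigned(o_1) = to_unsigned(30, 16)
      report "unexpected result on o_1" severity error;
    wait;
  end process;
end sim;
```

The model multiplies only the low eight bits of each operand, so the test values are kept below 256. As with any free-running clock, the simulation should be run for a bounded time.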

[Figure: dataflow diagram with I/O port allocation (i1, i2, o1) and registered values x1 to x8, with timing diagrams over clock cycles 0 to 5 for the values x1 to x8, the registers r1 to r5, and the inputs i1 and i2]

The model hlm_v1 is synthesizable. If we are happy with the clock speed and area, we can stop now! The remaining steps of the design process seek to optimize the design by reducing the area and clock period. For area, we will reduce the number of registers, datapath components, and multiplexers. Reducing the clock period will occur as we reduce the number of multiplexers and potentially perform peephole (localized) optimizations, such as Boolean simplification.


2.8.8 Tangent: Combinational Outputs

To demonstrate a high-level model where the output is combinational, we modify hlm_v1 so that the output is combinational, rather than a register (see hlm_v1c). To make the output (x_8) combinational, we move the assignment to x_8 out of the main clocked process and into a concurrent statement.

    architecture hlm_v1c of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7
             : unsigned(15 downto 0);
    begin
      process begin
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        ------------------------------
        wait until rising_edge(clk);
        ------------------------------
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
      end process;
      o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
    end hlm_v1c;

[Figure: dataflow diagram for hlm_v1c with I/O port allocation (i1, i2, o1) and registered values x1 to x7; the output z is combinational]

2.8.9 Register Allocation

Our previous model (hlm v1 ) uses eight registers (x 1. . .x 8). However, our analysis of thedataflow diagrams says that we can implement the diagram withjust five registers. Also, the codefor hlm v1 contains two occurrences of the multiplication symbol (* ) and three occurrences of theaddition symbol (+). Our analysis of the dataflow diagram showed that we need only one multiplyand two adds. Inhlm v1 we are relying on the synthesis tool to recognize that even though thecode contains two multiplies and three adds, the hardware needs only one multiply and two adds.

Register allocation is the task of assigning each of our registered values to a register signal. Dat-apath allocation is the task of assigning each datapath operation to a datapath component. Onlyhigh-level synthesis tools (and software compilers) do register allocation. So, as hardware design-ers, we are stuck with the task of doing register allocation ourselves if we want to further optimizeour design. Some register-transfer-level synthesis toolsdo datapath allocation. If your synthesistool does datapath allocation, it is important to learn the idioms and limitations of the tool so thatyou can write your code in a style that allows the tool to do a good job of allocation and optimiza-tion. In most cases where area or clock speed are important design metrics, design engineers dodatapath allocation by hand or ad-hoc software and spreadsheets.

We will now step through the tasks of register allocation anddatapath allocation. In our eight-register model, each register holds a unique value — we do notreuse registers. To reduce thenumber of registers from eight to five, we will need to reuse registers, so that a register potentiallyholds different values in different clock cycles.

When doing register allocation, we assign a register to eachsignal that crosses a clock cycle bound-ary. When creating the hardware block diagram, we will need to add multiplexers to the inputs ofmodules that are connected to multiple registers. To reducethe number of multiplexers, we try toallocate the same registers to the same inputs of the same type of module. For example,x 7 is aninput to an adder, we allocater 5 to x 7, becauser 5 was also an input to an adder in anotherclock cycle. Also in the third clock cycle, we allocater 2 to x 6, because in the second clockcycle, the inputs to an adder werer 2 andr 5. In the last clock cycle, we allocater 5 to x 8,because previouslyr 5 was used as the output ofr 2 + r 5.

We update our model to reflect register allocation, by replacing the signals for registered values (x_1 ... x_8) with the registers r_1 ... r_5.


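[Figure: dataflow diagram with register allocation, with inputs i1 and i2 and output o1. Register assignments at each clock cycle boundary: x1, x2 → r1, r2; x3, x4, x5 → r3, r4, r5; x6, x7 → r2, r5; x8 → r5.]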
architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    ------------------------------
    wait until rising_edge(clk);
    ------------------------------
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;

Both of our models so far (hlm_v1 and hlm_v2) have used implicit state machines. The optimization from hlm_v1 to hlm_v2 was done to reduce the number of registers by performing register allocation. Most of the remaining optimizations require an explicit state machine. We will construct an explicit state machine using a methodical procedure that gradually adds more information to the dataflow diagram. The first step in this procedure is datapath allocation, which is similar to register allocation, except that we allocate datapath components to datapath operations, rather than allocate registers to names.

To control the datapath, we need to provide the following signals for registers and datapath components:

registers: chip-enable and mux-select signals

datapath components: instruction (e.g. add, sub, etc. for ALUs) and mux-select signals

After we determine the chip-enable, mux-select, and instruction signals, and then calculate what value each signal needs in each clock cycle, we can build the explicit state machine to control the datapath.
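The register control signals described above can be sketched as a single reusable register. This is a minimal illustration rather than code from the design: the entity name reg_ce_mux, the port names, and the 16-bit width are all assumptions made for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper (not part of the vanier design): one datapath
-- register with a chip-enable and a two-way input multiplexer.
entity reg_ce_mux is
  port (
    clk : in  std_logic;
    ce  : in  std_logic;                      -- '1' = load, '0' = hold
    sel : in  std_logic;                      -- '0' = load d0, '1' = load d1
    d0  : in  std_logic_vector(15 downto 0);
    d1  : in  std_logic_vector(15 downto 0);
    q   : out std_logic_vector(15 downto 0)
  );
end reg_ce_mux;

architecture main of reg_ce_mux is
begin
  process (clk) begin
    if rising_edge(clk) then
      if ce = '1' then          -- chip-enable: hold the old value when '0'
        if sel = '0' then       -- mux-select: choose the data source
          q <= d0;
        else
          q <= d1;
        end if;
      end if;
    end if;
  end process;
end main;
```

A register that only ever loads from one source would drop sel and d1, and a register that never has to hold its value would drop ce.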

After we build the state machine, we will add a reset to the design.


2.8.10 Datapath Allocation

In datapath allocation, we allocate an adder (either a1 or a2) to each addition operation and a multiplier (either m1 or m2) to each multiplication operation. As with register allocation, we attempt to reduce the number of multiplexers that will be required by connecting the same datapath component to the same register in multiple clock cycles.

[Figure: dataflow diagram with register and datapath allocation. The register allocation is as before (r1 to r5); the multiplications are allocated to m1 and the additions to a1 and a2.]

2.8.11 Hardware Block Diagram and State Machine

To build an explicit state machine, we first determine what states we need. In this circuit, we need four states, one for each clock cycle in the dataflow diagram. If our algorithmic description had included control flow, such as loops and branches, then it would have been more difficult to determine the states that are needed.

We will use four states: S0..S3, where S0 corresponds to the first clock cycle (during which theinput is read) and S3 corresponds to the last clock cycle.

2.8.11.1 Control for Registers

To determine the chip-enable and mux-select signals for the registers, we build a table where each state corresponds to a row and each register corresponds to a column.

For each register and each state, we note whether the register loads in a new value (ce) and what signal is the source of the loaded data (d).

        r1        r2        r3        r4        r5
      ce  d     ce  d     ce  d     ce  d     ce  d
S0    1   i1    1   i2    –   –     –   –     –   –
S1    0   –     0   –     1   i1    1   m1    1   i2
S2    –   –     1   m1    –   –     0   –     1   a1
S3    –   –     –   –     –   –     –   –     1   a1


Eliminate unnecessary chip enables and muxes.

• A chip enable is needed if a register must hold a single value for multiple clock cycles (ce=0).

• A multiplexer is needed if a register loads in values from different sources in different clock cycles.

The register simplifications are as follows:

r1: Chip-enable, because S1 has ce=0. No multiplexer, because i1 is the only input.

r2: Chip-enable, because S1 has ce=0. Multiplexer to choose between i2 and m1.

r3: No chip-enable, no multiplexer. The register r3 simplifies to be just r3=i1 without a multiplexer or chip-enable, because there is only one state where we care about its behaviour (S1); all of the other states are don't cares for both chip-enable and mux.

r4: Chip-enable, because S2 has ce=0. No multiplexer, because m1 is the only input.

r5: No chip-enable, because it does not have any states with ce=0. Multiplexer to choose between i2 and a1.

The simplified register table is shown below. For registers that do not have multiplexers, we show their input on the top row. For registers that need neither a chip enable nor a mux (e.g. r3), we write the assignment in the first row and leave the other rows blank.

      r1=i1    r2         r3=i1    r4=m1    r5
      ce       ce  d               ce       d
S0    1        1   i2              –        –
S1    0        0   –               1        i2
S2    –        1   m1              0        a1
S3    –        –   –               –        a1

The chip-enable and mux-select signals that are needed for the registers are: r1_ce, r2_ce, r2_sel, r4_ce, and r5_sel.

2.8.11.2 Control for Datapath Components

Analogous to the table for registers, we build a table for the datapath components. Each of our components has two inputs (src1 and src2). Each component performs a single operation (either addition or multiplication), so we do not need to define operation or instruction signals for the datapath components.

        a1            a2            m1
      src1  src2    src1  src2    src1  src2
S0    –     –       –     –       –     –
S1    –     –       –     –       r1    r2
S2    r2    r5      –     –       r3    r1
S3    r2    a2      r4    r5      –     –


Based on the table above, the adder a1 will need a multiplexer for src2. The multiplier m1 will need two multiplexers: one for each input.

Note that addition and multiplication are commutative, so we can choose which signal goes to src1 and which to src2 so as to minimize the need for multiplexers.

We notice that for m1, we can reduce the number of multiplexers from 2 to 1 by swapping the operands of the second multiplication (in state S2). This makes r1 the only source of operands for the src1 input. This optimization is reflected in the table below.

        a1            a2            m1
      src1  src2    src1  src2    src1  src2
S0    –     –       –     –       –     –
S1    –     –       –     –       r1    r2
S2    r2    r5      –     –       r1    r3
S3    r2    a2      r4    r5      –     –

The mux-select signals for the datapath components are: a1_src2_sel and m1_src2_sel.

2.8.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor, S0 → S1 → S2 → S3 → S0 ...

2.8.11.4 Complete State Machine Table

The state machine table is shown below. Note that the state signal is a register; the table shows the next value of the signal.

      r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0    1      1      i2      –      –       –            –            S1
S1    0      0      –       1      i2      –            r2           S2
S2    –      1      m1      0      a1      r5           r3           S3
S3    –      –      –       –      a1      a2           –            S0

We now choose instantiations for the don't care values so as to simplify the circuitry. Different state encodings will lead to different simplifications. For fully-encoded states, Karnaugh maps are helpful in doing simplifications. For a one-hot state encoding, it is usually better to create situations where conditions are based upon a single state. The reason for this heuristic with one-hot encodings will be clear when we get to explicit_v2.


r1_ce: We first choose 0 as the don't care instantiation, because that leaves just one state where we need to load. Additionally, it is conceptually cleaner to do an assignment in just the one clock cycle where we care about the value, rather than not do an assignment in the one clock cycle where we must hold the value. (At the end of the don't care allocation, we'll revisit this decision and change our mind.)

r2_ce: We choose 1 for S3, so that we have just one state where we do not do a load. If we had chosen 0 for r2_ce in S3, we would have two states where we do a load and two where we do not load. If we were using fully-encoded states, this even separation might have left us with a very nice Karnaugh map; or it might have left us with a Karnaugh map that has a checkerboard pattern, which would not simplify. This helps illustrate why state encoding is a difficult problem.

r2_sel: We choose m1 arbitrarily. The choice of i2 would have also resulted in three assignments from one signal and one assignment from the other signal.

r4_ce: We choose 0 as we did for r1_ce.

r5_sel: Choose a1 so that we have three assignments from the same signal and just one assignment from the other signal.

a1_src2: Choose a2 arbitrarily.

m1_src2: Choose r3 arbitrarily.

r1_ce (again): We examine r1_ce and r2_ce and see that if we choose 1 for the don't care instantiation of r1_ce, we will have the same choices for both chip enables. This will simplify our state machine. Also, r4_ce is the negation of r2_ce, so we can use just an inverter to control r4_ce.

      r1_ce  r2_ce  r2_sel  r4_ce  r5_sel  a1_src2_sel  m1_src2_sel  state
S0    1      1      i2      0      a1      a2           r3           S1
S1    0      0      m1      1      i2      a2           r2           S2
S2    1      1      m1      0      a1      r5           r3           S3
S3    1      1      m1      0      a1      a2           r3           S0
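With a one-hot state encoding in which bit i of the state vector is set in state Si, each column of the completed table reduces to a single state bit or its inverse. The sketch below spells that out as explicit control signals. It is an illustration only: the entity name vanier_ctrl and the output names (which encode the mux choice as a single bit) are assumptions, not signals defined elsewhere in these notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical control decoder derived from the completed state machine
-- table, assuming a one-hot state where state(i) = '1' in state Si.
entity vanier_ctrl is
  port (
    state      : in  std_logic_vector(3 downto 0);
    r1_ce      : out std_logic;  -- '1' = r1 loads i1
    r2_ce      : out std_logic;  -- '1' = r2 loads
    r2_sel_i2  : out std_logic;  -- '1' = r2 source is i2, '0' = m1
    r4_ce      : out std_logic;  -- '1' = r4 loads m1
    r5_sel_i2  : out std_logic;  -- '1' = r5 source is i2, '0' = a1
    a1_src2_r5 : out std_logic;  -- '1' = a1 src2 is r5, '0' = a2
    m1_src2_r2 : out std_logic   -- '1' = m1 src2 is r2, '0' = r3
  );
end vanier_ctrl;

architecture main of vanier_ctrl is
begin
  r1_ce      <= not state(1);  -- load in every state except S1
  r2_ce      <= not state(1);  -- same choice as r1_ce
  r2_sel_i2  <= state(0);      -- i2 only in S0
  r4_ce      <= state(1);      -- negation of r2_ce
  r5_sel_i2  <= state(1);      -- i2 only in S1
  a1_src2_r5 <= state(2);      -- r5 only in S2
  m1_src2_r2 <= state(1);      -- r2 only in S1
end main;
```

Every output depends on exactly one state bit, which is the situation the one-hot heuristic above is aiming for.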

2.8.12 VHDL Code with Explicit State Machine

VHDL code can be written directly from the tables and the dataflow diagram that shows register allocation, input allocation, and datapath allocation. As a simplification, rather than write explicit signals for the chip-enable and mux-select signals, we use selected and conditional assignment statements that test the state in the condition.

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.


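architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ---------------------- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ---------------------- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ---------------------- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ---------------------- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ---------------------- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ---------------------- combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;
  ---------------------- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0     => state <= S1;
          when S1     => state <= S2;
          when S2     => state <= S3;
          when others => state <= S0;
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;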

The hardware-block diagram that corresponds to the tables and VHDL code is:


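[Figure: the allocated dataflow diagram repeated beside the hardware block diagram. The block diagram contains registers r1 to r5, multiplier m1, and adders a1 and a2, with inputs i1 and i2 and output o1; the state machine cycles S0 → S1 → S2 → S3 → S0.]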

2.8.13 Peephole Optimizations

We will illustrate several peephole optimizations that take advantage of our state encoding.



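-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;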

Analogous optimizations can be used when comparing against multiple states:



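-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;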

Next-state assignment for a one-hot state machine can be done with a simple shift register:

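-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0     => state <= S1;
        when S1     => state <= S2;
        when S2     => state <= S3;
        when others => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;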


The resulting optimized code is shown below.

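architecture explicit_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001"; constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100"; constant s3 : state_ty := "1000";
  signal state : state_ty;
begin
  ---------------------- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ---------------------- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '0' then
        if state(0) = '1' then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ---------------------- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ---------------------- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ---------------------- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state(1) = '1' then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ---------------------- combinational datapath
  a1_src2 <= r_5 when state(2) = '1'
             else a_2;
  m1_src2 <= r_2 when state(1) = '1'
             else r_3;
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;
  ---------------------- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        for i in 0 to 3 loop
          state( (i+1) mod 4 ) <= state(i);
        end loop;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v2;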


2.8.14 Notes and Observations

Our functional requirements were written as:

output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

output = (a × d) + b + (d × b) + c

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:

[Figure: naive data dependency graphs for the two formulations, each with inputs a, b, c, d and output z. Original: (a × d) + (d × b) + b + c. Alternative: (a × d) + c + (d × b) + b.]

An observation: it can be helpful to explore several equivalent formulations of the mathematical equations while constructing the data dependency graph. A mathematical formulation that places occurrences of the same identifier close to each other often results in a simpler data dependency graph. The simpler the data dependency graph, the easier it will be to identify helpful optimizations and efficient schedules.


2.9 Pipelining

Pipelining is one of the most common and most effective performance optimizations in hardware. Pipelining is used in systems ranging from simple signal-processing filters to high-performance microprocessors. Pipelining increases performance by overlapping the execution of multiple instructions or parcels of data, analogous to the way that multiple cars flow through an automobile assembly line.

Pipelines are difficult to design and verify, because subtle bugs can arise from the interactions between instructions flowing through the pipeline. There are intended interactions, which must happen correctly, and there might be unintended interactions which constitute bugs. Computer architects categorize the interactions between instructions according to three principles: structural hazards, control hazards, and data hazards. Our examples will all be pure datapath pipelines without any data or control dependencies between parcels of data. This eliminates most of the complexities of implementing pipelines correctly.

2.9.1 Introduction to Pipelining

Review of unpipelined dataflow diagram

As a quick review of an unpipelined (also called “sequential”) dataflow diagram we revisit the one-add example from section 2.6.3.

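[Figure: unpipelined dataflow diagram and waveform for the one-add example, with inputs a to f and output z over clock cycles 0 to 5. The single adder add1 is reused in every clock cycle with registers r1 and r2. In the waveform (signals clk, a, r1, z), parcel β enters the circuit only after parcel α has finished.]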

The key feature to notice, in comparison to a pipelined dataflow diagram, is that the second parcel (β) begins execution only after the first parcel (α) has finished executing.

Pipelined dataflow diagram

In a pipeline, each stage is a separate circuit, in that we cannot reuse the same component in multiple stages. When drawing a pipelined dataflow diagram, we effectively have multiple dataflow diagrams: one for each stage. As a notational shorthand to avoid drawing multiple dataflow diagrams, we introduce a new bit of notation: a double line denotes a boundary between stages. We perform scheduling, resource allocation, and all of the other design steps individually for each stage.

Our first example of a pipelined dataflow diagram is a fully pipelined version of the previous example. In a fully pipelined dataflow diagram, each clock cycle becomes a separate stage. Notationally, we simply replace the single-line clock cycle boundaries with double-line stage boundaries.

[Figure: fully pipelined dataflow diagram and waveform, with inputs a to f and output z. Each of the five stages (stage 1 to stage 5) has its own adder (add1 to add5) and its own registers (r1 to r10). The waveform shows parcels α and β moving through the stage registers r1, r3, r5, r7, and r9 to the output z.]

Sequential (Unpipelined) Hardware

The hardware for the unpipelined dataflow diagram contains two registers, one adder, a multiplexer, and a state machine to control the multiplexer. When the data is produced by the adder at the end of each clock cycle, it is fed back to the multiplexer as a value for the next clock cycle.

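[Figure: unpipelined hardware. One adder (add1) fed by registers r1 and r2, with inputs i1 and i2, output o1, and a state machine with state bits State(0) to State(4) and a reset.]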

Pipelined Hardware and VHDL Code

The hardware for the pipelined dataflow diagram contains two registers and one adder for each stage. The registers and adders do the same thing in each clock cycle, so there is no need for chip-enables, multiplexers, or a state machine.

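[Figure: pipelined hardware with five stages. Each stage has two registers and one adder (r1, r2, add1 in stage 1 through r9, r10, add5 in stage 5). Inputs i1 and i2 enter stage 1, inputs i3 to i6 enter stages 2 to 5, and o1 is the output of add5.]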

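-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2; r4 <= i3;
end process;
-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4; r6 <= i4;
end process;
-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6; r8 <= i5;
end process;
-- stage 5
process begin
  wait until rising_edge(clk);
  r9 <= r7 + r8; r10 <= i6;
end process;
-- output
o1 <= r9 + r10;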

The VHDL code above is designed to be easy to read by matching the structure of the hardware. An alternative style is to be more concise by grouping all of the registered assignments in a single clocked process as shown below. The two styles are equivalent with respect to simulation and synthesis.

-- group all registered assignments into a single process
process begin
  wait until rising_edge(clk);
  r1 <= i1;      r2  <= i2;
  r3 <= r1 + r2; r4  <= i3;
  r5 <= r3 + r4; r6  <= i4;
  r7 <= r5 + r6; r8  <= i5;
  r9 <= r7 + r8; r10 <= i6;
end process;
o1 <= r9 + r10;


2.9.2 Partially Pipelined

The previous section illustrated a fully pipelined circuit, which means that the circuit could accept a new parcel every clock cycle. Sometimes we want to sacrifice performance (throughput) in order to reduce area. We can do this by having a throughput that is less than one parcel per clock cycle and reusing some hardware. A pipeline that has a throughput of less than one is said to be partially pipelined.

If a pipeline is essentially two pipelines running in parallel, then it is said to be superscalar and will usually have a throughput that is more than one parcel per clock cycle. A superscalar pipeline that has n pipelines in parallel is said to be n-way superscalar and has a maximum throughput of n parcels per clock cycle.

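[Figure: partially pipelined dataflow diagram and waveform, with inputs a to f and output z. The five additions are allocated to add1, add1, add2, add2, and add3 across three stages (stage 1 to stage 3), using registers r1 to r6. The waveform shows parcels α and β in the stage registers r1, r3, and r5 and at the output z.]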

Hardware for Partially Pipelined

[Figure: partially pipelined hardware with three stages. Stage 1 contains add1 with registers r1 and r2, stage 2 contains add2 with registers r3 and r4, and stage 3 contains add3 with registers r5 and r6. The design has inputs i1 and i2, output o1, and a state machine with state bits State(0) and State(1) and a reset.]

2.9.3 Terminology

Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.

Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.

Definition Throughput: The number of parcels consumed or produced per clock cycle.

Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage 1 is upstream from stage 2.

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a “bubble”.

Question: How do we know whether the output of the pipeline is a bubble or is valid data?

Answer: Add one register per stage to hold a valid bit. If valid='0', then the pipe stage contains a bubble.
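The valid-bit answer above can be sketched as a small shift register that travels alongside the data. This is a minimal illustration under stated assumptions: the entity name valid_pipe, the depth generic, and the synchronous reset port are hypothetical and chosen only for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical valid-bit chain: one flip-flop per pipeline stage.
-- v(k) = '1' means stage k holds a valid parcel; '0' means a bubble.
entity valid_pipe is
  generic ( depth : positive := 3 );   -- number of pipeline stages
  port (
    clk     : in  std_logic;
    reset   : in  std_logic;           -- synchronous, fills the pipe with bubbles
    i_valid : in  std_logic;           -- '1' when a new parcel enters stage 1
    o_valid : out std_logic            -- '1' when the last stage holds a parcel
  );
end valid_pipe;

architecture main of valid_pipe is
  signal v : std_logic_vector(1 to depth);
begin
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        v <= (others => '0');
      else
        v(1) <= i_valid;
        for k in 2 to depth loop       -- empty range when depth = 1
          v(k) <= v(k-1);
        end loop;
      end if;
    end if;
  end process;
  o_valid <= v(depth);
end main;
```

Because the valid bits advance in lock step with the data registers, o_valid lines up with the parcel that entered depth clock cycles earlier.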

2.10 Design Example: Pipelined Massey

In this section, we revisit the Massey example from section 2.7, but now do it with a pipelined implementation. To allow us to implement a pipelined design, we need to relax our resource requirements. Originally, we were allowed two adders and three inputs. For the pipeline, we will allow ourselves six inputs and five adders. There are six input values and five additions in the dataflow diagram, so these requirements will enable us to build a fully pipelined implementation. If we were forced to reuse a component (e.g., a maximum of two adders), then we would need to build a partially pipelined circuit.

To stay within the normal design rules for pipelines, we will register our inputs but not our outputs.

In summary, the requirements are:

Requirements

Functional requirements:

• Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f

• Registered inputs, combinational outputs

Performance requirements:

• Maximum clock period: unlimited

• Maximum latency: four

Cost requirements:

• Maximum of five adders

• Small miscellaneous hardware (e.g. muxes) is unlimited

• Maximum of six inputs and one output

• Design effort is unlimited

Initial Dataflow Diagrams

Our goal is to first maximize performance and then minimize area within the bounds of the requirements. To maximize performance, we want a throughput of one and a minimum clock period. Revisiting the dataflow diagrams from the unpipelined Massey, we find the two diagrams below as promising candidates for the pipelined Massey.

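[Figure: two candidate dataflow diagrams, each with inputs a to f, five adders, and output z. “Original dataflow” reads a, b, c, d first and e, f one clock cycle later. “Final unpipelined dataflow” reads a, b, c first, then d, e, then f.]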

For the unpipelined design, we rejected the original dataflow diagram because it violated the resource requirement of a maximum of three inputs. If we fully pipeline the design, both dataflow diagrams will use six inputs and five adders. The first diagram uses ten registers, while the second uses eight (remember, there is no reuse of components in a fully pipelined design). However, the first dataflow diagram has a shorter clock period, and so will lead to higher performance. Because our primary goal is to maximize performance, we will pursue the first dataflow diagram.

Dataflow Diagram Exploration

As a variation of the first dataflow diagram, we reschedule all of the inputs to be read in the first clock cycle.


[Figure: variation on original dataflow, with all six inputs a to f read in the first clock cycle, five adders, and output z.]

The variation has the disadvantage of using one additional register. However, it has the potential advantage of a simpler interface to the upstream environment, because all of the inputs are provided at the same time. Conversely, this rescheduling would be a disadvantage if the upstream environment was optimized to take advantage of the fact that e and f are produced one clock cycle later than the other values. We do not know anything about the upstream environment, and so will reject this variation, because it increases the number of registers that we need.

As we said before, to maximize performance, we will fully pipeline the design, so every clock cycle boundary becomes a stage boundary. At this time, we also add a valid bit to keep track of whether a stage has a bubble or a valid parcel.

Pipelined dataflow diagram

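[Figure: inputs a, b, c, d are read in the first stage and e, f in the second; five adders produce the output z; the valid bit enters as i_valid and leaves as o_valid.]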

VHDL Code

For this simple example, there are no further optimizations, and we can write the VHDL code directly from the dataflow diagram.

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1; r2 <= i2; r3 <= i3; r4 <= i4; v1 <= i_valid;
end process;
a1 <= r1 + r2; a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1; r6 <= a2; r7 <= i5; r8 <= i6; v2 <= v1;
end process;
a3 <= r5 + r6; a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9 <= a3; r10 <= a4; v3 <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z <= a5;
o_valid <= v3;
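To try the pipelined Massey code in a simulator, it needs an entity around it and some stimulus. The sketch below is one hedged way to do that. The entity name massey_pipe, the testbench, the 8-bit unsigned ports (so the sum wraps modulo 256), and the '0' initial values on the valid bits are assumptions made for the example and are not part of the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper around the three-stage pipeline shown above.
entity massey_pipe is
  port (
    clk, i_valid           : in  std_logic;
    i1, i2, i3, i4, i5, i6 : in  unsigned(7 downto 0);
    z                      : out unsigned(7 downto 0);
    o_valid                : out std_logic
  );
end massey_pipe;

architecture main of massey_pipe is
  signal r1, r2, r3, r4, r5, r6, r7, r8, r9, r10 : unsigned(7 downto 0);
  signal a1, a2, a3, a4, a5 : unsigned(7 downto 0);
  signal v1, v2, v3 : std_logic := '0';  -- assumed: start with bubbles
begin
  process (clk) begin
    if rising_edge(clk) then
      r1 <= i1; r2 <= i2; r3 <= i3; r4 <= i4; v1 <= i_valid;  -- stage 1
      r5 <= a1; r6 <= a2; r7 <= i5; r8 <= i6; v2 <= v1;        -- stage 2
      r9 <= a3; r10 <= a4; v3 <= v2;                           -- stage 3
    end if;
  end process;
  a1 <= r1 + r2; a2 <= r3 + r4;
  a3 <= r5 + r6; a4 <= r7 + r8;
  a5 <= r9 + r10;
  z <= a5;
  o_valid <= v3;
end main;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity massey_pipe_tb is
end massey_pipe_tb;

architecture sim of massey_pipe_tb is
  signal clk     : std_logic := '0';
  signal i_valid : std_logic := '0';
  signal i1, i2, i3, i4, i5, i6 : unsigned(7 downto 0) := (others => '0');
  signal z       : unsigned(7 downto 0);
  signal o_valid : std_logic;
  signal done    : boolean := false;
begin
  clk <= not clk after 5 ns when not done else '0';

  uut : entity work.massey_pipe
    port map (clk => clk, i_valid => i_valid,
              i1 => i1, i2 => i2, i3 => i3, i4 => i4, i5 => i5, i6 => i6,
              z => z, o_valid => o_valid);

  process begin
    -- first clock cycle of the parcel: a, b, c, d
    i1 <= to_unsigned(1, 8); i2 <= to_unsigned(2, 8);
    i3 <= to_unsigned(3, 8); i4 <= to_unsigned(4, 8);
    i_valid <= '1';
    wait until rising_edge(clk);
    -- second clock cycle of the same parcel: e, f arrive one cycle later
    i5 <= to_unsigned(5, 8); i6 <= to_unsigned(6, 8);
    i_valid <= '0';
    wait until rising_edge(clk);
    wait until rising_edge(clk);
    wait for 1 ns;  -- let the combinational output settle
    assert o_valid = '1' and z = 21
      report "unexpected pipeline output" severity error;
    done <= true;
    wait;
  end process;
end sim;
```

The stimulus presents e and f one clock cycle after a to d, which matches the schedule of the chosen dataflow diagram, and checks the sum three clock edges after the parcel enters.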


2.11 Memory Arrays and RTL Design

2.11.1 Memory Operations

Read of Memory with Registered Inputs

[Figure: memory block M with ports WE, A, DI, and DO connected to the signals we, a, and do, clocked by clk. Timing diagram: the address αa is presented on a with we deasserted; after the clock edge, do carries αd = M(αa).]

Write to Memory with Registered Inputs

[Figure: memory block M with ports WE, A, DI, and DO connected to the signals we, a, di, and do, clocked by clk. Timing diagram: the address αa and data αd are presented on a and di with we asserted; after the clock edge, M(αa) holds αd and do is undefined (U).]

Dual-Port Memory with Registered Inputs

[Figure: dual-port memory block M with ports WE, A0, DI0, DO0, A1, and DO1 connected to the signals we, a0, di0, do0, a1, and do1, clocked by clk. Timing diagram: port 0 writes αd to address αa while port 1 reads address βa; after the clock edge, M(αa) holds αd, do1 carries βd = M(βa), and do0 is undefined (U).]

Sequence of Memory Operations

[Figure: timing diagram of a sequence of operations on the dual-port memory, with addresses αa, βa, γa, and θa applied to a0 and a1 and the corresponding data values appearing on di0, do0, and do1.]

2.11.2 Memory Arrays in VHDL

2.11.2.1 Using a Two-Dimensional Array for Memory

A memory array can be written in VHDL as a two-dimensional array:

subtype data is std_logic_vector(7 downto 0);
type data_vector is array( natural range <> ) of data;
signal mem : data_vector(31 downto 0);

These two-dimensional arrays can be useful in high-level models and in specifications. However, it is possible to write code using a two-dimensional array that cannot be synthesized. Also, some synthesis tools (including Synopsys Design Compiler and FPGA Compiler) will synthesize two-dimensional arrays very inefficiently.

The example below illustrates: lack of interface protocol, combinational write, multiple write ports, multiple read ports.


architecture main of mem_not_hw is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
  signal mem : data_vector(31 downto 0);
begin
  y <= mem( a );          -- comb read
  mem( a ) <= b;          -- comb write
  process (clk) begin
    if rising_edge(clk) then
      mem( c ) <= w;      -- write port #1
    end if;
  end process;
  process (clk) begin
    if rising_edge(clk) then
      mem( d ) <= v;      -- write port #2
    end if;
  end process;
  u <= mem( e );          -- read port #2
end main;

2.11.2.2 Memory Arrays in Hardware

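Most simple memory arrays are single- or dual-ported, support just one write operation at a time, and have an interface protocol using a clock and write-enable.

[Figure: a single-port memory block with ports WE, A, DI, and DO, and a dual-port memory block with ports WE, A0, DI0, DO0, A1, and DO1.]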


2.11.2.3 VHDL Code for Single-Port Memory Array

library ieee;
use ieee.std_logic_1164.all;

package mem_pkg is
  subtype data is std_logic_vector(7 downto 0);
  type data_vector is array( natural range <> ) of data;
end;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.mem_pkg.all;

entity mem is
  port (
    clk : in  std_logic;
    we  : in  std_logic;             -- write enable
    a   : in  unsigned(4 downto 0);  -- address
    di  : in  data;                  -- data_in
    do  : out data                   -- data_out
  );
end mem;

architecture main of mem is
  signal mem : data_vector(31 downto 0);
begin
  -- combinational read
  do <= mem( to_integer( a ) );
  -- clocked write, gated by write enable
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        mem( to_integer( a ) ) <= di;
      end if;
    end if;
  end process;
end main;

The VHDL code above is accurate in its behaviour and interface, but might be synthesized as distributed memory (a large number of flip-flops in FPGA cells), which will be very large and very slow in comparison to a block of memory.

Synopsys synthesis tools implement each bit in a two-dimensional array as a flip-flop.

Each FPGA and ASIC vendor supplies libraries of memory arrays that are smaller and faster than a two-dimensional array of flip-flops. These libraries exploit specialized hardware on the chips to implement the memory.

Note: To synthesize a reasonable implementation of a memory array with Synopsys, you must instantiate a vendor-supplied memory component.

Some other synthesis tools, such as Xilinx XST, can infer memory arrays from two-dimensional arrays and synthesize efficient implementations.
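For tools that do infer memories, a commonly used coding style registers the address so that the memory matches the registered-input behaviour described in section 2.11.1. The sketch below is an illustration under assumptions: the entity name mem_sync and the signal names ram and a_reg are hypothetical, and whether a particular tool maps this code to a memory block depends on the tool and the target device, so consult your tool's documentation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 32 x 8 single-port memory with a registered address.
entity mem_sync is
  port (
    clk : in  std_logic;
    we  : in  std_logic;                     -- write enable
    a   : in  unsigned(4 downto 0);          -- address
    di  : in  std_logic_vector(7 downto 0);  -- data in
    do  : out std_logic_vector(7 downto 0)   -- data out
  );
end mem_sync;

architecture main of mem_sync is
  type data_vector is array (natural range <>) of std_logic_vector(7 downto 0);
  signal ram   : data_vector(31 downto 0);
  signal a_reg : unsigned(4 downto 0);
begin
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        ram( to_integer( a ) ) <= di;   -- clocked write
      end if;
      a_reg <= a;                       -- register the address
    end if;
  end process;
  -- the output follows the address captured on the previous clock edge
  do <= ram( to_integer( a_reg ) );
end main;
```

Keeping the array inside its own entity, as in step 2 of the design process below, makes it straightforward to swap this architecture for a vendor-supplied component later.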


Recommended Design Process with Memory

1. high-level model with two-dimensional array

2. two-dimensional array packaged inside memory entity/architecture

3. vendor-supplied component

2.11.2.4 Using Library Components for Memory

Altera

Altera uses “MegaFunctions” to implement RAM in VHDL. A MegaFunction is a black-box description of hardware on the FPGA. There are tools in Quartus to generate VHDL code for RAM components of different sizes. In E&CE 327 we will provide you with the VHDL code for the RAM components that you will need in Lab-3 and the Project.

The APEX20KE chips that we are using have dedicated SRAM blocks called Embedded System Blocks (ESB). Each ESB can store 2048 bits and can be configured in any of the following sizes:

Number of Elements    Word Size (bits)
2048                  1
1024                  2
512                   4
256                   8
128                   16

Xilinx

Use component instantiation to get these components:

ram16x1s    16×1 single-ported memory
ram16x1d    16×1 dual-ported memory

Other sizes are also available; consult the datasheet for your chip.


2.11.2.5 Build Memory from Slices

If the vendor's libraries of memory components do not include one that is the correct size for your needs, you can construct your own component from smaller ones.

[Figure: two N×W memories side by side sharing WriteEn, Addr, and Clk. One stores DataIn[W-1..0] and drives DataOut[W-1..0]; the other stores DataIn[2W-1..W] and drives DataOut[2W-1..W].]

Figure 2.7: An N×2W memory from N×W components

[Figure: two N×W memories sharing Addr[logN-1..0], DataIn, and Clk. Addr[logN] steers WriteEn to one of the two memories and selects, through a multiplexer, which memory drives DataOut.]

Figure 2.8: A 2N×W memory from N×W components
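The depth-doubling arrangement of Figure 2.8 can be sketched in VHDL as follows. To keep the example self-contained, the N×W building block is a behavioural 16×4 memory; in a real design it would be a vendor-supplied component. The entity names ram16x4 and ram32x4 and all internal signal names are assumptions made for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical N x W building block (N = 16, W = 4), modelled behaviourally.
entity ram16x4 is
  port (
    clk, we  : in  std_logic;
    addr     : in  unsigned(3 downto 0);
    data_in  : in  std_logic_vector(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4;

architecture behav of ram16x4 is
  type ram_ty is array (0 to 15) of std_logic_vector(3 downto 0);
  signal ram : ram_ty;
begin
  data_out <= ram( to_integer( addr ) );
  process (clk) begin
    if rising_edge(clk) then
      if we = '1' then
        ram( to_integer( addr ) ) <= data_in;
      end if;
    end if;
  end process;
end behav;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- 2N x W memory built from two N x W blocks.
entity ram32x4 is
  port (
    clk, we  : in  std_logic;
    addr     : in  unsigned(4 downto 0);
    data_in  : in  std_logic_vector(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram32x4;

architecture main of ram32x4 is
  signal we_lo, we_hi   : std_logic;
  signal out_lo, out_hi : std_logic_vector(3 downto 0);
begin
  -- the top address bit steers the write enable to one block
  we_lo <= we and not addr(4);
  we_hi <= we and addr(4);

  lo : entity work.ram16x4
    port map (clk => clk, we => we_lo, addr => addr(3 downto 0),
              data_in => data_in, data_out => out_lo);
  hi : entity work.ram16x4
    port map (clk => clk, we => we_hi, addr => addr(3 downto 0),
              data_in => data_in, data_out => out_hi);

  -- the same address bit selects which block drives the output
  data_out <= out_hi when addr(4) = '1' else out_lo;
end main;
```

Widening the word instead, as in Figure 2.7, needs no extra logic: both blocks share the address and write enable, and each handles its own slice of the data.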


A 16×4 Memory from 16×1 Components

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram16x4s is
  port (
    clk, we  : in  std_logic;
    data_in  : in  std_logic_vector(3 downto 0);
    addr     : in  unsigned(3 downto 0);
    data_out : out std_logic_vector(3 downto 0)
  );
end ram16x4s;

architecture main of ram16x4s is
  component ram16x1s
    port (
      d              : in  std_logic;  -- data in
      a3, a2, a1, a0 : in  std_logic;  -- address
      we             : in  std_logic;  -- write enable
      wclk           : in  std_logic;  -- write clock
      o              : out std_logic   -- data out
    );
  end component;
begin
  mem_gen:
  for i in 0 to 3 generate
    ram : ram16x1s
      port map (
        we   => we,
        wclk => clk,
        ----------------------------------------------
        -- d and o are dependent on i
        a3 => addr(3), a2 => addr(2),
        a1 => addr(1), a0 => addr(0),
        d  => data_in(i),
        o  => data_out(i)
        ----------------------------------------------
      );
  end generate;
end main;


2.11.2.6 Dual-Ported Memory

Dual-ported memory is similar to single-ported memory, except that it allows two simultaneous reads, or a simultaneous read and write.

When doing a simultaneous read and write to the same address, the read will usually not see the data currently being written.

Question: Why do dual-ported memories usually not support writes on both ports?

Answer: What should your memory do if you write different values to the same address in the same clock cycle?
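The read-during-write behaviour described above can be illustrated with a behavioural model. This is a simplified sketch under stated assumptions, not a vendor component: it has one write port and one read port, a synchronous read, and hypothetical names (dp_mem_sketch, wr_addr, rd_addr, wr_data, rd_data). Because mem is a signal, a read in the same clock cycle as a write to the same address returns the old contents.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 16x8 memory with one write port and one read port.
entity dp_mem_sketch is
  port (
    clk     : in  std_logic;
    we      : in  std_logic;
    wr_addr : in  unsigned(3 downto 0);
    wr_data : in  std_logic_vector(7 downto 0);
    rd_addr : in  unsigned(3 downto 0);
    rd_data : out std_logic_vector(7 downto 0)
  );
end dp_mem_sketch;

architecture main of dp_mem_sketch is
  type mem_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal mem : mem_ty;
begin
  process begin
    wait until rising_edge(clk);
    if we = '1' then
      mem(to_integer(wr_addr)) <= wr_data;
    end if;
    -- The signal update above takes effect after this clock edge,
    -- so this read sees the value stored before the write.
    rd_data <= mem(to_integer(rd_addr));
  end process;
end main;
```

Whether a synthesis tool maps such a description onto a particular memory primitive depends on the tool and the target device.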

2.11.3 Data Dependencies

Definition of Three Types of Dependencies

There are three types of data dependencies. The names come from pipeline terminology in computer architecture.

Read after Write      Write after Write     Write after Read
(True dependency)     (Load dependency)     (Anti dependency)

M[i] := ...           M[i] := ...           ...  := M[i]
...  := M[i]          M[i] := ...           M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.


Purpose of Dependencies

W0:  R3 := ......
W1:  R3 := ......           (producer)
R1:  ... := ... R3 ...      (consumer)
W2:  R3 := ......

WAW ordering prevents W0 from happening after W1.
RAW ordering prevents R1 from happening before W1.
WAR ordering prevents W2 from happening before R1.

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations

Initial memory contents:

M[3]  M[2]  M[1]  M[0]
30    20    10    0

Initial Program with Dependencies

M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

[Figure panels: Valid Modification; Valid (or Bad?) Modification. Each panel shows a reordering of the program above.]

Answer: Bad modification: M[3] := 32 must happen before C := M[3].
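The effect of violating a dependency can be checked in simulation. The sketch below is simulation-only and makes two assumptions that are labelled in the code: the memory starts as M[3..0] = 30, 20, 10, 0, and the "bad" reordering shown is one illustrative choice (moving C := M[3] above M[3] := 32), not necessarily the one drawn in the figure. The entity name dep_order_sketch is hypothetical.

```vhdl
entity dep_order_sketch is
end dep_order_sketch;

architecture sim of dep_order_sketch is
begin
  process
    type mem_ty is array (0 to 3) of integer;
    -- Assumed initial contents: M[3..0] = 30, 20, 10, 0.
    variable M       : mem_ty := (0 => 0, 1 => 10, 2 => 20, 3 => 30);
    variable A, B, C : integer;
  begin
    -- Original program order.
    M(2) := 21;
    M(3) := 31;
    A := M(2);
    B := M(0);
    M(3) := 32;
    M(0) := 1;
    C := M(3);
    assert A = 21 and B = 0 and C = 32
      report "original order gave unexpected values" severity failure;

    -- Illustrative bad reordering: the read of M(3) is moved above
    -- the write M(3) := 32, violating a read-after-write dependency.
    M := (0 => 0, 1 => 10, 2 => 20, 3 => 30);
    M(2) := 21;
    M(3) := 31;
    A := M(2);
    B := M(0);
    C := M(3);   -- moved earlier
    M(3) := 32;
    M(0) := 1;
    assert C = 31
      report "reordered program gave unexpected values" severity failure;

    report "dependency check finished";
    wait;
  end process;
end sim;
```

In the reordered version C receives 31 rather than 32, which is why the read of M[3] must stay after the write of 32.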


2.11.4 Memory Arrays and Dataflow Diagrams

Legend for Dataflow Diagrams

[Legend symbols: Input port (name), Output port (name), State signal (name), Array read (name(rd)), Array write (name(wr)).]

Basic Memory Operations

Memory Read: data := mem[addr];
[Dataflow diagram: mem(rd) consumes mem and addr; it produces data and, through an anti-dependency arrow, mem.]

Memory Write: mem[addr] := data;
[Dataflow diagram: mem(wr) consumes mem, data, and addr; it produces mem.]

Dataflow diagrams show the dependencies between operations. The basic memory operations are similar, in that each arrow represents a data dependency.

There are a few aspects of the basic memory operations that are potentially surprising:

• The anti-dependency arrow producing mem on a read.

• Reads and writes are dependent upon the entire previous value of the memory array.

• The write operation appears to produce an entire memory array, rather than just updating an individual element of an existing array.

Normally, we think of a memory array as stationary. To do a read, an address is given to the array and the corresponding data is produced. In dataflow diagrams, it may be somewhat surprising to see the read and write operations consuming and producing memory arrays.

Our goal is to support memory operations in dataflow diagrams. We want to model memory operations similarly to datapath operations. When we do a read, the data that is produced is dependent upon the contents of the memory array and the address. For write operations, the apparent dependency on, and production of, an entire memory array is because we do not know which address in the array will be read from or written to. The anti-dependency for memory reads is related to Write-after-Read dependencies, as discussed in Section 2.11.3. There are optimizations that can be performed when we know the address (Section 2.11.4).


Dataflow Diagrams and Data Dependencies

Read after Write

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Dataflow diagram: mem(wr) consumes mem, data_in, and wr_addr; mem(rd) consumes the resulting mem and rd_addr, and produces data_out and mem.]

Optimization when rd_addr ≠ wr_addr

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Dataflow diagram: mem(wr) and mem(rd) drawn side by side, both consuming the incoming mem.]

Write after Write

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;

[Dataflow diagram: the first mem(wr) consumes mem, data1, and wr1_addr; the second mem(wr) consumes the resulting mem, data2, and wr2_addr.]

Scheduling option when wr1_addr ≠ wr2_addr

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;

[Dataflow diagram: the two mem(wr) operations drawn side by side.]

Write after Read

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Dataflow diagram: mem(rd) consumes mem and rd_addr and produces rd_data; mem(wr) follows, consuming mem, wr_addr, and wr_data.]

Optimization when rd_addr ≠ wr_addr

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Dataflow diagram: mem(rd) and mem(wr) drawn side by side.]


2.11.5 Example: Memory Array and Dataflow Diagram

1  M[2] := 21
2  M[3] := 31
3  A    := M[2]
4  B    := M[0]
5  M[3] := 32
6  M[0] := 01
7  C    := M[3]

[Dataflow diagram: the seven operations above drawn as a chain of M(wr) and M(rd) nodes, numbered 1 to 7, each consuming the memory array produced by the previous node.]

Figure 2.9: Memory array example code and initial dataflow diagram

The dependency and anti-dependency arrows in the dataflow diagram in Figure 2.9 are based solely upon whether an operation is a read or a write. The arrows do not take into account the address that is read from or written to.

In Figure 2.10, we have used knowledge about which addresses we are accessing to remove unneeded dependencies. These are the real dependencies and match those shown in the code fragment for Figure 2.9. In Figure 2.11 we have placed an ordering on the read operations and an ordering on the write operations. The ordering is derived by obeying data dependencies and then rearranging the operations to perform as many operations in parallel as possible.

[Figure 2.10: Memory array with minimal dependencies. The M(wr) and M(rd) nodes of Figure 2.9, with only the address-based dependencies retained.]

[Figure 2.11: Memory array with orderings. The diagram of Figure 2.10 with the reads numbered 1 to 3 and the writes numbered 1 to 4.]

[Figure 2.12: Final version of Figure 2.9. The ordered operations of Figure 2.11 arranged into clock cycles.]

Put as many parallel operations into the same clock cycle as allowed by resources. Preserve dependencies by putting dependent operations in separate clock cycles.


2.12 Input / Output Protocols

An important aspect of hardware design is choosing an input/output protocol that is easy to implement and suits both your circuit and your environment. Here are a few simple and common protocols.

[Figure 2.13: Four phase handshaking protocol. Waveforms: rdy, data, ack.]

Used when timing of communication between producer and consumer is unpredictable. The disadvantage is that it is cumbersome to implement and slow to execute.

[Figure 2.14: Valid-bit protocol. Waveforms: clk, data, valid.]

A low overhead (both in area and performance) protocol. Consumer must always be able to accept incoming data. Often used in pipelined circuits. More complicated versions of the protocol can handle pipeline stalls.
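A single pipeline stage using the valid-bit protocol can be sketched as follows. This is an illustrative sketch only: the entity name valid_stage_sketch, the 8-bit signed data type, and the "+ 1" placeholder computation are assumptions, not part of the notes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical one-stage pipeline with a valid bit.
entity valid_stage_sketch is
  port (
    clk     : in  std_logic;
    i_valid : in  std_logic;
    i_data  : in  signed(7 downto 0);
    o_valid : out std_logic;
    o_data  : out signed(7 downto 0)
  );
end valid_stage_sketch;

architecture main of valid_stage_sketch is
begin
  process begin
    wait until rising_edge(clk);
    -- The valid bit travels alongside the data, one register per stage.
    o_valid <= i_valid;
    if i_valid = '1' then
      o_data <= i_data + 1;  -- placeholder computation (assumed)
    end if;
  end process;
end main;
```

The stage never refuses data, which is why the consumer must always be able to accept whatever arrives with o_valid asserted.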

[Figure 2.15: Start/Done protocol. Waveforms: clk, data_in, start, done, data_out.]

A low overhead (both in area and performance) protocol. Useful when a circuit works on one piece of data at a time and the time to compute the result is unpredictable.


2.13 Example: Moving Average

In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.

Time     0  1  2  3  4  5  6  7  8  9  10
i_data   2  3  5  6  6  0  2  2  5  3  1
o_avg             4  5  4  3

2.13.1 Requirements and Environmental Assumptions

1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.

2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.

3. Input data will be 8-bit signed numbers.

4. When output data is ready, o_valid shall be asserted.

5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.

2.13.2 Algorithm

We begin by exploring the mathematical behaviour of the system. To simplify the analysis at this abstract level, we ignore bubbles and time. We focus only on the valid data. If we had an input stream of data $x_i$ ($x_i$ is the value of the $i$th valid data of i_data), the equation for the output would be:

$\mathit{avg}_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i)/4$

To simplify our analysis of the equation, we decompose the computation into computing the sum of the four most recent data and dividing the sum by four:

$\mathit{sum}_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i$

$\mathit{avg}_i = \mathit{sum}_i/4$

We look at the equation of $\mathit{sum}$ over several iterations to try to identify patterns that we can use to optimize our design:

$\mathit{sum}_5 = x_2 + x_3 + x_4 + x_5$

$\mathit{sum}_6 = x_3 + x_4 + x_5 + x_6$

$\mathit{sum}_7 = x_4 + x_5 + x_6 + x_7$

We see that part of the calculations that are done for index $i$ are the same as those for $i+1$:

$\mathit{sum}_5 = x_2 + (x_3 + x_4 + x_5)$

$\mathit{sum}_6 = (x_3 + x_4 + x_5) + x_6 = \mathit{sum}_5 - x_2 + x_6$

We check a few more samples and conclude that we can generalize the above for index $i$ as:

$\mathit{sum}_i = \mathit{sum}_{i-1} - x_{i-4} + x_i$

$\mathit{avg}_i = \mathit{sum}_i/4$
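As a check, the recurrence can be applied to the example input stream from the start of this section (2, 3, 5, 6, 6, 0, 2, ...). The indexing below is an assumption: the first value is taken to be $x_0$, and the division is truncated to an integer as the requirements specify.

```latex
\begin{align*}
\mathit{sum}_3 &= x_0 + x_1 + x_2 + x_3 = 2 + 3 + 5 + 6 = 16,
  & \mathit{avg}_3 &= 16/4 = 4 \\
\mathit{sum}_4 &= \mathit{sum}_3 - x_0 + x_4 = 16 - 2 + 6 = 20,
  & \mathit{avg}_4 &= 20/4 = 5 \\
\mathit{sum}_5 &= \mathit{sum}_4 - x_1 + x_5 = 20 - 3 + 0 = 17,
  & \mathit{avg}_5 &= \lfloor 17/4 \rfloor = 4 \\
\mathit{sum}_6 &= \mathit{sum}_5 - x_2 + x_6 = 17 - 5 + 2 = 14,
  & \mathit{avg}_6 &= \lfloor 14/4 \rfloor = 3
\end{align*}
```

These agree with the o_avg values 4, 5, 4, 3 listed with the example stream.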

The equation for $\mathit{sum}_i$ is dependent on $x_i$ and $x_{i-4}$, therefore we need the current input value and we need to store the four most recent input data. These four most recent data form a sliding window: each time we receive valid data, we remove the oldest data value ($x_{i-4}$) and insert the new data ($x_i$).

Summary of system behaviour deduced from exploring requirements and algorithm:

1. Define a signal new for the value of i_data each time that i_valid is ’1’.

2. Define a memory array M to store a sliding window of the four most recent values of i_data.

3. Define a signal old for the oldest data value from the sliding window.

4. Update $\mathit{sum}_i$ with $\mathit{sum}_{i-1} - \mathit{old}_i + \mathit{new}_i$.
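The four steps above can be exercised with a simulation-only reference model before any hardware decisions are made. The sketch below is not synthesizable and is not the design developed in this section; it assumes a binary index in place of the one-hot index used later, a window that starts at zero, and hypothetical names (mov_avg_model, new_v, old_v). The suffix _v is needed because new is a reserved word in VHDL.

```vhdl
entity mov_avg_model is
end mov_avg_model;

architecture sim of mov_avg_model is
begin
  process
    type int_vec is array (natural range <>) of integer;
    -- Example input stream and the first four averages from the text.
    constant x        : int_vec(0 to 10) := (2, 3, 5, 6, 6, 0, 2, 2, 5, 3, 1);
    constant expected : int_vec(3 to 6)  := (4, 5, 4, 3);
    variable M        : int_vec(0 to 3)  := (others => 0);  -- sliding window
    variable idx      : natural := 0;    -- position of the oldest value
    variable sum      : integer := 0;
    variable new_v    : integer;
    variable old_v    : integer;
  begin
    for i in x'range loop
      new_v  := x(i);                    -- step 1
      old_v  := M(idx);                  -- step 3
      sum    := sum - old_v + new_v;     -- step 4
      M(idx) := new_v;                   -- step 2: overwrite the oldest value
      idx    := (idx + 1) mod 4;
      if i >= 3 and i <= 6 then
        -- Integer division truncates, as the requirements specify.
        assert sum / 4 = expected(i)
          report "average mismatch" severity failure;
      end if;
    end loop;
    report "moving-average model finished";
    wait;
  end process;
end sim;
```

A model like this gives known-good values against which the later register-transfer-level code can be compared.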

Sliding Window

There are two principal ways to implement a sliding window:

shift-register: Each time new data is loaded, all of the registers are loaded with the data in the register to their right or left and the leftmost or rightmost register is loaded with new data: R[0] = new and R[i] = R[i−1].

circular buffer: Once a data value is loaded into the buffer, the data remains in the same location until it is overwritten. When a new value is loaded, the new value overwrites the oldest value in the buffer. None of the other elements in the buffer change. A state machine keeps track of the position (address) of the oldest piece of data. The state machine increments to point to the next register, which now holds the oldest piece of data.
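For comparison with the circular buffer that the rest of this section develops, the shift-register alternative can be sketched as below. The entity name shift_window_sketch, the port names, and the 8-bit signed element type are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical four-element sliding window built as a shift register.
entity shift_window_sketch is
  port (
    clk     : in  std_logic;
    i_valid : in  std_logic;
    i_data  : in  signed(7 downto 0);
    oldest  : out signed(7 downto 0)
  );
end shift_window_sketch;

architecture main of shift_window_sketch is
  type window_ty is array (0 to 3) of signed(7 downto 0);
  signal R : window_ty;
begin
  process begin
    wait until rising_edge(clk);
    if i_valid = '1' then
      -- Every register changes on each load: R[0] = new, R[i] = R[i-1].
      R(0) <= i_data;
      for i in 1 to 3 loop
        R(i) <= R(i - 1);
      end loop;
    end if;
  end process;

  -- The oldest value is always in the last register; no index is needed.
  oldest <= R(3);
end main;
```

The shift register needs no index or read multiplexer, but all four registers change on every load, which is the cost the circular buffer avoids.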

[Figure: contents of the sliding window as the values α, β, γ, δ, ε, ζ, η, ι, κ, λ arrive. Left panel, "Shift register": columns old, M[3], M[2], M[1], M[0], new. Right panel, "Circular Buffer": columns old, M[0..3], new.]

The circular buffer design is usually preferable, because only one element changes value per clock cycle. This allows the buffer to be implemented with a memory array rather than a set of registers. Also, by having only one element change value, power consumption is reduced (fewer capacitors charging and discharging).

We have only four items to store, so we will use registers, rather than a memory array. For fewer than sixteen items, registers are generally cheaper. For sixteen items, the choice between registers and a memory array is highly dependent on the design goals (e.g. speed vs area) and implementation technology.

Now that we have designed the storage module, we see that rather than a write-enable and address signal, the actual signals we need are four chip-enable signals. This suggests that we should use a one-hot encoding for the index of the oldest element in the circular buffer.

Because we have a one-hot encoding for the index, we do not use normal multiplexers to select which register to read from. Normal multiplexers take a binary-encoded select signal. Instead, we will use a 4:1 decoded mux, which is just four AND gates followed by a 4-input OR gate. Because the data is 8 bits wide, each of the AND gates and the OR gate is 8 bits wide.
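The 4:1 decoded mux can be written directly as AND and OR operations. The following is a sketch under assumptions: the entity name decoded_mux4_sketch, the port names m0 to m3 and q, and the helper function rep are hypothetical, and idx is assumed to be one-hot.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 4:1 decoded multiplexer for 8-bit data.
entity decoded_mux4_sketch is
  port (
    idx            : in  std_logic_vector(3 downto 0);  -- one-hot select
    m0, m1, m2, m3 : in  std_logic_vector(7 downto 0);
    q              : out std_logic_vector(7 downto 0)
  );
end decoded_mux4_sketch;

architecture main of decoded_mux4_sketch is
  -- Replicate one select bit across all 8 data bits.
  function rep (b : std_logic) return std_logic_vector is
    variable v : std_logic_vector(7 downto 0) := (others => b);
  begin
    return v;
  end function;
begin
  -- Four 8-bit-wide AND gates followed by one 8-bit-wide 4-input OR gate.
  q <= (m0 and rep(idx(0))) or
       (m1 and rep(idx(1))) or
       (m2 and rep(idx(2))) or
       (m3 and rep(idx(3)));
end main;
```

If idx is not one-hot, the output is the OR of every selected input, so the one-hot property must be guaranteed by the index register.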


[Figure: Register array with chip-enables and decoded multiplexer. Four 8-bit registers M[0] to M[3] (ports CE, D, Q) share the data input d; we and addr generate the chip enables ce[0] to ce[3]; idx[0] to idx[3] select which register drives the 8-bit output q.]

2.13.3 Pseudocode and Dataflow Diagrams

There are three different notations that we use to describe the behaviour of hardware systems abstractly: mathematical equations (for datapath-centric designs), state machines (for control-dominated designs), and pseudocode (for algorithms or designs with memory). Our pseudocode is similar to three-address assembly code: each line of code has a target variable, an operation, and one or two operand variables (e.g., C = A + B). The name “three address” comes from the fact that there are three addresses, or variables, in each line.

We use the three-address style of pseudocode, because each line of pseudocode then corresponds to a single datapath operation in the dataflow diagram. This gives us greater flexibility to optimize the pseudocode by rescheduling operations.

From the three-address pseudocode, we will construct dataflow diagrams.

As an aside, in contrast to three-address languages, some assembly languages for extremely small processors are limited to two addresses. The target must be the same as one of the operands (e.g., A = A + B).


First Pseudocode

For the first pseudocode, we do not restrict ourselves to three addresses. In the second version of the code, we decompose the line that updates sum into two separate lines that obey the three-address restriction.

Pseudo pseudocode

    new = i_data
    old = M[idx]
    sum = sum - old + new
    M[idx] = new
    idx = idx rol 1
    o_avg = sum/4

Real 3-address pseudocode

    new = i_data
    old = M[idx]
    tmp = sum - old
    sum = tmp + new
    M[idx] = new
    idx = idx rol 1
    o_avg = sum/4

Data-Dependency Graph

To begin to understand what the hardware might be, we draw a data-dependency graph for the pseudocode.

[Data-dependency graph: inputs sum, i_data, M, and idx; a memory read (Rd) producing old; a subtraction producing tmp; an addition with new producing sum; a memory write (Wr) producing M; a rotate-by-1 producing idx; a wired shift producing o_avg.]

Optimizing the Data-Dependency Graph

In our design work so far, we have ignored bubbles and time. As we evolve from the pseudocode to a data-dependency graph and then to a dataflow graph, we will include the effect of the bubbles in our analysis.

In the data-dependency graph we observe that we have two arithmetic operations: subtraction and addition. The requirements guarantee that there are at least two clock cycles of bubbles between each parcel of valid data, so we have the ability to reuse hardware.

In contrast, we would not be able to reuse hardware if either we had to accept new data in each clock cycle or we needed a fully pipelined circuit. If we had to accept new data in each clock cycle, and were not pipelined, then the work would need to be completed in a single clock cycle. If the design was to be fully pipelined, then each parcel of data would stay in each stage for exactly one clock cycle: there would be no opportunity for a parcel to visit a stage twice, and hence no opportunity for reuse.

For our design, where we are attempting to reuse hardware, we hypothesize that a single adder/subtracter is cheaper than a separate adder and a subtracter. We would like to combine the two lines:

    tmp = sum - old
    sum = tmp + new

Looking at the data-dependency graph, we see that old is coming from memory and new is coming from either a register or combinational logic. We cannot allocate new and old to the same hardware, because new and old are not the same type of hardware: old is read from an array of registers and new is a register. So, we will need a multiplexer for the second operand, to choose between reading from old and new. A multiplexer might also be required for the first operand to choose between sum and tmp. But, both of these signals are regular signals, so we might be able to allocate both sum and tmp to the same register or datapath output, and hence avoid a multiplexer for the first operand. We will decide how to deal with the first operand when we do register and datapath allocation.

We remove the need for a multiplexer for the second operand by reading new from memory. To accomplish this, we re-write the pseudocode so that we first write i_data to memory, and then read new from memory. The three versions of the pseudocode below show the transformations. The data-dependency graph is for the third version of the pseudocode.


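Remove intermediate signal old

    new = i_data
    tmp = sum - M[idx]
    sum = tmp + new
    M[idx] = new
    idx = idx rol 1
    o_avg = sum/4

Optimize code by reading new from memory

    tmp = sum - M[idx]
    M[idx] = i_data
    new = M[idx]
    sum = tmp + new
    idx = idx rol 1
    o_avg = sum/4

Remove intermediate signal new

    tmp = sum - M[idx]
    M[idx] = i_data
    sum = tmp + M[idx]
    idx = idx rol 1
    o_avg = sum/4

Data-dependency graph after removing new

[Data-dependency graph: inputs sum, M, idx, and i_data; a first memory read (Rd) producing old; a subtraction producing tmp; a memory write (Wr) of i_data; a second memory read (Rd) producing new; an addition producing sum; a rotate-by-1 producing idx; a wired shift producing o_avg.]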

Dataflow Diagram

To construct a dataflow diagram, we divide the data-dependency graph into clock cycles. Because we are using registers rather than a memory array, we can schedule the first read and first write operations in the same clock cycle, even though they use the same address. In contrast, with memory arrays it generally is risky to rely on the value of the output data in the clock cycle in which we are doing a write (Section 2.11.1).

We need a second clock cycle for the second read from memory.

We now explore two options: with and without a third clock cycle; both are shown below. The difference between the two options is whether the signals idx and sum refer to the output of registers or the combinational datapath units (sum being the output of the adder/subtracter and idx being the output of a rotation). With a latency of three clock cycles, idx is a registered signal. With a latency of two clock cycles, idx and sum are combinational.

It is a bit misleading to describe the rotate-left unit for idx as combinational, because it is simply a wire connecting one flip-flop to another. However, conceptually and for correct behaviour, it is helpful to think of the rotation unit as a block of combinational circuitry. This allows us to distinguish between the output of the idx register and the input to the register (which is the output of the rotation unit). Without this distinction, we might read the wrong value of idx and be out-of-sync by one clock cycle.


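Latency of three clock cycles

[Dataflow diagram: the data-dependency graph divided into clock cycles with state boundaries S0, S1, S2, S0; the outputs M, sum, and idx are taken after the final boundary.]

Latency of two clock cycles

[Dataflow diagram: the data-dependency graph divided into clock cycles with state boundaries S0, S1, S0; sum, o_avg, M, and idx are produced within S1.]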

From a performance point of view, a latency of two is somewhat preferable. By keeping our latency low, there may be another module that will benefit by having an additional clock cycle in which to do its work. The counter argument is that we have two clock cycles of bubbles, which means that we can tolerate a latency of up to three without a need to pipeline. We’ll be efficient engineers and try to achieve a latency of two.

The two dataflow diagrams appear to be very similar, but in the dataflow diagram with a latency of two, a multiplexer will be needed for the address signal of the circular buffer. In S0, the address input to the circular buffer is the output of the rotator. In S1, the address is the output of a register.

To eliminate the need for a multiplexer on the address input to the circular buffer, we move the rotation from S0 to S1, so that the address is always a registered signal.


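Latency of two clock cycles with registered address

[Dataflow diagram: state boundaries S0, S1, S0; the first read (Rd) and the write (Wr) are in S0; the second read (Rd) and the rotation of idx (wired shift) are in S1; outputs are sum, o_avg, M, and idx.]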

Register and Datapath Allocation

Register allocation is simple: idx and sum are each allocated to registers with their same names (e.g., idx and sum) on the first clock cycle boundary. For the second boundary, we similarly allocate idx to the register idx. This leaves us with the register sum for the output of the adder/subtracter.

Datapath Allocation

Datapath allocation is even simpler than register allocation: we have one adder/subtracter (as1) and one rotate-left (rol).


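[Dataflow diagram with allocation: the diagram for a latency of two clock cycles with registered address, annotated with the registers sum and idx on each clock cycle boundary, the adder/subtracter as1 in both S0 and S1, and the rotate-left unit rol in S1.]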

2.13.4 Control Tables and State Machine

From the dataflow diagram, we construct a control table. For memory (M) we need: write enable, address, and data input columns. For registers (idx, sum) we need chip enable and data input columns. For datapath components we need data inputs, plus a control signal to determine whether as1 does addition or subtraction. We name the signal as1.sub, where a value of true means to do a subtraction and false means to do an addition.

We proceed in two steps, first ignoring bubbles, then extending our design to handle bubbles.

Register control table

          M                 idx          sum
          we   addr   d     ce   d       ce   d
    S0    1    idx    x     0    –       1    as1
    S1    0    idx    –     1    rol     1    as1

Datapath control table

          as1                   rol
          sub   src1   src2     src1   src2
    S0    1     sum    M        –      –
    S1    0     sum    M        idx    1

In S0, as1 subtracts the oldest value (tmp = sum - M[idx]). In S1, as1 adds the value that was just written (sum = tmp + M[idx]).

Optimized control table

          M     idx    as1
          we    ce     sub
    S0    1     0      1
    S1    0     1      0

Static assignments in control table

    M.addr   = idx
    M.d      = x
    idx.d    = rol
    sum.d    = as1
    as1.src1 = sum
    as1.src2 = M

Control Table and Bubbles

If the circuit always had valid parcels arriving in every other clock cycle, then we could proceed directly from our dataflow diagram and optimized control table to VHDL code. However, the indeterminate number of bubbles complicates the design of our state machine.

We add an idle mode to our state machine. The circuit is in idle mode when there is not a valid parcel in the circuit. By “idle”, we mean that all write enable signals are turned off, chip enable signals are turned off, and the state machine does not change state. The state machine for the control table must resume in state S0 when i_valid becomes true.

In the optimized control table, sum does not need a chip enable, but with the addition of idle mode, we will need to use a chip enable with sum.

The multiplexers for the datapath components are unaffected by the addition of idle mode. When the circuit is in idle mode, the registers do not load new data, and so the behaviour of the datapath components is unconstrained.

The final control table is below.

Almost final control table

            M     idx    sum    as1
            we    ce     ce     sub
    S0      1     0      1      1
    S1      0     1      1      0
    idle    0     0      0      –

Final control table

            M     idx    sum    as1
            we    ce     ce     sub
    S0      1     0      1      1
    S1      0     1      1      0
    idle    0     0      0      0

Static assignments

    M.addr   = idx
    M.d      = x
    idx.d    = rol
    sum.d    = as1
    as1.src1 = sum
    as1.src2 = M

State Machine

The state machine starts in idle, transitions to S0 when i_valid is true, then goes to S1 in the next clock cycle, and then goes to idle.

We will use a modified one-hot encoding and use the valid-bit signals to hold the state. From the dataflow diagram we see that the latency through the circuit is two clock cycles. We need two valid-bit registers and will have three valid-bit signals: i_valid (input, no register needed), valid1 (register), o_valid (register). For the state encoding, we will use i_valid and valid1.

            i_valid   valid1
    S0      1         0
    S1      0         1
    idle    0         0

Updating the control table to show the state encoding gives us:

Final control table with state encoding

            state                M     idx    sum    as1
            i_valid   valid1     we    ce     ce     sub
    S0      1         0          1     0      1      1
    S1      0         1          0     1      1      0
    idle    0         0          0     0      0      0

Using the state encoding and the final control table, we write equations for the write-enable signals, chip-enable signals, and the adder/subtracter control signal.

    M.we    = i_valid
    idx.ce  = valid1
    sum.ce  = i_valid OR valid1
    as1.sub = i_valid

2.13.5 VHDL Code

    -- The entity, the declarations, and the o_avg assignment are not part
    -- of the original fragment; the names and widths are assumptions.
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mov_avg is
      port (
        clk, reset : in  std_logic;
        i_valid    : in  std_logic;
        i_data     : in  signed(7 downto 0);
        o_valid    : out std_logic;
        o_avg      : out signed(7 downto 0)
      );
    end mov_avg;

    architecture main of mov_avg is
      type window_ty is array (3 downto 0) of signed(7 downto 0);
      signal M            : window_ty;             -- sliding window
      signal idx          : unsigned(3 downto 0);  -- one-hot index of oldest
      signal valid1       : std_logic;
      signal mem_out      : signed(7 downto 0);
      signal sum, add_sub : signed(9 downto 0);    -- sum of four 8-bit values
    begin

      -- valid bits
      process begin
        wait until rising_edge(clk);
        valid1  <= i_valid;
        o_valid <= valid1;
      end process;

      -- idx
      process begin
        wait until rising_edge(clk);
        if reset = '1' then
          idx <= "0001";
        else
          if valid1 = '1' then
            idx <= idx rol 1;
          end if;
        end if;
      end process;

      -- sliding window (M and sum have no reset, as in the original)
      process begin
        wait until rising_edge(clk);
        for i in 3 downto 0 loop
          if (i_valid = '1') and (idx(i) = '1') then
            M(i) <= i_data;
          end if;
        end loop;
      end process;

      -- decoded mux on the one-hot index
      mem_out <= M(0) when idx(0) = '1'
            else M(1) when idx(1) = '1'
            else M(2) when idx(2) = '1'
            else M(3);

      -- add sub: subtract the oldest value in S0, add the new value in S1
      add_sub <= sum - mem_out when i_valid = '1'
            else sum + mem_out;

      -- sum
      process begin
        wait until rising_edge(clk);
        if i_valid = '1' or valid1 = '1' then
          sum <= add_sub;
        end if;
      end process;

      -- o_avg = sum/4 (wired shift: drop the two low-order bits)
      o_avg <= sum(9 downto 2);

    end main;

Hardware

[Hardware block diagram: inputs i_data and i_valid; the valid1 and o_valid registers; the register array M with chip enable and address input; the idx register with chip enable and a wired-shift feedback; the add/sub unit feeding the sum register with chip enable; o_avg taken from sum through a wired shift.]


2.14 Design Problems

P2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.

P2.1.1 Data Structures

If you have to write your own code (i.e. you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?

P2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?

P2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She’s now forming a startup company in Waterloo that will build digital circuits. She’s writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines.

What is your response to each question?
What is your justification for your answer?
What are the tradeoffs between the two options?

0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?

Answer: All projects should use silicon based chips, because biological chips don’t exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?

3. Should all chips have registers on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By “register” we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By “register” we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?

P2.3 Dataflow Diagram Optimization

Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.

[Dataflow diagram with signals a, b, c, d, and e, three instances of component f, and two instances of component g.]

P2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.

P2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:

• you may change the times when signals are read from the environment

• you may not increase the resource usage (input ports, registers, output ports, f components, g components)

• you may not increase the clock period

P2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

    if is_odd(a + d)
        p = (a + d) * 2 + ((b + c) - 1)/4;
    else
        p = (b + c) * 2 + d;

NOTES:
1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.

P2.4.1 Maximum Performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?

What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?

P2.4.2 Minimum area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?

P2.5 Michener: Design and Optimization

Design a circuit named michener that performs the following operation:

    z = (a + d) + ((b - c) - 1)

NOTES:

1. Optimize your design for area.

2. You may schedule the inputs to arrive at any time.

3. You may do algebraic transformations of the specification.

P2.6 Dataflow Diagrams with Memory Arrays

Component DelayRegister 5 nsAdder 25 nsSubtracter 30 nsALU with +, −, >, =,−, AND, XOR 40 nsMemory read 60 nsMemory write 60 nsMultiplication 65 ns2:1 Multiplexor 5 ns

NOTES:1. The inputs of the algorithms area andb.

2. The outputs of the algorithms arep andq.

3. You must register both your inputs and outputs.

4. You may choose to read your input data values at any time andproduce your outputs at anytime. For your inputs, you may read each value only once (i.e.the environment will not sendmultiple copies of the same value).

5. Execution time is measured from when you read your first input until the latter of producingyour last output or the completion of writing a result to memory

6. Mis an internal memory array, which must be implemented as dual-ported memory with oneread/write port and one read port.

7. Msupports synchronous write and asynchronous read.

P2.7 2-bit adder 225

8. Assume all memory address and other arithmetic calculations are within the range of repre-sentable numbers (i.e. no overflows occur).

9. If you need a circuit not on the list above, assume that its delay is 30 ns.

10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P2.6.1 Algorithm 1

Algorithm

q = M[b];
M[a] = b;
p = M[b+1] * a;

Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.6.2 Algorithm 2

q = M[b];
M[a] = q;
p = (M[b-1] * b) + M[b];

Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.

P2.7 2-bit adder

This question compares FPGA and generic-gate implementations of a 2-bit full adder.

P2.7.1 Generic Gates

Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.


P2.7.2 FPGA

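Show the implementation of a 2 bit adder using generic FPGA cells; show the equations for the lookup tables.

[Figure: two generic FPGA cells, each with a combinational block (comb) and a flip-flop with D, Q, CE, S, and R pins. The first cell takes a[0], b[0], and c_in and produces sum[0] and carry_1; the second takes a[1], b[1], and carry_1 and produces sum[1] and c_out.]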

P2.8 Sketches of Problems

1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)

2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))

3. given a dataflow diagram, calculate the clock period that will result in the optimum performance

4. given an algorithm, design a dataflow diagram

5. given a dataflow diagram, design the datapath and finite state machine

6. optimize a dataflow diagram to improve performance or reduce resource usage

7. given an FSM diagram, pick the VHDL code that “best” implements the diagram (correct behaviour, simple, fast hardware), or critique the hardware

Chapter 3

Performance Analysis and Optimization

3.1 Introduction

Hennessy and Patterson’s Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

3.2 Defining Performance

$$\text{Performance} = \frac{\text{Work}}{\text{Time}}$$

You can double your performance by:

doing twice the work in the same amount of time

OR doing the same amount of work in half the time

Benchmarking

Measuring time is easy, but how do we accurately measure work?

The game of benchmarketing is finding a definition of work that makes your system appear to get the most work done in the least amount of time.


Measure of Work       Measure of Performance
clock cycle           MHz
instruction           MIPs
synthetic program     Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program          SPEC
travel 1/4 mile       drag race

The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.

Definition SPEC: Standard Performance Evaluation Corporation. MISSION: “To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems” (http://www.spec.org).

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.

3.3 Comparing Performance

3.3.1 General Equations

Equation for “Big is n% greater than Small”:

$$n\% = \frac{\text{Big} - \text{Small}}{\text{Small}}$$

For the above equation, it can be difficult to remember whether the denominator is the larger number or the smaller number. To see why Small is the only sensible choice, consider the situation where a is 100% greater than b. This means that the difference between a and b is 100% of something. Our only variables are a and b. It would be nonsensical for the difference to be a, because that would mean a − b = a. However, if a − b = b, then for a to be 100% greater than b simply means that a = 2b.

Using the “n% greater” formula, the phrase “The performance of A is n% greater than the performance of B” is:

$$n\% = \frac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}$$


Performance is inversely proportional to time:

$$\text{Performance} = \frac{1}{\text{Time}}$$

Substituting the above equation into the equation for “the performance of A is n% greater than the performance of B” gives:

$$n\% = \frac{\text{Time}_B - \text{Time}_A}{\text{Time}_A}$$

In general, the equation for a fast system to be “n%” faster than a slow system is:

$$n\% = \frac{T_{Slow} - T_{Fast}}{T_{Fast}}$$

Another useful formula is the average time to do one of k different tasks, each of which happens %_i of the time and takes an amount of time T_i each time it is done:

$$T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i)$$
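The two formulas above can be sketched as small Python helpers. The function names and the sample numbers are illustrative assumptions, not part of the course notes.

```python
def percent_faster(t_slow, t_fast):
    """n% faster = (T_slow - T_fast) / T_fast, returned as a percentage."""
    return 100.0 * (t_slow - t_fast) / t_fast


def average_time(tasks):
    """tasks is an iterable of (fraction_of_time, time_per_task) pairs."""
    return sum(fraction * t for fraction, t in tasks)


# Illustrative numbers only: a job that takes 2.0 s on the slow system
# and 1.0 s on the fast one, and a 50/50 mix of 4.0 s and 2.0 s tasks.
print(percent_faster(2.0, 1.0))
print(average_time([(0.5, 4.0), (0.5, 2.0)]))
```

Note that the denominator is always the time of the faster system, which matches the "Small" denominator in the general equation.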

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).

3.3.2 Example: Performance of Printers

             Black and White   Colour
printer1     9ppm              6ppm
printer2     12ppm             4ppm

Question: Which printer is faster at B&W and how much faster is it?

Answer:


BW Performance

$$n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}}$$

$$BW_1 = \frac{1}{9\ \text{ppm}} = 0.1111\ \text{min/page}$$

$$BW_2 = \frac{1}{12\ \text{ppm}} = 0.0833\ \text{min/page}$$

$$BW_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{BW_1 - BW_2}{BW_2} = \frac{0.1111 - 0.0833}{0.0833} = 33\%\ \text{faster}$$

Performance for Different Tasks

Question: If average workload is 90% BW and 10% Colour, which printer is faster and how much faster is it?


Answer:

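$$T_{Avg1} = \%_{BW} \times BW_1 + \%_C \times C_1 = (0.90 \times 0.1111) + (0.10 \times 0.1667) = 0.1167\ \text{min/page}$$

$$T_{Avg2} = \%_{BW} \times BW_2 + \%_C \times C_2 = (0.90 \times 0.0833) + (0.10 \times 0.2500) = 0.1000\ \text{min/page}$$

$$Avg_{Faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}} = \frac{Avg_1 - Avg_2}{Avg_2} = \frac{0.1167 - 0.1000}{0.1000} = 16.7\%\ \text{faster}$$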
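As a quick check of the printer arithmetic, the following Python sketch recomputes both answers from the page-per-minute figures in the table. The variable and function names are assumptions made for this illustration.

```python
def percent_faster(t_slow, t_fast):
    """n% faster = (T_slow - T_fast) / T_fast, as a percentage."""
    return 100.0 * (t_slow - t_fast) / t_fast


# Minutes per page, derived from the ppm table (time = 1 / ppm).
bw1, c1 = 1 / 9, 1 / 6      # printer1: 9 ppm B&W, 6 ppm colour
bw2, c2 = 1 / 12, 1 / 4     # printer2: 12 ppm B&W, 4 ppm colour

# Black-and-white only: printer2 is the faster one.
bw_gain = percent_faster(bw1, bw2)

# Workload of 90% B&W and 10% colour.
avg1 = 0.90 * bw1 + 0.10 * c1
avg2 = 0.90 * bw2 + 0.10 * c2
avg_gain = percent_faster(avg1, avg2)

print(f"B&W: printer2 is {bw_gain:.1f}% faster")
print(f"Mixed workload: printer2 is {avg_gain:.1f}% faster")
```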

Optimizing Performance

Question: If we want to optimize printer1 to match performance of printer2, should we optimize BW or Colour printing?

Answer:

Colour printing is slower, so it appears that we can save more time by optimizing colour printing.

However, look at the extreme case of optimizing colour printing to be instantaneous for P1:

[Figure: bar chart of average time per page (0.000 to 0.150 min/page) for P1 and P2]

Even if we made colour printing instantaneous for printer 1 and kept it the same for printer 2, printer 1 would not be measurably faster.

Amdahl’s law: “Make the common case fast.”

Optimizations need to take into account both run time and frequency of occurrence.

We should optimize black and white printing.

Question: If you have to fire all of the engineers because your stock price plummeted, how can you get printer1 to be faster than printer2?

Note: This question was actually humorous during the high-tech bubble of 2000...

Answer:

Hire more marketing people!

Notice that colour printing on printer 1 is faster than on printer 2. So, marketing suggests that people are increasing the percentage of printing that is done in colour.

Question: Revised question: what percentage of printing must be done in colour for printer1 to beat printer2?


Answer:

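$$T_{Avg1} \le T_{Avg2}$$

$$\%_{BW} \times BW_1 + \%_C \times C_1 \le \%_{BW} \times BW_2 + \%_C \times C_2$$

$$\%_{BW} = 1 - \%_C$$

$$(1 - \%_C) \times BW_1 + \%_C \times C_1 \le (1 - \%_C) \times BW_2 + \%_C \times C_2$$

$$BW_1 + \%_C \times (C_1 - BW_1) \le BW_2 + \%_C \times (C_2 - BW_2)$$

$$\%_C \ge \frac{BW_1 - BW_2}{BW_1 - BW_2 + C_2 - C_1}$$

$$\%_C \ge \frac{0.1111 - 0.0833}{0.1111 - 0.0833 + 0.2500 - 0.1667}$$

$$\%_C \ge 0.25$$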
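The break-even fraction can be checked numerically. This is a minimal Python sketch of the final inequality; the variable names are assumptions for the illustration.

```python
# Minutes per page from the printer table (time = 1 / ppm).
bw1, c1 = 1 / 9, 1 / 6      # printer1
bw2, c2 = 1 / 12, 1 / 4     # printer2

# %C >= (BW1 - BW2) / (BW1 - BW2 + C2 - C1)
break_even = (bw1 - bw2) / ((bw1 - bw2) + (c2 - c1))


def avg_time(frac_colour, bw, colour):
    """Average minutes per page for a given fraction of colour pages."""
    return (1 - frac_colour) * bw + frac_colour * colour


print(f"break-even colour fraction: {break_even:.2f}")
# At the break-even point both printers take the same average time.
print(avg_time(break_even, bw1, c1), avg_time(break_even, bw2, c2))
```

Above the break-even fraction printer1 has the lower average time per page; below it printer2 does.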

3.4 Clock Speed, CPI, Program Length, and Performance

3.4.1 Mathematics

CPI           Cycles per instruction
NumInsts      Number of instructions
ClockSpeed    Clock speed
ClockPeriod   Clock period

$$\text{Time} = \text{NumInsts} \times \text{CPI} \times \text{ClockPeriod}$$

$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$
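The two forms of the time equation are equivalent because ClockPeriod = 1/ClockSpeed. A small Python sketch, using made-up numbers (one million instructions, CPI of 2, 100 MHz clock) purely for illustration:

```python
def time_from_period(num_insts, cpi, clock_period_s):
    """Time = NumInsts x CPI x ClockPeriod."""
    return num_insts * cpi * clock_period_s


def time_from_speed(num_insts, cpi, clock_speed_hz):
    """Time = NumInsts x CPI / ClockSpeed."""
    return num_insts * cpi / clock_speed_hz


# Hypothetical program: 1e6 instructions, CPI = 2, 100 MHz clock (10 ns period).
t1 = time_from_period(1_000_000, 2, 10e-9)
t2 = time_from_speed(1_000_000, 2, 100e6)
print(t1, t2)  # both are about 0.02 seconds
```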

3.4.2 Example: CISC vs RISC and CPI

                   Clock Speed   SPECint
AMD Athlon         1.1GHz        409
Fujitsu SPARC64    675MHz        443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun’s Sparc instruction set). Assume that it requires 20% more instructions to write a program in the Sparc instruction set than the same program requires in IA-32.

Question: Which of the two processors has higher performance?

Answer:

SPECint, SPECfp, and SPEC are measures of performance. Therefore, the higher the SPEC number, the higher the performance. The Fujitsu SPARC64 has higher performance.

Question: What is the ratio between the CPIs of the two microprocessors?

Answer:

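We will use A as the subscript for the Athlon and S as the subscript for the Sparc.

$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$

$$\text{CPI} = \frac{\text{Time} \times \text{ClockSpeed}}{\text{NumInsts}}$$

Because Perf = 1/Time:

$$\text{CPI} = \frac{\text{ClockSpeed}}{\text{Perf} \times \text{NumInsts}}$$

$$\frac{\text{CPI}_A}{\text{CPI}_S} = \left(\frac{\text{ClockSpeed}_A}{\text{Perf}_A \times \text{NumInsts}_A}\right) \times \left(\frac{\text{Perf}_S \times \text{NumInsts}_S}{\text{ClockSpeed}_S}\right)$$

ClockSpeed_A = 1.1
ClockSpeed_S = 0.675
Perf_A = 409
Perf_S = 443
NumInsts_S = 1.2 × NumInsts_A

$$\frac{\text{CPI}_A}{\text{CPI}_S} = \left(\frac{1.1}{409 \times \text{NumInsts}_A}\right) \times \left(\frac{443 \times 1.2 \times \text{NumInsts}_A}{0.675}\right) = 2.1 = 110\%\ \text{more}$$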


Executing the average Athlon instruction requires 110% more clock cycles than executing the average Sparc instruction.

Stated more awkwardly: executing the average Athlon instruction requires 210% of the clock cycles required to execute the average Sparc instruction.
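The CPI ratio can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the instruction count of the Athlon program is set to 1 because it cancels out of the ratio.

```python
# Values from the problem statement (clock speeds in GHz, SPECint as Perf).
clock_a, perf_a = 1.1, 409      # AMD Athlon
clock_s, perf_s = 0.675, 443    # Fujitsu SPARC64

num_insts_a = 1.0               # arbitrary: cancels out of the ratio
num_insts_s = 1.2 * num_insts_a # Sparc needs 20% more instructions


def cpi(clock_speed, perf, num_insts):
    """CPI = ClockSpeed / (Perf x NumInsts)."""
    return clock_speed / (perf * num_insts)


ratio = cpi(clock_a, perf_a, num_insts_a) / cpi(clock_s, perf_s, num_insts_s)
print(f"CPI_A / CPI_S = {ratio:.2f}")
```

Carried to more digits the ratio is slightly above the rounded 2.1 used in the notes, so the "110% more" figure is an approximation.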

Question: Can you determine the absolute (actual) CPI of either microprocessor?

Answer:

To determine the absolute CPI, we would need to know the actual number of instructions executed by at least one of the processors.

3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.)

Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add.

Additionally, you know:

Instruction   CPI            %
ADD           0.8 CPIavg     15%
MUL           1.2 CPIavg     5%
Other         1.0 CPIavg     80%

You have three options:

option 1 : no change

option 2 : add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.

option 3 : add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?


Answer:

$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$

$$\text{Perf} = \frac{\text{ClockSpeed}}{\text{NumInsts} \times \text{CPI}}$$

We need to find NumInsts, CPI, and ClockSpeed for each of the three options. Option 1 is the baseline, so we will define values for variables in Options 2 and 3 in terms of the Option 1 variables.

Options 2 and 3 will have the same number of instructions. Half of the multiply instructions are followed by an add that can be fused.

In questions that involve changing both CPI and NumInsts, it is often easiest to work with the product of CPI and NumInsts, which represents the total number of clock cycles needed to execute the program. Additionally, set the problem up with an imaginary program of 100 instructions on the baseline system.

NumMAC2 = 0.5 × NumMUL1 = 0.5 × 5 = 2.5

NumMUL2 = 0.5 × NumMUL1 = 0.5 × 5 = 2.5

NumADD2 = NumADD1 − 0.5 × NumMUL1 = 15 − 0.5 × 5 = 12.5

Find the total number of clock cycles for each option.

Cycles1 = (NumMUL1 × CPIMUL) + (NumADD1 × CPIADD) + (NumOth1 × CPIOth)
        = (5 × 1.2) + (15 × 0.8) + (80 × 1.0)
        = 98

Cycles2 = (NumMAC2 × CPIMAC) + (NumMUL2 × CPIMUL) + (NumADD2 × CPIADD) + (NumOth2 × CPIOth)
        = (2.5 × 1.2) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
        = 96

Cycles3 = (NumMAC3 × CPIMAC) + (NumMUL3 × CPIMUL) + (NumADD3 × CPIADD) + (NumOth3 × CPIOth)
        = (2.5 × (1.5 × 1.2)) + (2.5 × 1.2) + (12.5 × 0.8) + (80 × 1.0)
        = 97.5


Calculate performance for each option using the formula:

$$\text{Performance} = \frac{1}{\text{Cycles} \times \text{ClockPeriod}}$$

Performance1 = 1/(98 × 1) = 1/98

Performance2 = 1/(96 × 1.2) = 1/115

Performance3 = 1/(97.5 × 1) = 1/97.5

The third option is the fastest.
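The comparison of the three options can be replayed in Python using the same imaginary 100-instruction program. This is a sketch of the arithmetic only; the dictionary layout and names are assumptions, and CPI values are expressed as multiples of the baseline average CPI.

```python
CPI_ADD, CPI_MUL, CPI_OTHER = 0.8, 1.2, 1.0

# Instruction counts for a 100-instruction baseline program.
baseline = {"add": 15, "mul": 5, "other": 80, "mac": 0}
# With MAC: half of the multiplies fuse with a following add.
fused = {"add": 15 - 2.5, "mul": 2.5, "other": 80, "mac": 2.5}


def cycles(counts, cpi_mac):
    return (counts["add"] * CPI_ADD + counts["mul"] * CPI_MUL
            + counts["other"] * CPI_OTHER + counts["mac"] * cpi_mac)


# (total cycles, clock period relative to the baseline)
options = {
    1: (cycles(baseline, 0.0), 1.0),
    2: (cycles(fused, CPI_MUL), 1.2),        # MAC CPI same as MUL, slower clock
    3: (cycles(fused, 1.5 * CPI_MUL), 1.0),  # MAC CPI 50% greater than MUL
}

times = {opt: cyc * period for opt, (cyc, period) in options.items()}
best = min(times, key=times.get)
print(times, "-> fastest option:", best)
```

Option 2 saves the most cycles but loses more than it gains through the longer clock period, which is why option 3 comes out ahead.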

3.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months.

You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?

Answer:

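$$P(t) = \text{performance at time } t = P_0 \times 2^{t/18}$$

From the problem statement:

$$P(t) = 1.07 \times P_0$$

Equate the two equations for P(t), then solve for t.

$$1.07 \times P_0 = P_0 \times 2^{t/18}$$

$$2^{t/18} = 1.07$$

$$t/18 = \log_2 1.07$$

$$t = 18 \times (\log_2 1.07)$$

Use: $\log_b x = \frac{\log x}{\log b}$

$$t = 18 \times \left(\frac{\log 1.07}{\log 2}\right) = 1.76\ \text{months}$$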
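The same calculation in Python, written as a small helper so that other growth rates can be tried. The function name and parameters are assumptions for this sketch.

```python
import math


def allowed_slip(gain, doubling_period_months=18.0):
    """Months of delay after which a `gain` (e.g. 0.07 for 7%) is cancelled
    by a market whose performance doubles every `doubling_period_months`."""
    return doubling_period_months * math.log2(1.0 + gain)


slip = allowed_slip(0.07)
print(f"{slip:.2f} months")
```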


3.4.5 Summary of Equations

Time to perform a task:

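$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$

Average time to do one of k different tasks:

$$T_{Avg} = \sum_{i=1}^{k} (\%_i)(T_i)$$

Performance:

$$\text{Performance} = \frac{\text{Work}}{\text{Time}}$$

Speedup:

$$\text{Speedup} = \frac{T_{Slow}}{T_{Fast}}$$

T_Fast is n% faster than T_Slow:

$$n\%\ \text{faster} = \frac{T_{Slow} - T_{Fast}}{T_{Fast}}$$

Performance at time t if performance increases by a factor of k every n units of time:

$$\text{Perf}(t) = \text{Perf}(0) \times k^{t/n}$$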


3.5 Performance Analysis and Dataflow Diagrams

3.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Increasing the clock speed of a circuit might not improve its performance. In this section we will work through several example dataflow diagrams to pick a clock speed for the circuit and schedule operations into clock cycles.

When partitioning dataflow diagrams into clock cycles, we need to choose a clock period. Choosing a clock period affects many aspects of the design, not just the overall performance. Different design goals might put conflicting pressure on the clock period: some goals will tend toward short clock periods and some goals will tend toward long clock periods. For performance, not only is clock period a poor indicator of the relative performance of two different systems, even for the same system decreasing the clock period might not increase the performance.

Goal:   Minimize area
Action: decrease clock period
Effect: fewer operations per clock cycle, so fewer datapath components and more opportunities to reuse hardware

Goal:   Increase scheduling flexibility
Action: increase clock period
Effect: more flexibility in grouping operations in clock cycles

Goal:   Decrease percentage of clock cycle spent in flops (overhead: time in flops is not doing useful work)
Action: increase clock period
Effect: decreases number of flops that data traverses through

Goal:   Decrease time to execute an instruction
Action: ????
Effect: depends on dataflow diagram

Our general plan to find the clock period for maximum performance is:

1. Pick clock period to be delay through slowest component + delay through flop.

2. For each instruction, for each operation, schedule the operation in the earliest clock cycle possible without violating clock-period timing constraints.

3. Calculate average time to execute an instruction as:

Combine:

$$\text{Time} = \frac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$$

and:

$$\text{CPI}_{avg} = \sum_{i=1}^{k} \%_i \times \text{CPI}_i$$

to derive:

$$\text{Time} = \frac{\text{NumInsts} \times \left(\sum_{i=1}^{k} \%_i \times \text{CPI}_i\right)}{\text{ClockSpeed}}$$


4. If the maximum latency through dataflow diagram is greater than 1, then increase clock period by minimum amount needed to decrease latency by one clock period and return to Step 2.

5. If the maximum latency through dataflow diagram is 1, then clock period for highest performance is clock period resulting in fastest Time.

6. If possible, adjust the schedule of operations to reduce the maximum number of occurrencesof a component per instruction per clock cycle without increasing latency for any instruction.

3.5.2 Examples of Dataflow Diagrams for Two Instructions

Circuit supports two instructions, A and B (e.g. multiply and divide). At any point in time, the circuit is doing either A or B; it does not need to support doing A and B simultaneously.

The diagrams below show the flow for each instruction and the delay through the components (f, g, h, i) that the instructions use.

The delay through a register is 5ns.

Each operation (A and B) occurs 50% of the time.

Our goal is to find a clock period and dataflow diagram for the circuit that will give us the highest overall performance.

Instruction A: f (30 ns) → g (50 ns) → h (20 ns) → g (50 ns)

Instruction B: i (40 ns) → g (50 ns)


3.5.2.1 Scheduling of Operations for Different Clock Periods

55ns Clock Period

[Figure: Instr A and Instr B scheduled into 55 ns clock cycles]

75ns Clock Period

[Figure: Instr A and Instr B scheduled into 75 ns clock cycles]

85ns Clock Period

[Figure: Instr A and Instr B scheduled into 85 ns clock cycles]

95ns Clock Period

[Figure: Instr A and Instr B scheduled into 95 ns clock cycles]

155ns Clock Period

[Figure: Instr A and Instr B scheduled into a single 155 ns clock cycle]

3.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?


Answer:

Clock Period   CPI_A   CPI_B   T_avg
55ns           4       2       55 × (0.5×4 + 0.5×2) = 165
75ns           3       2       75 × (0.5×3 + 0.5×2) = 187.5
85ns           2       2       85 × (0.5×2 + 0.5×2) = 170
95ns           2       1       95 × (0.5×2 + 0.5×1) = 143   ←
155ns          1       1       155 × (0.5×1 + 0.5×1) = 155
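The table can be regenerated with a short Python sketch of Step 2 of the general plan. It assumes that each instruction is a simple chain of operations (each operation depends on the previous one), that every clock cycle loses 5 ns to the register, and that operations are packed greedily into the earliest cycle with room. The function names are illustrative.

```python
def cycles_needed(delays, clock_period, reg_delay=5):
    """Greedy schedule of a chain of operations into clock cycles."""
    budget = clock_period - reg_delay   # time left for logic in each cycle
    cycles, used = 1, 0
    for d in delays:
        if d > budget:
            raise ValueError("clock period too short for a component")
        if used + d > budget:           # does not fit: start a new cycle
            cycles += 1
            used = 0
        used += d
    return cycles


def average_time(clock_period, instructions):
    """instructions: list of (fraction, [delays]) pairs."""
    cpi_avg = sum(frac * cycles_needed(d, clock_period) for frac, d in instructions)
    return clock_period * cpi_avg


instr_a = [30, 50, 20, 50]   # f, g, h, g
instr_b = [40, 50]           # i, g
mix = [(0.5, instr_a), (0.5, instr_b)]

results = {p: average_time(p, mix) for p in (55, 75, 85, 95, 155)}
best_period = min(results, key=results.get)
for period, t in results.items():
    print(f"{period} ns: CPI_A={cycles_needed(instr_a, period)}, "
          f"CPI_B={cycles_needed(instr_b, period)}, T_avg={t} ns")
print("best clock period:", best_period, "ns")
```

For a chain, greedy packing gives the minimum number of cycles, so the CPI values match the hand-drawn schedules. Dataflow diagrams with parallel branches would need a more general scheduler.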

3.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

A       B
30ns    40ns
50ns    50ns
20ns    40ns
50ns

Answer:

[Figures: schedules of instruction A (f 30 ns, g 50 ns, h 20 ns, g 50 ns) and instruction B (i 40 ns, g 50 ns, i 40 ns) for clock periods of 55 ns, 75 ns, 85 ns, 95 ns, 105 ns, 135 ns, and 155 ns]

Should skip 105 ns, because it has same latency as 95 ns.

Clock Period   CPI_A   CPI_B   T_avg
55ns           4       3       193
75ns           3       3       225
85ns           2       3       213
95ns           2       2       190
105ns          2       2       NO GAIN
135ns          2       1       203
155ns          1       1       155

A clock period of 155 ns results in the highest performance.

For a clock period of 105 ns, we did not calculate the performance, because we could see that it would be worse than the performance with a clock period of 95 ns. The dataflow diagram with a 105 ns clock period has the same latency as the diagram with a clock period of 95 ns. If the dataflow diagram with the longer clock period has the same latency as the diagram with the shorter clock period, then the diagram with the longer clock period will have lower performance.

3.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

A       B
30ns    40ns
20ns    50ns
50ns    40ns
50ns

Answer:

Clock Period   CPI_A   CPI_B   T_avg
55ns           3       3       165ns
95ns           3       2       238ns
105ns          2       2       210ns
135ns          2       1       203ns
155ns          1       1       155ns

A clock period of 155 ns results in lowest average execution time, and hence the highest performance.

This is the same answer as the previous problem, but the total times for higher clock frequencies differ significantly between the two problems.

3.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

Instruction   Algorithm                          Frequency of Occurrence
InstP         a × b × ((a × b) + (b × d) + e)    75%
InstQ         (i + j + k + l) × m                25%

Component      Delay
2-input Mult   40ns
2-input Add    25ns
Register       5ns

NOTES

• There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)

• You must put registers on your inputs, you do not need to register your outputs.

• The environment will directly connect your outputs (its inputs) to registers.

• Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once; if you need to use a value in multiple clock cycles, you must store it in a register.

Question: What clock period will result in the best overall performance?

Answer:


Algorithm Answers (InstP)

[Figure: InstP data-dep graph. Inputs a, b, d, e; intermediate values a*b, b*d, (a*b) + (b*d), (a*b) + (b*d) + e; output (a*b)*((a*b) + (b*d) + e).]

[Figure: InstP: common subexpr elim. The product a*b is computed once and used twice.]

[Figure: InstP: alternative data dependency graph, which computes (b*d) + e before adding a*b.] Both options have critical path of 2 mults + 2 adds. First option allows three operations to be done with just three inputs (a, b, d). Second option requires all four inputs to do three operations.

[Figure: InstP: clock=45ns, lat=4, T=200]

[Figure: InstP: clock=55ns, lat=3, T=165ns]

[Figure: InstP: 70ns clock]

[Figure: InstP: illegal: 4 inputs]

[Figure: InstP: dataflow diagram with alternative data-dep graph, 70ns clock.] Adds a third clock cycle without any gain in clock speed. From diagram, it’s clear that it’s better to put a*b in first clock cycle and e in second, because a*b can be done in parallel with b*d.

Fastest option for InstP is 70ns clock, which gives a total execution time of 140 ns.


Algorithm Answers (InstQ)

[Figure: InstQ: data-dep graph with max parallelism. Inputs i, j, k, l, m.]

[Figure: InstQ: alternative data-dep graph.] Able to do two operations with three inputs, while first data-dep graph required four inputs to do two operations. We are limited to three inputs, so choose this data-dep graph for dataflow diagrams.

[Figure: InstQ: clock=50ns, lat=4, T=200ns.]

[Figures: two further InstQ schedules, uncaptioned in the source]

[Figure: InstQ: irrelevant: lat did not decrease]

[Figure: InstQ: clock=120ns, lat=1, T=120ns]

[Figure: InstQ: 70ns clock]

Fastest option for InstQ is 70ns clock, which gives a total execution time of 140 ns.

Both InstP and InstQ need a 70ns clock period to maximize their performance. So, use a 70ns clock, which gives a latency of 2 clock cycles for both instructions.

Fastest execution time   140ns
Clock period             70ns


Question: Find a minimal set of resources that will achieve the performance you calculated.

Answer:

Final dataflow graphs for InstP and InstQ

[Figure: final dataflow diagrams for InstP and InstQ, each with a 70ns clock period]

We need to do only one of InstP and InstQ at any time, so simply take the max of each resource.

              InstP   InstQ   System
Inputs        3       3       3
Outputs       1       1       1
Registers     3       3       3
Adders        2       2       2
Multipliers   2       1       2
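The "take the max of each resource" rule can be sketched in Python as follows. The dictionary keys and the function name are assumptions for this illustration; the counts are copied from the table.

```python
# Resource counts per instruction, copied from the table above.
inst_p = {"inputs": 3, "outputs": 1, "registers": 3, "adders": 2, "multipliers": 2}
inst_q = {"inputs": 3, "outputs": 1, "registers": 3, "adders": 2, "multipliers": 1}


def system_resources(*instructions):
    """Only one instruction runs at a time, so the system needs the
    maximum of each resource over all instructions, not the sum."""
    keys = set().union(*instructions)
    return {k: max(inst.get(k, 0) for inst in instructions) for k in sorted(keys)}


system = system_resources(inst_p, inst_q)
print(system)
```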


Question: Design the datapath and state machine for your design

Answer:

[Figure: dataflow diagrams for InstQ (clock=70ns, lat=2, T=140ns) and InstP (clock=70ns, lat=2, T=140ns), annotated with the allocation of input ports (i1, i2, i3), registers (r1, r2, r3), multipliers (m1, m2), adders (a1, a2), output port (o1), and states (S0, S1)]

Control Tables

            r1         r2         r3         m1           m2           a1           a2
            ce  mux    ce  mux    ce  mux    src1 src2    src1 src2    src1 src2    src1 src2
InstP S0    1   i1     1   i2     1   i3     r1   r2      r3   a1      –    –       m1   m2
InstP S1    1   a2     1   i2     1   m1     –    –       r2   r3      r1   r2      –    –
InstQ S0    1   i1     1   i2     1   i3     –    –       a1   r3      r1   r2      a1   r3
InstQ S1    1   a2     1   i2     1   i3     –    –       –    –       r1   r2      –    –

Optimize Control Table

            r1     r2     r3     m1           m2           a1           a2
            mux    mux    mux    src1 src2    src1 src2    src1 src2    src1 src2
InstP S0    i1     i2     i3     r1   r2      a1   r3      r1   r2      m1   m2
InstP S1    a2     i2     m1     r1   r2      r2   r3      r1   r2      m1   m2
InstQ S0    i1     i2     i3     r1   r2      a1   r3      r1   r2      a1   r3
InstQ S1    a2     i2     i3     r1   r2      r2   r3      r1   r2      a1   r3


Write VHDL Code

Use the optimized control table as basis for VHDL code.

    -- r1: loads i1 in state S0, otherwise a2
    process (clk) begin
      if rising_edge(clk) then
        if state=S0 then
          r1 <= i1;
        else
          r1 <= a2;
        end if;
      end if;
    end process;

    -- r2: always loads i2
    process (clk) begin
      if rising_edge(clk) then
        r2 <= i2;
      end if;
    end process;

    -- r3: loads m1 or i3
    process (clk) begin
      if rising_edge(clk) then
        if inst=instP and state=S0 then
          r3 <= m1;
        else
          r3 <= i3;
        end if;
      end if;
    end process;

    -- datapath components
    m1 <= r1 * r2;
    m2_src1 <= r2 when state=S0
               else a1;
    m2 <= m2_src1 * r3;
    a1 <= r1 + r2;
    a2 <= a2_src1 + a2_src2;

    -- input multiplexers for adder a2
    process (inst, m1, m2, a1, r3) begin
      if inst=instP then
        a2_src1 <= m1;
        a2_src2 <= m2;
      else
        a2_src1 <= a1;
        a2_src2 <= r3;
      end if;
    end process;


3.6 General Optimizations

3.6.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

3.6.1.1 Arithmetic Strength Reduction

Operation                              Implementation
Multiply by a constant power of two    wired shift logical left
Multiply by a power of two             shift logical left
Divide by a constant power of two      wired shift logical right
Divide by a power of two               shift logical right
Multiply by 3                          wired shift and addition
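The identities behind these reductions can be checked with a few lines of Python. This is only a software illustration for non-negative integers; in hardware, a shift by a constant amount costs no logic because it is just a rewiring of the bits.

```python
def times_8(x):
    """Multiply by the constant 2**3: shift left by 3."""
    return x << 3


def div_4(x):
    """Divide a non-negative integer by the constant 2**2: shift right by 2."""
    return x >> 2


def times_3(x):
    """Multiply by 3: one wired shift (x * 2) plus one addition."""
    return (x << 1) + x


for x in range(256):
    assert times_8(x) == x * 8
    assert div_4(x) == x // 4
    assert times_3(x) == x * 3
print("all strength-reduction identities hold for 0..255")
```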

3.6.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:

• is odd, is even : least significant bit

• is neg, is pos : most significant bit

• NOTE: use isodd(a) rather than a(0)

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire.

For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not do this reduction.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison:

(state = S0 or state = S3 or state = S4) = '1'

can be reduced from looking at 3 bits, to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.


3.6.2 Replication and Sharing

3.6.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before:

    z <= a + b when (w = '1')
         else a + c;

After:

    tmp <= b when (w = '1')
           else c;
    z <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

3.6.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before:

    y <= a + b + c when (w = '1')
         else d;
    z <= a + c + d when (w = '1')
         else e;

After:

    tmp <= a + c;
    y <= b + tmp when (w = '1')
         else d;
    z <= d + tmp when (w = '1')
         else e;

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the “temporary” signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.

3.6.2.3 Computation Replication

• To improve performance

– If same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware

• To reduce area

– If same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register

Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.


3.6.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
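Trimming the inputs is safe for addition because the low 12 bits of a sum depend only on the low 12 bits of the operands (carries propagate upward, never downward). A brief Python check of that claim, using randomly chosen 16-bit values; the function names are assumptions for this sketch.

```python
import random

MASK12 = (1 << 12) - 1   # keeps the lower 12 bits
MASK16 = (1 << 16) - 1


def add_then_trim(a, b):
    """16-bit addition, result trimmed to 12 bits."""
    return ((a + b) & MASK16) & MASK12


def trim_then_add(a, b):
    """Inputs trimmed to 12 bits, then a 12-bit addition."""
    return ((a & MASK12) + (b & MASK12)) & MASK12


rng = random.Random(0)   # fixed seed so the check is repeatable
for _ in range(10_000):
    a, b = rng.randrange(1 << 16), rng.randrange(1 << 16)
    assert add_then_trim(a, b) == trim_then_add(a, b)
print("trimmed-input adder matches for all sampled values")
```

The same argument holds for subtraction and multiplication, but not for operations such as division or right shifts, where upper input bits influence lower result bits.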

3.7 Retiming

[Figure: circuit before retiming, with signals state, a, b, c, sel, x, y, z and the critical path marked, and a waveform showing state stepping through S0, S1, S2, S3]

    process begin
      wait until rising_edge(clk);
      if state = S1 then
        z <= a + c;
      else
        z <= b + c;
      end if;
    end process;

Retimed Circuit and Waveform


[Figure: retimed circuit, with signals state, a, b, c, sel, x, y, z, and a waveform showing state stepping through S0, S1, S2, S3]

    process (state) begin
      if state = S1 then
        sel <= '1';
      else
        sel <= '0';
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if sel = '1' then
        ... -- code for z
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if state = then
        sel <= '1';
      else
        sel <= '0';
      end if;
    end process;

    process begin
      wait until rising_edge(clk);
      if sel = '1' then
        ... -- code for z
      end if;
    end process;


3.8 Performance Analysis and Optimization Problems

P3.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchard to the market.

Facts:

              capacity of truck   speed when loaded with apples   speed when unloaded (no apples)
big truck     12 tonnes           15kph                           38kph
small truck   6 tonnes            30kph                           70kph

distance to market   120 km
amount of apples     85 tonnes

NOTES:

1. All of the loads of apples must be carried using the same truck

2. Elapsed time is counted from beginning to deliver first load to returning to the orchard after the last load

3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.

4. For each trip, a truck travels either its fully loaded or empty speed.

Question: Which truck will take the least amount of time and what percentage faster will the truck be?

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it? If not, explain.


P3.2 Network and Router

In this question there is a network that runs a protocol called BigLan. You are designing a router called the DataChopper that routes packets over the network running BigLan (i.e. they’re BigLan packets).

The BigLan network protocol runs at a data rate of 160 Mbps (Megabits per second). Each BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data.

You are working on the DataChopper router, which has the following performance numbers:

75MHz   clock speed
4       cycles for a byte of either data or header
500     number of additional clock cycles to process the routing information for a packet

P3.2.1 Maximum Throughput

Which has a higher maximum throughput (as measured in data bits per second; that is, only the payload bits count as useful work), the network or your router, and how much faster is it?

P3.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process) assuming the header remains constant at 100 bytes.

P3.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month? This question is similar to compound growth from your economics class.

P3.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds.

The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.

A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.


P3.4.1 Average CPI

Question: What is the average CPI for the Y!v1? If you don’t have enough information to answer this question, explain what additional information you need and how you would use it.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180 MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of the Y!u2 is 30% better than that of the Y!v1.

P3.4.2 Why not you too?

Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don’t have enough information to answer this question, explain what additional information you need and how you would use it.

P3.4.3 Analysis

Question: Which of the following do you think is most likely, and why?

1. the Y!u2 is basically the same as the Y!v1 except for the multiply

2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction

3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction


Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:


• you may not increase the resource usage (input ports, registers, output ports, f components, g components)



[Figure: two dataflow diagrams, labelled "Before Optimization" and "After Optimization", built from f and g components with signals a, b, c, d, and e.]

P3.6 Performance Optimization with Memory Arrays

This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.

Algorithm

    q = M[b];
    if (a > b) then
      M[a] = b;
      p = (M[b-1] * b) + M[b];
    else
      M[a] = b;
      p = M[b+1] * a;
    end;


NOTES:

1. 25% of the time, a > b

2. The inputs of the algorithm are a and b.

3. The outputs of the algorithm are p and q.


5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).



7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.

8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).


10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p.

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time.

NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P3.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.

If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed.

                                    FPGA option    FPGA + MULT option
average CPI                         5              ???
% of instrs that are multiplies     10%            10%
CPI of multiply                     20             6
Clock speed                         200 MHz        160 MHz

P3.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?


P3.7.2 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.

Chapter 4

Functional Verification

4.1 Introduction

4.1.1 Purpose

The purpose of this chapter is to illustrate techniques to quickly and reliably detect bugs in datapath and control circuits.

Section 4.5 discusses verification of datapath circuits and introduces the notions of testbench, specification, and implementation. In section 4.6 we discuss techniques that are useful for debugging control circuits.

The verification guild website:

http://www.janick.bergeron.com/guild/default.htm

is a good source of information on functional verification.

4.2 Overview

The purpose of functional verification is to detect and correct errors that cause a system to produce erroneous results. The terminology for validation, verification, and testing differs somewhat from discipline to discipline. In this section we outline some of the terminology differences and describe the terminology used in E&CE 327. We then describe some of the reasons that chips tend to work incorrectly.


4.2.1 Terminology: Validation / Verification / Testing

functional validation
    Comparing the behaviour of a design against the customer’s expectations. In validation, the “specification” is the customer. There is no specification that can be used to evaluate the correctness of the design (implementation).

functional verification
    Comparing the behaviour of a design (e.g. RTL code) against a specification (e.g. high-level model) or collection of properties

• usually treats combinational circuitry as having zero-delay

• usually done by simulating circuit with test vectors

• big challenges are simulation speed and test generation

formal verification
    checking that a design has the correct behaviour for every possible input and internal state

• uses mathematics to reason about circuit, rather than checking individual vectors of 1s and 0s

• capacity problems: only usable on detailed models of small circuits or abstract models of large circuits

• mostly a research topic, but some practical applications have been demonstrated

• tools include model checking and theorem proving

• formal verification is not a guarantee that the circuit will work correctly

performance validation
    checking that implementation has (at least) desired performance

power validation
    checking that implementation has (at most) desired power

equivalence verification (checking)
    checking that the design generated by a synthesis tool has same behaviour as RTL code.

timing verification
    checking that all of the paths in a circuit meet the timing constraints

Hardware vs Software Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

Note: in software “testing” refers to running programs with specific inputs and checking if the program does the right thing. In hardware, “testing” usually means “manufacturing testing”, which is checking the circuits that come off of the manufacturing line.


4.2.2 The Difficulty of Designing Correct Chips

4.2.2.1 Notes from Kenn Heinrich (UW E&CE grad)

“Everyone should get a lecture on why their first industrial design won’t work in the field.”

Here are a few reasons why getting a single system to work correctly for a few minutes in a university lab is much easier than getting thousands of systems to work correctly for months at a time in dozens of countries around the world.

1. You forgot to make your “unreachable” states transition to the initial (reset) state. Clock glitches, power surges, etc. will occasionally cause your system to jump to a state that isn’t defined or produce an illegal data value. When this happens, your design should reset itself, rather than crash or generate illegal outputs.

2. You have internal registers that you can’t access or test. If you can set a register you must have some way of reading the register from outside the chip.

3. Another chip controls your chip, and the other chip is buggy. All of your external control lines should be able to be disabled, so that you can isolate the source of problems.

4. Not enough decoupling capacitors on your board. The analog world is cruel and unusual. Voltage spikes, current surges, crosstalk, etc. can all corrupt the integrity of digital signals. Trying to save a few cents on decoupling capacitors can cause headaches and significant financial costs in the future.

5. You only tested your system in the lab, not in the real world. As a product, systems will need to run for months in the field. Simulation and simple lab testing won’t catch all of the weirdness of the real world.

6. You didn’t adequately test the corner cases and boundary conditions. Every corner case is as important as the main case. Even if some weird event happens only once every six months, if you do not handle it correctly, the bug can still make your system unusable and unsellable.
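Item 1 in the list above can be sketched in VHDL. The following is a minimal illustration, not code from the course: the entity name safe_fsm, its ports, and the state encoding are all assumptions made for this example. Three of the four possible 2-bit state values are legal, and the "when others" branch sends the unused value (and any metavalue seen in simulation) back to the reset state.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: entity name, ports, and encoding are assumptions.
entity safe_fsm is
  port (
    clk, reset, go : in  std_logic;
    busy           : out std_logic
  );
end safe_fsm;

architecture main of safe_fsm is
  -- Legal states: "00" = idle, "01" = work, "10" = done.
  -- "11" is unreachable in normal operation.
  signal state : std_logic_vector(1 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= "00";
      else
        case state is
          when "00" =>                -- idle: wait for go
            if go = '1' then
              state <= "01";
            end if;
          when "01" =>                -- work
            state <= "10";
          when "10" =>                -- done
            state <= "00";
          when others =>              -- illegal state: recover to idle
            state <= "00";
        end case;
      end if;
    end if;
  end process;

  busy <= '1' when state = "01" else '0';
end main;
```

Some synthesis tools may remove logic for states they consider unreachable, so whether the recovery branch survives into hardware depends on the tool and its settings.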

4.2.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forces the design to be reworked.

Even experienced designers have difficulty building chips that function correctly on the first pass (figure 4.1).


[Figure 4.1 is a bar chart. 61% of new chip designs require at least one re-spin because of at least one error, issue, or problem. Categories shown: functional logic error, analog tuning issue, signal integrity issue, clock scheme error, reliability issue, mixed-signal problem, timing issue (slow paths), timing issue (fast paths), IR drop issues, firmware error, other problem, and uses too much power. Bar values, in descending order: 43%, 20%, 17%, 14%, 12%, 11%, 11%, 10%, 10%, 7%, 4%, 3%.]

Source: Aart de Geus, Chairman and CEO of Synopsys. Keynote address. Synopsys Users’ Group Meeting, Sep 9 2003, Boston USA.

Figure 4.1: Problems found on first-spins of new chip designs

4.3 Test Cases and Coverage

4.3.1 Test Terminology

Test case / test vector: A combination of inputs and internal state values. Represents one possible test of the system.

Boundary conditions / corner cases: A test case that represents an unusual situation on input and/or internal state signals. Corner cases are likely to contain bugs.

Test scenario: A sequence of test vectors that, together, exercise a particular situation (scenario) on a circuit. For example, a scenario for an elevator controller might include a sequence of button pushes and movements between floors.

Test suite: A collection of test vectors that are run on a circuit.


4.3.2 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip flops).

If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i+n_s}$ different cases when doing functional verification.

Question: If we have $n_c$ combinational signals, why don’t we have to test $2^{n_i+n_s+n_c}$ different cases?

Answer: The value of each combinational signal is determined by the flip flops and inputs in its fanin. Once the values of the inputs and flip flops are known, the value of each combinational signal can be calculated. Thus, the combinational signals do not add additional cases that we need to consider.

Definition Coverage: The coverage that a suite of tests achieves on a circuit is the percentage of cases that are simulated by the tests. 100% coverage means that the circuit has been simulated for all combinations of values for input signals and internal signals.
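As a small worked illustration of this definition, consider a hypothetical circuit with 8 input bits and 4 flip-flops that is simulated with 1024 distinct test cases. These numbers are made up for the example and do not come from a real design.

```latex
\begin{aligned}
\text{NumTestsTot} &= 2^{n_i + n_s} = 2^{8+4} = 4096\ \text{cases} \\
\text{Covg} &= \frac{\text{NumTestsRun}}{\text{NumTestsTot}}
             = \frac{1024}{4096} = 25\%
\end{aligned}
```

Adding a single flip-flop to the circuit doubles the number of cases, so the same 1024 tests would then achieve only 12.5% coverage.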

Note: Coverage Terminology There are many different types of coverage, which measure everything from percentage of cases that are exercised to number of output values that are exercised.

There are many different commercial software programs that measure code and other types of coverage.

Company         Tool                        Coverage
Cadence         Affirma Coverage Analyzer
Cadence         DAI Coverscan               code, expressions, fsm
Cadence         Codecover                   code, expressions, fsm
Fintronic       FinCov                      code
Summit Design   HDLScore                    code, events, variables
Synopsys        CoverMeter                  code coverage (dead?)
TransEDA        Verification Navigator      code and fsm
Verisity        SureCov                     code, block, values, fsm
Veritools       Express VCT, VeriCover      code, branch
Aldec           Riviera                     code, block


4.3.3 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits.

Consider doing the functional simulation for a double precision (64-bit) floating-point divider.

Given Information

Data width: 64 bits
Number of gates in circuit: 10 000
Number of assembly-language instructions to simulate one gate for one test case: 100
Number of clock cycles required to execute one assembly language instruction on the computer that is running the simulation: 0.5
Clock speed of computer that is running the simulation: 1 Gigahertz

Number of Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . .

Question: How many cases must be considered?

Answer:

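item   bits   num values
src1   64     $2^{64} = 1.8 \times 10^{19}$
src2   64     $2^{64} = 1.8 \times 10^{19}$

$$\begin{aligned}
\text{NumTestsTot} &= \text{NumInputCases} \times \text{NumStateCases} \\
&= (2^{64} \times 2^{64}) \times 2^{0} \\
&= 3.4 \times 10^{38}\ \text{cases}
\end{aligned}$$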


Simulation Run Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . .

Question: How long will it take to simulate all of the different possible cases using a single computer?

Answer:

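1. Calculate number of seconds to simulate one test case

$$\begin{aligned}
\text{TestTime}_{1:1} &= 10000\ \text{gates} \times 100\ \frac{\text{instrs}}{\text{gate}} \times 0.5\ \frac{\text{cycles}}{\text{instr}} \times 10^{-9}\ \frac{\text{secs}}{\text{cycle}} \\
&= 5 \times 10^{-4}\ \text{secs}
\end{aligned}$$

2. Number of tests per year

$$\begin{aligned}
\text{NumTests}_{:1} &= \frac{60\ \frac{\text{secs}}{\text{min}} \times 60\ \frac{\text{mins}}{\text{hour}} \times 24\ \frac{\text{hours}}{\text{day}} \times 365.25\ \frac{\text{days}}{\text{year}}}{\text{TestTime}_{1:1}} \\
&\approx \frac{3.16 \times 10^{7}\ \text{secs/year}}{5 \times 10^{-4}\ \text{secs/case}} \\
&\approx 6.3 \times 10^{10}\ \text{cases/year}
\end{aligned}$$

3. Number of years to test all cases

$$\begin{aligned}
\text{TestTimeTot} &= \frac{\text{NumTestsTot}}{\text{NumTests}_{:1}} \\
&= \frac{3.4 \times 10^{38}\ \text{cases}}{6.3 \times 10^{10}\ \text{cases/year}} \\
&\approx 5.4 \times 10^{27}\ \text{years}
\end{aligned}$$

Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?

Answer:

1. Number of tests per year using ten computers

$$\begin{aligned}
\text{NumTests}_{:10} &= 10 \times \text{NumTests}_{:1} \\
&= 10 \times 6.3 \times 10^{10}\ \text{cases} \\
&= 6.3 \times 10^{11}\ \text{cases}
\end{aligned}$$

2. Calculate coverage achieved by running tests on ten computers for one year

$$\begin{aligned}
\text{Covg} &= \frac{\text{NumTestsRun}}{\text{NumTestsTot}} \\
&= \frac{\text{NumTests}_{:10}}{\text{NumTestsTot}} \\
&= \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \\
&\approx 1.9 \times 10^{-27} \\
&= 0.00000000000000000000000019\%
\end{aligned}$$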

The message is that, even with large amounts of computing resources, it is difficult to achieve numerically significant coverage for realistic circuits.

An effective functional verification plan requires carefully chosen test cases, so that even the minuscule amount of coverage that is realistically achievable catches most (all?!?!) of the bugs in the design.

Simulation vs the Real World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . .

From Validating the Intel(R) Pentium(R) 4 Microprocessor by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 327 web page.)

• Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.

• By tapeout, over 200 billion simulation cycles had been run on a network of computers.

• All of these simulations represent less than two minutes of running a real processor.
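To put the last two bullets together, the time a real processor needs to execute $N$ cycles at clock frequency $f$ is $N/f$. The sketch below works backwards from the numbers quoted above; treating "less than two minutes" as 120 seconds is an assumption made for this illustration.

```latex
\begin{aligned}
t_{\text{real}} &= \frac{N}{f} < 120\ \text{secs} \\
f &> \frac{200 \times 10^{9}\ \text{cycles}}{120\ \text{secs}} \approx 1.67 \times 10^{9}\ \text{Hz}
\end{aligned}
```

In other words, the entire pre-tapeout simulation effort is equivalent to roughly two minutes of execution on a processor clocked somewhat above 1.5 GHz.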

4.4 Testbenches

A test bench (also known as a “test rig”, “test harness”, or “test jig”) is a collection of code used to simulate a circuit and check if it works correctly.

Testbenches are not synthesized. You do not need to restrict yourself to the synthesizable subset of VHDL. Use the full power of VHDL to make your testbenches concise and powerful.
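As an illustration of constructs that are useful in testbenches but are generally not synthesizable, the sketch below generates a clock with "wait for" statements and prints the simulation time with a report statement. The entity name clock_tb, the signal names, and the 10 ns period are assumptions made for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: names and clock period are assumptions.
entity clock_tb is
end clock_tb;

architecture main_tb of clock_tb is
  constant period : time := 10 ns;
  signal clk  : std_logic := '0';
  signal done : boolean   := false;
begin
  -- Clock generator: toggles clk until done becomes true.
  clock_gen : process
  begin
    while not done loop
      clk <= '0';
      wait for period / 2;
      clk <= '1';
      wait for period / 2;
    end loop;
    wait;  -- suspend forever so that the simulation can end
  end process;

  -- Stops the clock after ten periods and reports the current time.
  run : process
  begin
    wait for 10 * period;
    report "stopping clock at " & time'image(now);
    done <= true;
    wait;
  end process;
end main_tb;
```

Once both processes are suspended and no events remain, most simulators stop on their own, which avoids having to pick a run time by hand.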


4.4.1 Overview of Test Benches

[Figure: block diagram of a testbench containing stimulus, implementation, specification, and check blocks.]

Implementation   Circuit that you’re checking for bugs (also known as “design under test” or “unit under test”)

Stimulus         Generates test vectors

Specification    Describes desired behaviour of implementation

Check            Checks whether implementation obeys specification

Notes and observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . .

• Testbenches usually do not have any inputs or outputs.

– Inputs are generated by stimulus

– Outputs are analyzed by check and relevant information is printed using report statements

• Different circuits will use different stimuli, specifications, and checks.

• The roles of the specification and check are somewhat flexible.

– Most circuits will have complex specifications and simple checks.

– However, some circuits will have simple specifications and complex checks.

• If two circuits are supposed to have the same behaviour, then they can use the same stimuli, specification, and check.

• If two circuits are supposed to have the same behaviour, then one can be used as the specification for the other.

• Testbenches are restricted to stimulating only primary inputs and observing only primary outputs. To check the behaviour of internal signals, use assertions.


4.4.2 Reference Model Style Testbench

[Figure: reference model testbench, containing stimulus, implementation, and specification blocks.]

• Specification has same inputs and outputs as implementation.

• Specification is a clock-cycle accurate description of desired behaviour of implementation.

• Check is an equality test between outputs of specification and implementation.

Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

• Execution modules: output is sum, difference, product, quotient, etc. of inputs

• DSP filters

• Instruction decoders

Note: “Functional specification” vs “Reference model” Functional specification and reference model are often used interchangeably.

4.4.3 Relational Style Testbench

[Figure: relational testbench, containing stimulus, implementation, and check blocks.]

• Relational testbenches, or relational specifications, are used when we do not want to specify the specific output values that the implementation must produce.

• Instead, we want to check that some relationship holds between the output and the input, or that some relationship holds amongst the output values (independent of the values of the input signals).

• Specification is usually just wires to feed the input signals to the check.

• Check is the brains and encodes the desired behaviour of the circuit.


Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

• Carry-save adders: the two outputs are the sum of the three inputs, but do not specify exact values of each individual output.

• Arbiters: every request is eventually granted, but do not specify in which order requests are granted.

• One-hot encoding: exactly one bit of vector is a '1', but do not specify which bit is a '1'.

Note: “Relational specification” vs “relational testbench” Relational specification and relational testbench are often used interchangeably.
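The one-hot example above can be sketched as a relational check in VHDL. This is only an illustration: the entity name onehot_check, its ports, and the 4-bit width are assumptions, not part of the course code. The check counts the '1' bits and never says which bit must be set.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical relational checker: names and width are assumptions.
entity onehot_check is
  port (
    v  : in  std_logic_vector(3 downto 0);  -- output of the implementation
    ok : out boolean                        -- true when exactly one bit is '1'
  );
end onehot_check;

architecture main of onehot_check is
begin
  check : process (v)
    variable count : natural;
  begin
    -- Count the bits that are '1'; the relation does not care which bit it is.
    count := 0;
    for i in v'range loop
      if v(i) = '1' then
        count := count + 1;
      end if;
    end loop;
    ok <= (count = 1);
  end process;
end main;
```

Because the check only states a relationship among the output bits, the same checker can be reused for any implementation that is supposed to produce one-hot values of this width.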

4.4.4 Coding Structure of a Testbench

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;

4.4.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

Datapath circuits tend to be well-suited to reference-model style testbenches:

• Each set of inputs generates one set of outputs

• Each set of outputs is a function of just one set of inputs

Control circuits often pose problems for testbenches:

• Many more internal signals than outputs.

• The behaviour of the outputs provides a view into only a fragment of the current state of the circuit.

• It may take many clock cycles from when a bug is exercised inside the circuit until it generates a deviation from the correct behaviour on the outputs.

• When the deviation on the outputs is observed, it is very difficult to pinpoint the precise cause of the deviation (the root cause of the bug).

Assertions can be used to check the behaviour of internal signals. Control circuits tend to use assertions to check correctness and rely on testbenches only to stimulate inputs.


4.4.6 Verification Tips

Suggested order of simulation for functional verification.

1. Write high-level model.

2. Simulate high-level model until have correct functionality and latency.

3. Write synthesizable model.

4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.

5. Optimize the synthesizable model.

6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.

7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.

Section 4.5 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.

4.5 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.

Although the example circuit is trivial in size, the process scales well to very large circuits. The process allows verification to begin as soon as a circuit is simulatable, even before a complete specification has been written.

Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1')
       else '0';
end main;


4.5.1 A Spec-Less Testbench

(NOTE: this code has been reviewed manually but has not been simulated. The concepts are illustrated correctly, but there might be typographical errors in the code.)

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

library ieee;
use ieee.std_logic_1164.all;

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2
    port (a, b : in  std_logic;
          c    : out std_logic
    );
  end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;

Use the spec-less testbench until the implementation generates solid Boolean values (no X or U data) and you have checked that a few simple test cases generate correct outputs.
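Checking for X or U values by eye in a waveform viewer can be supplemented with a small monitor. The sketch below is an assumption-laden illustration (the entity name solid_check and its port are hypothetical); it relies on the is_x function from the std_logic_1164 package, which returns true for the metavalues 'U', 'X', 'Z', 'W', and '-'.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical monitor: entity and port names are assumptions.
entity solid_check is
  port (
    s : in std_logic  -- signal to watch, e.g. tc_impl in the testbench
  );
end solid_check;

architecture main of solid_check is
begin
  process (s)
  begin
    -- Fires whenever s takes a value that is not a solid Boolean value.
    assert not is_x(s)
      report "warning: signal is not a solid '0' or '1'"
      severity warning;
  end process;
end main;
```

The process also runs once at time zero, so a warning at the very start of simulation is expected if the watched signal has not yet been driven.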


4.5.2 Use an Array for Test Vectors

Writing code to drive inputs and repetitively typing "wait for 10 ns;" can get tedious, so code up test vectors in an array.

(NOTE: this code has not been checked for correctness)

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is
      array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
  ...
end main_tb;

Use this testbench until checking the correctness of the outputs by hand using a waveform viewer becomes difficult.


4.5.3 Build Spec into Stimulus

(NOTE: this code has not been checked for correctness)

After a few test vectors appear to be working correctly (via a manual check of waveforms in simulation), begin automatically checking that outputs are correct.

• Add expected result to stimulus

• Add check process


architecture main_tb of and2_tb is
  ...
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    type test_datum_ty is record
      ra, rb, rc : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      -- a, b: inputs
      -- c   : expected output
      --   a    b    c
      ( ( '0', '0', '0'),
        ( '0', '1', '0'),
        ( '1', '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      tc_spec <= test_vectors(i).rc;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;

Use this testbench until it becomes tedious to calculate manually the correct result for each test case.
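A variation on this style, sketched below under stated assumptions, is to let the stimulus process check each expected value itself with an assert statement, so that failures show up in the simulation log rather than only on an ok waveform. The testbench name and2_check_tb and the report messages are assumptions made for this example. The and2 entity is repeated so that the sketch compiles on its own.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (a, b : in std_logic; c : out std_logic);
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench: name and messages are assumptions.
entity and2_check_tb is
end and2_check_tb;

architecture main_tb of and2_check_tb is
  signal ta, tb, tc_impl : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  stimulus : process
    type test_datum_ty is record
      ra, rb, rc : std_logic;  -- rc is the expected output
    end record;
    type test_vectors_ty is array (natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b    c
      ( ('0', '0', '0'),
        ('0', '1', '0'),
        ('1', '0', '0'),
        ('1', '1', '1') );
  begin
    for i in test_vectors'range loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;  -- let the implementation settle before checking
      assert tc_impl = test_vectors(i).rc
        report "error: test vector " & integer'image(i) & " failed"
        severity error;
    end loop;
    report "testbench finished";
    wait;  -- suspend forever so that the simulation can end
  end process;
end main_tb;
```

The trade-off is that the expected values still have to be computed by hand, which is the limitation that a separate specification entity removes.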


4.5.4 Have Separate Specification Entity

Rather than write the specification as part of stimulus, create a separate specification entity/architecture. The specification component then calculates the expected output values.

(NOTE: if your simulation tool supports configurations, the spec and impl can share the same entity; we’ll see this in section 4.6)


entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;

architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --   a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
  ------------------------------------------
  check : process (tc_impl, tc_spec)
  begin
    ok <= (tc_impl = tc_spec);
  end process;
  ------------------------------------------
end main_tb;


4.5.5 Generate Test Vectors Automatically

When it becomes tedious to write out each test vector by hand, we can automatically compute them. This example uses a pair of nested for loops to generate all four combinations of input values for two signals.


architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    -- restrict std_logic to the two values that we want to drive
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
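For vector inputs, the same idea can be expressed by counting through integers and converting each value to a bit vector. The sketch below is an illustration under assumptions: the testbench name exhaustive_tb, the signal tv, and the 4-bit width are hypothetical, and the device under test is omitted so that the stimulus stands alone.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stimulus generator: names and width are assumptions.
entity exhaustive_tb is
end exhaustive_tb;

architecture main_tb of exhaustive_tb is
  constant width : natural := 4;  -- number of input bits
  signal tv : std_logic_vector(width - 1 downto 0);
begin
  stimulus : process
  begin
    -- Drive every one of the 2**width input combinations, one per 10 ns.
    for i in 0 to 2**width - 1 loop
      tv <= std_logic_vector(to_unsigned(i, width));
      wait for 10 ns;
    end loop;
    wait;  -- all combinations applied; suspend forever
  end process;
end main_tb;
```

As the floating-point divider example shows, exhaustive generation is only practical for small widths, since the number of vectors doubles with every additional bit.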

4.5.6 Relational Specification


architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  check : process (ta, tb, tc_impl)
  begin
    -- the output must not be '1' when either input is '0'
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;


4.6 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits.

• Control circuits have many internal signals. Testbenches are unable to access key information about the behaviour of a control circuit.

• Many clock cycles can elapse between when a bug causes an internal signal to have an incorrect value and when an output signal shows the effect of the bug.

In this section, we will explore the functional verification of state machines via a First-In First-Out queue.

The VHDL code for the queue is on the web at:

http://www.ece.uwaterloo.ca/~ece327/exs/queue

4.6.1 Overview of Queues in Hardware

[Figure 4.2: Structure of queue (write side and read side)]

[Figure 4.3: Write Sequence (empty queue, then writing A)]

[Figure 4.4: A Second Example Write (writing B after A)]

[Figure 4.5: Example Read Sequence]

[Figure 4.6: Write Illustrating Index Wrap (queue holding B to I, writing J)]

[Figure 4.7: Write Illustrating Full Queue (queue holding B to J, writing K)]

[Figure 4.8: Queue Signals (empty, mem, wr_idx, rd_idx, data_wr, data_rd, do_wr, do_rd)]

[Figure 4.9: Incomplete Queue Blocks (the signals of Figure 4.8 connected to a memory with ports WE, A0, DI0, DO0, A1, DO1)]

Control circuitry not shown.
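The index wrap illustrated in the figures can be sketched as a small function. This is a hypothetical illustration, not the code from the course web page: the package name, the function name incr, and the 8-entry depth are assumptions. The names low_idx and high_idx are borrowed from the coverage monitor code later in this section, but their values here are made up.

```vhdl
-- Hypothetical sketch: package name, function name, and depth are assumptions.
package idx_sketch_pkg is
  constant low_idx  : natural := 0;
  constant high_idx : natural := 7;
  subtype idx_ty is natural range low_idx to high_idx;
  -- Returns the index that follows i, wrapping from high_idx to low_idx.
  function incr(i : idx_ty) return idx_ty;
end idx_sketch_pkg;

package body idx_sketch_pkg is
  function incr(i : idx_ty) return idx_ty is
  begin
    if i = high_idx then
      return low_idx;   -- wrap around to the start of the memory
    else
      return i + 1;     -- normal case: move to the next entry
    end if;
  end incr;
end idx_sketch_pkg;
```

Both wr_idx and rd_idx would be advanced with the same function, which is one reason the wrap cases appear in the coverage events and assertions for the queue.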


4.6.2 VHDL Coding

4.6.2.1 Package

Things to notice in queue package:

1. separation of package and body

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- provides to_unsigned

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;

4.6.2.2 Other VHDL Coding

VHDL coding techniques to notice in queue implementation:

1. type declaration for vectors

2. attributes

(a) 'low, 'high, 'length

3. functions (reduce overall implementation and maintenance effort)

(a) reduce redundant code

(b) hide implementation details

(c) (just like software engineering....)

4.6.3 Code Structure for Verification

Verification things to notice in queue implementation:

1. instrumentation code

2. coverage monitors

3. assertions


architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk)
  begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;

4.6.4 Instrumentation Code

• Added to implementation to support verification

• Usually keeps track of previous values of signals

• Does not create hardware (optimized away during synthesis)

• Does not feed any output signals

• Must use synthesizable subset of VHDL


process (clk)
begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;

Note: Naming convention for instrumentation For assertions, signals are named prev_signame and signame, rather than next_signame and signame as is done for state machines. This is because for assertions we use the prev signals as history signals, to keep track of past events. In contrast, for state machines, we name the signals next, because the state machine computes the next values of signals.

4.6.5 Coverage Monitors

The goal of a coverage monitor is to check if a certain event is exercised in a simulation run. If a test suite does not trigger a coverage monitor, then we probably want to add a test vector that will trigger the monitor.


For example, for a circuit used in a microwave oven controller, we might want to make sure that we simulate the situation when the door is opened while the power is on.

1. Identify important events, conditions, transitions

2. Write instrumentation code to detect event

3. Use report to write when event happens

4. When the simulation runs, report statements will print when a coverage condition is detected

5. Pipe simulation results to log file

6. Examine log file and coverage monitors to find cases and transitions not tested by existing test vectors

7. Add test vectors to exercise missing cases

8. Idea: automate detection of missing cases using Perl script to find coverage messages in VHDL code that aren’t in log file

9. Real world: most commercial synthesis tools come with add-on packages that provide different types of coverage analysis

10. Research/entrepreneurial idea: based on missing coverage cases, find new test vectors to exercise case

Coverage Events for Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .

[Figure: three diagrams of the queue, each showing the previous (Prev) and current (Now) positions of the wr and rd indices.]


Question: What events should we monitor to estimate the coverage of our functional tests?

Answer:

• wr_idx and rd_idx are far apart

• wr_idx and rd_idx are equal

• wr_idx catches rd_idx

• rd_idx catches wr_idx

• rd_idx wraps

• wr_idx wraps

Coverage Monitor Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . .

process ( signals read )
begin
  if ( condition ) then
    report "coverage: message";
  elsif ( condition ) then
    report "coverage: message";
  else
    report "error: case fall through on message"
      severity warning;
  end if;
end process;

Coverage Monitor Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . .

Events related to rd_idx equals wr_idx.

process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
  if (rd_idx = wr_idx) then
    if ( prev_rd_idx = prev_wr_idx ) then
      report "coverage: read = write both moved";
    elsif ( rd_idx /= prev_rd_idx ) then
      report "coverage: Read caught write";
    elsif ( wr_idx /= prev_wr_idx ) then
      report "coverage: Write caught read";
    else
      report "error: case fall through on rd/wr catching"
        severity warning;
    end if;
  end if;
end process;

Events related to rd_idx wrapping.

process (rd_idx)
begin
  if (rd_idx = low_idx) then
    report "coverage: rd mv to low";
  elsif (rd_idx = high_idx) then
    report "coverage: rd mv to high";
  else
    report "coverage: rd mv normal";
  end if;
end process;
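To see coverage messages appear in a simulation log, the monitor needs a design to observe. The sketch below applies the same monitor pattern to a write index and drives it with a stand-in counter instead of the real queue. The entity name coverage_demo, the 3-bit index, the free-running counter, and the 100 ns run length are all assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity coverage_demo is
end entity coverage_demo;

architecture sim of coverage_demo is
  constant low_idx  : unsigned(2 downto 0) := to_unsigned(0, 3);
  constant high_idx : unsigned(2 downto 0) := to_unsigned(7, 3);
  signal clk    : std_logic := '0';
  signal wr_idx : unsigned(2 downto 0) := low_idx;
  signal done   : boolean := false;
begin
  -- Clock runs until done is set, then stops so the simulation can end.
  clk  <= not clk after 5 ns when not done else '0';
  done <= true after 100 ns;

  -- Stand-in for the design under test: an index that increments and wraps.
  process (clk)
  begin
    if rising_edge(clk) then
      wr_idx <= wr_idx + 1;  -- numeric_std "+" wraps from 7 back to 0
    end if;
  end process;

  -- Coverage monitor: reports each time wr_idx changes.
  process (wr_idx)
  begin
    if wr_idx = low_idx then
      report "coverage: wr mv to low";
    elsif wr_idx = high_idx then
      report "coverage: wr mv to high";
    else
      report "coverage: wr mv normal";
    end if;
  end process;
end architecture sim;
```

Searching the log for the three message strings then shows which of the events were exercised by the run.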

4.6.6 Assertions

Assertions for Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . .

1. If rd_idx changes, then it increments or wraps.

2. If rd_idx changes, then do_rd was '1', or reset is '1'.

3. If wr_idx changes, then it increments or wraps.

4. If wr_idx changes, then do_wr was '1', or reset is '1'.

5. And many others....

Assertion Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .

process ( signals read )
begin
  assert ( required condition )
    report "error: message" severity warning;
end process;


Assertions: Read Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . .

process (rd_idx)
begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;

Assertions: Write Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . .

process (wr_idx)
begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;

4.6.7 VHDL Coding Tips

Vector Type Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . .

type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);

Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

function to_idx(i : natural range data_array'low to data_array'high)
  return idx_ty
is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;

Conversion to Index

Without Function                 With Function
rd_idx <= to_unsigned(5, 3);     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.

Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;
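The two helper functions can be collected in a package so that every user of the queue shares one definition of the index type. This is a sketch under stated assumptions: the package name queue_idx_pkg and the constants idx_low and idx_high are hypothetical, and the constants stand in for data_array'low and data_array'high because a package cannot refer to a signal declared inside an architecture.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_idx_pkg is
  -- Assumed bounds; in the notes these come from data_array'low and data_array'high.
  constant idx_low  : natural := 0;
  constant idx_high : natural := 7;
  subtype idx_ty is unsigned(2 downto 0);

  function to_idx  (i : natural range idx_low to idx_high) return idx_ty;
  function inc_idx (idx : idx_ty) return idx_ty;
end package queue_idx_pkg;

package body queue_idx_pkg is

  -- Convert a natural to an index without repeating the vector width at call sites.
  function to_idx (i : natural range idx_low to idx_high) return idx_ty is
  begin
    return to_unsigned(i, idx_ty'length);
  end function to_idx;

  -- Increment an index, wrapping from the highest position to the lowest.
  function inc_idx (idx : idx_ty) return idx_ty is
  begin
    if idx < idx_high then
      return idx + 1;
    else
      return to_idx(idx_low);
    end if;
  end function inc_idx;

end package body queue_idx_pkg;
```

If the queue depth changes, only the two constants and the subtype need to be edited; the call sites stay as they are.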

Feedback Loops, and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .

Coding guideline: use functions. Don’t use procedures.

inc as fun                      inc as proc
wr_idx <= inc_idx(wr_idx);      inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine modes of signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . .

TEXTIO defines the read, write, readline, and writeline procedures.

Described in:
• http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These procedures can be used to read test vectors from a file and write results to a file.
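The following is a minimal sketch of reading vectors from one file and writing results to another with std.textio. The entity name textio_demo and the file names vectors.txt and results.txt are assumptions, as is the file format (two integers per line). The "result" here is just the sum of the two integers, standing in for whatever the testbench would actually record.

```vhdl
use std.textio.all;

entity textio_demo is
end entity textio_demo;

architecture sim of textio_demo is
begin
  process
    -- Hypothetical file names; vectors.txt must exist when the simulation starts.
    file vec_file : text open read_mode  is "vectors.txt";
    file log_file : text open write_mode is "results.txt";
    variable in_line  : line;
    variable out_line : line;
    variable a, b     : integer;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, in_line);     -- fetch one line of text from the file
      read(in_line, a);                -- parse the first integer on the line
      read(in_line, b);                -- parse the second integer on the line
      write(out_line, a + b);          -- build the output line
      writeline(log_file, out_line);   -- append the line to the results file
    end loop;
    wait;                              -- stop the process once the file is consumed
  end process;
end architecture sim;
```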

4.6.8 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.

Specification should be “obviously correct”. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.


Write Index Update in Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

We increment the write index on every write; we never wrap.

process (clk)
begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;

Things to Notice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . .

Things to notice in queue specification:

1. don’t care conditions ('-')

2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don’t Care . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .

rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');

4.6.9 Queue Testbench

Things to notice in queue testbench:

1. running multiple test sequences

2. uninitialized data 'U'

3. std_match to compare spec and impl data

   0 ∼ 0
   0 ∼ L
   1 ∼ 1
   1 ∼ H
   - ∼ everything
   everything else ≁ everything

With equality, '-' ≠ '1', but we want to use '-' to mean “don’t care” in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
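The difference between = and std_match can be checked directly in simulation. In this sketch the entity name std_match_demo and the two 4-bit values are assumptions chosen for illustration; std_match itself is declared in ieee.numeric_std.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity std_match_demo is
end entity std_match_demo;

architecture sim of std_match_demo is
  -- Hypothetical values: the spec leaves bits 2 and 0 as don't care.
  signal spec_data : std_logic_vector(3 downto 0) := "1-0-";
  signal impl_data : std_logic_vector(3 downto 0) := "1100";
begin
  process
  begin
    wait for 1 ns;

    -- Plain equality treats '-' as an ordinary value, so the vectors differ.
    assert impl_data /= spec_data
      report "unexpected: = treated '-' as a wildcard" severity failure;

    -- std_match treats '-' as matching anything, so the vectors match.
    assert std_match(impl_data, spec_data)
      report "error: impl does not match spec" severity failure;

    report "std_match accepted impl_data against spec_data";
    wait;
  end process;
end architecture sim;
```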

Stimulus Process Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . .

The stimulus process runs multiple test vectors in a single simulation run.

stimulus : process
  type test_datum_ty is
    record
      r_reset, ... normal fields ...

    ( -- reset ... other signal ...
      ( '1', normal fields),   -- test case 1
      ( '0', normal fields),
      ...
      ( '1', normal fields),   -- test case 2
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);

After reset is asserted, set signals to 'U'.
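A complete, compilable version of this structure might look like the sketch below. Everything beyond the names stimulus, test_datum_ty, test_vectors, r_reset, reset, and clk is an assumption: the two "normal fields" are taken to be do_wr and do_rd, the array type test_vector_ty is hypothetical, and the design under test is omitted so that the testbench stands alone.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity stimulus_demo is
end entity stimulus_demo;

architecture sim of stimulus_demo is
  signal clk   : std_logic := '0';
  signal reset : std_logic := '0';
  signal do_wr : std_logic := '0';   -- assumed "normal field"
  signal do_rd : std_logic := '0';   -- assumed "normal field"
  signal done  : boolean   := false;
begin
  -- Clock runs until the stimulus process has applied every vector.
  clk <= not clk after 5 ns when not done else '0';

  stimulus : process
    type test_datum_ty is record
      r_reset : std_logic;
      r_do_wr : std_logic;
      r_do_rd : std_logic;
    end record;
    type test_vector_ty is array (natural range <>) of test_datum_ty;
    constant test_vectors : test_vector_ty := (
      -- reset, do_wr, do_rd
      ('1', '0', '0'),   -- test case 1
      ('0', '1', '0'),
      ('0', '0', '1'),
      ('1', '0', '0'),   -- test case 2
      ('0', '1', '1')
    );
  begin
    for i in test_vectors'range loop
      if test_vectors(i).r_reset = '1' then
        -- A vector with r_reset = '1' marks the start of a new test case.
        reset <= '1';
        do_wr <= 'U';    -- inputs are left uninitialized during reset
        do_rd <= 'U';
        wait until rising_edge(clk);
      end if;
      reset <= '0';
      do_wr <= test_vectors(i).r_do_wr;
      do_rd <= test_vectors(i).r_do_rd;
      wait until rising_edge(clk);
    end loop;
    done <= true;
    wait;
  end process stimulus;
end architecture sim;
```

Adding a test case is then a matter of appending rows to test_vectors, with a leading row whose reset field is '1'.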

4.7 Example: Microwave Oven

This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.

INSTRUCTIONS:

1. Assume that the code as currently written is correct: any change to the code that causes a change to the behaviour of the signals heat or count is a bug.

2. For each of the two proposed code changes, answer whether the code change will cause a bug.

3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.

4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.

Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

Answer: Testbench: All relevant signals are primary inputs or outputs, so we can check the property without seeing internal signals. Testbenches are only able to set and observe primary inputs and outputs.

prop2 If the door is open, then heat is off.

Answer: Testbench: same as previous property.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Answer: Assertion: To see count, need access to internal signals.

entity microwave is
  port (
    timer              -- time input from user
      : in unsigned(7 downto 0);
    reset,             -- resets microwave
    clk,               -- clock signal input
    is_open,           -- detects when door is open
    start              -- start button input from user
      : in std_logic;
    heat : out std_logic   -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count  : unsigned(7 downto 0);  -- internal time count
  signal x_heat : std_logic;
begin

  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and   -- region of
            (timer > 0)                             -- change #2
      then                                          --
        x_heat <= '1';                              --
      elsif (is_open = '0') and (count > 0) then    --
        x_heat <= x_heat;                           --
      else
        x_heat <= '0';
      end if;
    end if;
  end process;

  -- count process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      elsif (start = '1') then      -- region of
        count <= timer;             -- change #1
      elsif (count > 0) then        --
        count <= count - 1;         --
      end if;
    end if;
  end process;

  heat <= x_heat;

end main;

Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

prop1 If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.

prop2 If the door is open, then heat is off.

prop3 If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Change #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .


From:

elsif (start = '1') then
  count <= timer;
elsif (count > 0) then
  count <= count - 1;

To:

elsif (count > 0) then
  count <= count - 1;
elsif (start = '1') then
  count <= timer;

Answer:

The change introduces a bug that is caught by properties 1 and 3.

Test Cases

testcase1 Maintain reset=0. Close door, set timer to some value (v1) and then push start. Leave the door closed. While the microwave is on, set timer to a value (v2) that is greater than v1 and then push start. In the old code, the new value on the timer will be read in. In the new code, the new value on the timer will be ignored. The reason to make v2 greater than v1 is to prevent the counter from being exactly equal to v2 when start is pushed a second time. In that case, the bug would not be exercised. Note, the old code violated prop1.

testcase2 reset = 0, microwave off, door closed, count = 0. Set timer to a non-zero value. Press and hold start for a number of cycles. In the original code, the value of timer would be reloaded into count on each rising edge of the clock. With the change, the value of count continues to decrement and the timer is not reloaded into count. Note: in this case, only prop1 will detect the bug. Prop3 will not detect the bug because the antecedent, or precondition, for the property is false.

Change #2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . .

From:

elsif (is_open = '0') and (start = '1') and (timer > 0)
then x_heat <= '1';
elsif (is_open = '0') and (count > 0)
then x_heat <= x_heat;

To:

elsif (is_open = '0')
      and ((start = '1') or (count > 0))
then x_heat <= '1';
else x_heat <= '0';


Answer:

The change introduces a bug that would be caught by prop1, but not by prop2 or prop3.

The following scenario or test case will catch the bug with prop1. Maintain reset=0. Microwave is off, door is closed, timer is set to 0. Push start. With old code, microwave will remain off. With new code, microwave will turn on and remain on as long as start is pushed.

The change to code exercises another bug that is not caught by prop1. This bug demonstrates a weakness in prop1 that should be remedied.

Testcase: reset = 0, microwave off, door closed. Set timer to a non-zero value. Press (and release) start. Before timer expires, open door. Close door before count = 0. In the original code, the microwave will remain off, but with the change, the microwave will start again. Note: the same properties detect the bug as with the original solution.

The weakness in prop1 is that it assumes that door remains closed. So, any testcase where the door is opened will pass prop1. In verification, this is known as the “false implies anything problem”, or a testcase that passes a property “vacuously”.

To catch this bug, we must either change prop1 or add another property. In fact, we probably should do both.

First we strengthen prop1 to deal with situations where the door is opened while the microwave is on. The property gets a bit complicated: “If start is pushed and the door is closed, then heat remains on until the earlier of either opening of the door or the expiration of the time specified by the timer when start was pushed, assuming reset remains false.”

Second, we add a property to ensure that the microwave does not turn back on when the door is re-closed with time remaining on the counter: “If the microwave is off, it remains off until start is pushed.” This fourth property is written to be as general as possible. We want to write properties that catch as many bugs as possible, rather than write properties for specific testcases or bugs.

Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . .

Question: If msb of src1 is '1' and lsb of src2 is '0' or sum(3) is '1', then result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?


4.8 Functional Verification Problems

P4.1 Carry Save Adder

1. Functionality: Briefly describe the functionality of a carry-save adder.

2. Testbench: Write a testbench for a 16-bit combinational carry save adder.

3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.

NOTES:

(a) You do not need to support pipelined adders.

(b) VHDL “generics” might be useful.

P4.2 Traffic Light Controller

P4.2.1 Functionality

Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.

P4.2.2 Boundary Conditions

Make a list of boundary conditions to check for your traffic light controller.

P4.2.3 Assertions

Make a list of assertions to check for your traffic light controller.

P4.3 State Machines and Verification

P4.3.1 Three Different State Machines

Figure 4.10: A very simple machine (states s0 to s3)

Figure 4.11: A very big machine (states s0 to s9)

Figure 4.12: A concurrent machine (states s0 to s2 and q0 to q4)

Figure 4.13: Legend (transitions are labelled input/output; * = don’t care)

Answer each of the following questions for the three state machines in figures 4.10–4.12.

Number of Test Scenarios How many “test scenarios” (sequences of test vectors) would you need to fully validate the behaviour of the state machine?

Length of Test Scenario What is the maximum length (number of test vectors) in a test scenario for the state machine?

Number of Flip Flops Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?

P4.3.2 State Machines in General

If a circuit has i signals of 1-bit each that are inputs, f 1-bit signals that are outputs of flip-flops and c 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?

P4.4 Test Plan Creation

You’re on the functional verification team for a chip that will control a simple portable CD-player. Your task is to create a plan for the functional verification for the signals in the entity cd_digital.

You’ve been told that the player behaves “just like all of the other CD players out there”. If your test plan requires knowledge about any potential non-standard features or behaviour, you’ll need to document your assumptions.

[Figure: CD player front panel with pwr, prev, next, stop, and play buttons and a display showing track, min, and sec.]

entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;

P4.4.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.

P4.4.2 Corner Cases

Describe five corner cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional verification.

NOTES:

1. You may reference your answer for problem P4.4.1 in this question.

2. If you do not know what a “corner case” or “boundary condition” is, you may earn partial credit by checking this box and explaining five things that you would do in functional verification.

P4.5 Sketches of Problems

1. Given a circuit, VHDL code, or circuit size info, calculate the simulation run time to achieve n% coverage.

2. Given a fragment of VHDL code, list things to do to make it more robust — e.g. illegal dataand states go to initial state.

3. Smith Problem 13.29

Chapter 5

Timing Analysis

5.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

5.1.1 Background Definitions

Definition fanin: The fanin of a gate or signal x are all of the gates or signals y where an input of x is connected to an output of y.

Definition fanout: The fanout of a gate or signal x are all of the gates or signals y where an output of x is connected to an input of y.

Figure 5.1: Immediate Fanin of x

Figure 5.2: Immediate Fanout of x

Definition immediate fanin/fanout: The phrases immediate fanout and immediate fanin mean that there is a direct connection between the gates.

Figure 5.3: Transitive Fanin

Figure 5.4: Transitive Fanout

Definition transitive fanin/fanout: The phrases transitive fanout and transitive fanin mean that there is either a direct or indirect connection between the gates.

Note: “Immediate” vs “Transitive” fanin and fanout. Be careful to distinguish between immediate fan(in/out) and transitive fan(in/out). If “fanin” or “fanout” is not qualified with “immediate” or “transitive”, be sure to check which of the two is meant. In E&CE 327, “fan(in/out)” will mean “immediate fan(in/out)”.

5.1.2 Clock-Related Timing Definitions

5.1.2.1 Clock Skew

[Figure: circuit and waveforms showing the skew between clk1, clk2, clk3, and clk4.]

Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.

Clock skew is caused by the difference in interconnect delays to different points on the chip.

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.

5.1.2.2 Clock Latency

[Figure: clock tree and waveforms showing the latency between the master clock, intermediate clock, and final clock.]

Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively “different points in the clock generation circuitry.”)

Note: Clock latency. Clock latency does not affect the limit on the minimum clock period.

5.1.2.3 Clock Jitter

[Figure: waveforms comparing an ideal clock with a clock with jitter.]

Definition Clock Jitter: Difference between actual clock period and ideal clock period.

Clock jitter is caused by:


• temperature and voltage variations over time

• temperature and voltage variations across different locations on a chip

• manufacturing variations between different parts

5.1.3 Storage-Related Timing Definitions

Storage devices (latches, flip-flops, memory arrays, etc) define setup, hold and clock-to-Q times.

5.1.3.1 Flops and Latches

[Figure: waveforms of d, clk, and q illustrating Flop Behaviour and Latch Behaviour.]

Storage devices have two modes: load mode and store mode.

Flops are edge sensitive: either rising edge or falling edge. An ideal flop is in load mode only for the instant just before the clock edge. In reality, flops are in load mode for a small window on either side of the edge.

Latches are level sensitive: either active high or active low. A latch is in load mode when its enable signal is at the active level.
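The difference between the two devices shows up directly in how they are described in VHDL. The sketch below models a rising-edge flop and an active-high latch side by side; the entity name flop_vs_latch and the port names q_flop and q_latch are assumptions made for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity flop_vs_latch is
  port (
    clk, d  : in  std_logic;
    q_flop  : out std_logic;
    q_latch : out std_logic
  );
end entity flop_vs_latch;

architecture main of flop_vs_latch is
begin
  -- Edge sensitive: d is sampled only at the rising edge of clk.
  flop : process (clk)
  begin
    if rising_edge(clk) then
      q_flop <= d;
    end if;
  end process flop;

  -- Level sensitive: while clk is '1' (load mode), q_latch follows d;
  -- while clk is '0' (store mode), q_latch keeps its last value.
  latch : process (clk, d)
  begin
    if clk = '1' then
      q_latch <= d;
    end if;
  end process latch;
end architecture main;
```

Note that the latch process must include d in its sensitivity list, because changes on d pass through to the output for the whole time the latch is in load mode.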

Timing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . .

[Figure: waveforms of d, clk, and q marking the Setup, Hold, and Clock-to-Q times for a flip-flop, an active-high latch, and an active-low latch.]

Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly. Setup defines the beginning of the window. Hold defines the end of the window. Setup and hold timing constraints ensure that, when the storage device transitions from load mode to store mode, the input data is stored correctly in the storage device. Thus, the setup and hold timing constraints come into play when the storage device transitions from load mode to store mode. Setup is assumed to happen before the clock edge and hold is assumed to happen after the edge. If the end of the time window constraint occurs before the clock edge, then the hold constraint is negative.

Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.

Note: Require / Guarantee. Setup and hold times are requirements that the storage device imposes upon its environment. Clock-to-Q is a guarantee that the storage device provides its environment. If the environment satisfies the setup and hold times, then the storage device guarantees that it will satisfy the clock-to-Q time.

In this section, we will use the definitions of setup, hold and clock-to-Q. Section 5.2 will show how to calculate setup, hold, and clock-to-Q times for flip flops, latches, and other storage devices.

5.1.3.2 Timing Parameters for a Flop

Setup Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . .

Definition Setup Time (T_SUD): Latest time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to be stable in order for storage device to work correctly.

If setup time is violated, current input data will not be stored; input data from the previous clock cycle might remain stored.

5.1.3.3 Hold Time

Definition Hold Time (T_HO): Latest time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable in order for storage device to work correctly.

If hold time is violated, current input data will not be stored; input data from the next clock cycle might slip through and be stored.

5.1.3.4 Clock-to-Q Time

Definition Clock-to-Q Time (T_CO): Earliest time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to be stable.


Review: Timing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . .

Setup: Time before arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to start being stable

Hold: Time after arrival of clock edge (flip flop), or deasserting of enable line (latch), that input data is required to remain stable

Clock-to-Q: Time after arrival of clock edge (flip flop), or asserting of enable line (latch), when output data is guaranteed to start being stable

5.1.4 Propagation Delays

Propagation delay is the time it takes a signal to travel from the source (driving) flop to the destination flop. The two factors that contribute to propagation delay are the load of the combinational gates between the flops and the delay along the interconnect (wires) between the gates.

5.1.4.1 Load Delays

Load delay is proportional to load capacitance.

Timing of a simple inverter with a load.

[Figure: inverter schematic with input Vi and output Vo driving a load capacitance. Input 1→0: charge output cap. Input 0→1: discharge output cap.]

Load capacitance is dependent on the fanout (how many other gates a gate drives) and how big the other gates are.

Section 5.4.2 goes into more detail on timing models and equations for load delay.

5.1.4.2 Interconnect Delays

Wires, also known as interconnect, have resistance, and there is a capacitance between a wire and both the substrate and parallel wires. Both the resistance and capacitance of wires increase delay.

• Wire resistance is dependent upon the material and geometry of the wire.

• Wire capacitance is dependent on wire geometry, geometry of neighboring wires, and materials.

• Shorter wires are faster.

• Fatter wires are faster.

• FPGAs have special routing resources for long wires.

• CMOS processes use higher metal layers for long wires; these layers have wires with much larger cross sections than lower levels of metal.

More on this in section 5.4.

5.1.5 Summary of Delay Factors

Name          Symbol  Definition
Skew                  Difference in arrival times for different clock signals
Jitter                Difference in clock period over time
Clock-to-Q    T_CO    Delay from clock signal to Q output of flop
Setup         T_SUD   Length of time prior to clock/enable that data must be stable
Hold          T_HO    Length of time after clock/enable that data must be stable
Load                  Delay due to load (fanout/consumers/readers)
Interconnect          Delay along wire

Table 5.1: Summary of delay factors

5.1.6 Timing Constraints

For a circuit to operate correctly, the clock period must be longer than the sum of the delays shown in table 5.1.

Definition Margin: The difference between the required value of a timing parameter and the actual value. A negative margin means that there is a timing violation. A margin of zero means that the timing parameter is just satisfied: changing the timing of the signals (which would affect the actual value of the parameter) could violate the timing parameter. A positive margin means that the constraint for the timing parameter is more than satisfied: the timing of the signals could be changed at least a little bit without violating the timing parameter.

Note: “Margin” is often called “slack”. Both terms are used commonly.


5.1.6.1 Minimum Clock Period

[Figure: flops a and b clocked by clk1 and clk2, with waveforms dividing the clock period into skew, jitter, clock-to-Q, interconnect + load (propagation), setup, and slack.]

\[ \text{ClockPeriod} > \text{Skew} + \text{Jitter} + T_{CO} + \text{Interconnect} + \text{Load} + T_{SUD} \]

Note: The minimum clock period is independent of hold time.
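As a worked illustration of the minimum clock period and the resulting margin, the arithmetic below uses delay values that are assumed purely for the example (skew 0.2 ns, jitter 0.1 ns, clock-to-Q 0.5 ns, interconnect 1.0 ns, load 2.0 ns, setup 0.4 ns, and a 5.0 ns clock period); none of these numbers come from the notes.

```latex
\begin{align*}
T_{\min} &= \text{Skew} + \text{Jitter} + T_{CO} + \text{Interconnect} + \text{Load} + T_{SUD} \\
         &= 0.2 + 0.1 + 0.5 + 1.0 + 2.0 + 0.4 = 4.2\ \text{ns} \\
\text{Margin} &= \text{ClockPeriod} - T_{\min} = 5.0 - 4.2 = 0.8\ \text{ns}
\end{align*}
```

With these assumed values the margin is positive, so the constraint is satisfied. A 4.0 ns clock period would give a margin of -0.2 ns, which is a timing violation.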

5.1.6.2 Hold Constraint

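[Figure: waveforms of clk1, clk2, a, and b over one clock period, illustrating the hold constraint.]

\[ \text{Skew} + \text{Jitter} + T_{HO} \le T_{CO} + \text{Interconnect} + \text{Load} \]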
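The hold constraint can be checked with the same kind of arithmetic. The values below are assumptions chosen for illustration (skew 0.2 ns, jitter 0.1 ns, hold 0.3 ns, clock-to-Q 0.5 ns, and a short path with 0.1 ns of interconnect delay and 0.3 ns of load delay).

```latex
\begin{align*}
\text{Skew} + \text{Jitter} + T_{HO} &= 0.2 + 0.1 + 0.3 = 0.6\ \text{ns} \\
T_{CO} + \text{Interconnect} + \text{Load} &= 0.5 + 0.1 + 0.3 = 0.9\ \text{ns} \\
\text{Margin} &= 0.9 - 0.6 = 0.3\ \text{ns}
\end{align*}
```

Here the margin is positive, so the hold constraint holds. Note that the clock period does not appear anywhere in this check, so slowing the clock down cannot repair a hold violation.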

5.1.6.3 Example Timing Violations

The figures below illustrate correct timing behaviour of a circuit and then two types of violations: setup violation and hold violation. In the figures, the black rectangles identify the point where the violation happens.


Figure 5.5: Good Timing

Figure 5.6: Setup Violation

Figure 5.7: Hold Violation

5.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

5.2.1 Simple Multiplexer Latch

We begin our study of timing analysis for storage devices with a simple latch built from an inverter ring and multiplexer. There are many better ways to build latches, primarily by doing the design at the transistor level. However, the simplicity of this design makes it ideal for illustrating timing analysis.

5.2.1.1 Structure and Behaviour of Multiplexer Latch

Two modes for storage devices:

• loading data:

  – loads input data into storage circuitry

  – input data passes through to output

• using stored data:

  – input signal is disconnected from output

  – storage circuitry drives output

[Figure: three views of the latch with input i and output o: the schematic with clk, loading / pass-through mode (clk = '1'), and storage mode (clk = '0').]

Unfold Multiplexer to Simple Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..

[Figure: Multiplexer: symbol and implementation (inputs a and b, select sel, output o).]

[Figure: Latch implementation (inputs d and clk, output o).]

Note: inverters on clk. Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there was only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.

[Figure: four annotated copies of the latch implementation: Loading '0' (d = '0', clk = '1'), Loading '1' (d = '1', clk = '1'), Storing '0' (clk = '0', o = '0'), and Storing '1' (clk = '0', o = '1').]

5.2.1.2 Strategy for Timing Analysis of Storage Devices

The key to calculating setup and hold times of a latch, flop, etc is to identify:

1. how the data is stored when not connected to the input (often a pair of inverters in a loop)

2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmissiongate or multiplexor)

3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gateor multiplexor)

[Figure: the latch with clk = '0' and with clk = '1', highlighting the storage loop and the gates that the clock controls.]

Note: Storage devices vs. Signals. We can talk about the setup and hold time of a signal or of a storage device. For a storage device, the setup and hold times are requirements that it imposes upon all environments in which it operates. For an individual signal in a circuit, there is a setup and hold time, which is the amount of time that the signal is stable before and after a clock edge.


5.2.1.3 Clock-to-Q Time of a Multiplexer Latch

Figure 5.8: Latch for Clock-to-Q analysis (signals clk, d, l1, l2, qn, q, s1, s2, cn, c2)

Figure 5.9: Waveforms of latch showing Clock-to-Q timing

Assume that the input is stable, and then the clock signal transitions to cause the circuit to move from storage mode to load mode.

Calculate the clock-to-Q time by finding the delay of the critical path from where the clock signal enters the storage circuit to where q exits the storage circuit.

The path is: clk → cn → c2 → l2 → qn → q, which has a delay of 5 (assuming each gate has a delay of exactly one time unit).


5.2.1.4 Setup Timing of a Multiplexer Latch

The storage device transitions from load mode to store mode. Setup is the time that the input must be stable before the clock changes.

Figure 5.10: Latch for Setup Analysis

Figure 5.11: Setup with margin: goal is to store α

Step-by-step animation of latch transitioning from load to store mode.

[Figure: animation frames of the latch, in time order]

• Circuit is stable in load mode

• t=0: Clk transitions from load to store

• t=2: s1 propagates to s2, because cn turns on AND gate

• t=3: l2 is set to 0, because c2 turns off AND gate

• t=4: α from store path propagates to q

• t=5: α from store path completes cycle

The value on s1 at t=1 will propagate from the store loop to the output and back through the store loop. At t=1, s1 must have the value that we want to store. Or, equivalently, the value to store must have saturated the store loop by t=1. It takes 5 time units for a value on the input d to propagate to s1 (d → l1 → l2 → qn → q → s1).

The setup time is the difference in the delay from d to s1 and the delay from clk to cn: 5 − 1 = 4, so the setup time for this latch is 4 time units.
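The clock-to-Q and setup calculations above both reduce to adding gate delays along a path. The following is a minimal Python sketch of that arithmetic, assuming (as in the text) that every gate has a delay of exactly one time unit. The helper name `path_delay` is illustrative only and does not come from the notes.

```python
def path_delay(path, gate_delay=1):
    """Delay along a path given as a list of signal names.

    Each hop from one signal to the next passes through one gate,
    so the delay is (number of signals - 1) * gate_delay.
    """
    return (len(path) - 1) * gate_delay


# Clock-to-Q: from the clock input to the output q.
clock_to_q = path_delay(["clk", "cn", "c2", "l2", "qn", "q"])

# Setup: the new data must reach s1 before cn enables the store path.
d_to_s1 = path_delay(["d", "l1", "l2", "qn", "q", "s1"])
clk_to_cn = path_delay(["clk", "cn"])
setup = d_to_s1 - clk_to_cn

print(clock_to_q, setup)  # 5 4
```

With a different per-gate delay, only the `gate_delay` argument changes; the paths stay the same.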


[Figure: waveforms for setup with negative margin]



Step-by-step animation of latch transitioning from load to store mode with a setup violation, where α arrives 1 time-unit before the clock edge.

[Figure: animation frames of the latch, in time order]

• Circuit is stable in load mode with ω

• t=-1: D transitions from ω to α

• t=0: α propagates through inverter; Clk transitions from load to store

• t=1: α propagates through AND; Clk propagates through inverter

• t=2: old ω propagates through AND

• t=4: ω/α from store path propagates to q

• t=5: ω/α from store path completes cycle

• t=5: Illustrate instability with ω=0, α=1

Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.

[Figure: waveforms for the setup violation, with a time axis from -3 to 6]


We now repeat the analysis of setup violation, but illustrate the minimum violation (input transitions from ω to α 3 time-units before the clock edge).

[Figure: animation frames of the latch, in time order]

• Circuit is stable in load mode with ω

• t=-3: D transitions from ω to α

• t=-2: α propagates through inverter

• t=-1: α propagates through AND

• t=1: Clk propagates through inverter

• t=2: old ω propagates through AND

• t=4: ω/α from store path propagates to q

• t=5: ω/α from store path completes cycle

• t=5: Illustrate instability with ω=0, α=1

Trouble: inconsistent values on load path and store path. Old value (ω) still in store path when store path is enabled.

[Figure: waveforms for the setup violation, with a time axis from -3 to 6]

Figure 5.14: Minimum Setup Time

When cn is asserted, α must be at s1. Otherwise, ω will affect the storage circuitry when the data input is disconnected.


5.2.1.5 Hold Time of a Multiplexer Latch

Figure 5.15: Latch for Hold Analysis

Figure 5.16: Hold OK: goal is to store α

Figure 5.17: Animation of hold analysis. Frames, in time order:

• Circuit is stable in load mode

• t=5: Clk transition propagates to cn

• t=6: Clk transition propagates to c2; l1 may change now without affecting storage device

• t=7: Clk transition propagates to l2

It takes 6 time units for a change on the clock signal to propagate to the input of the AND gate that controls the load path. It takes 1 time unit for a change on d to propagate to its input to this AND gate. The data input must remain stable for 6 − 1 = 5 time units after the clock transitions from load to store mode, or else the new data value (e.g., β) will slip into the storage loop and corrupt the value α that we are trying to store.


Figure 5.18: Hold violation: β slips through to q (waveforms for hold with negative margin)

Figure 5.19: Minimum Hold Time

Can't let β affect l1 before c2 deasserts.

Hold time is the difference between the path from clk to c2 and the path from d to l1.


5.2.1.6 Example of a Bad Latch

This latch is very similar to the one from section 5.2.1.5; however, this one does not work correctly. The difference between this latch and the one from section 5.2.1.5 is the location of the inverter that determines whether l2 or s2 is enabled. When the clock signal is deasserted, c2 turns off the AND gate l2 before the AND gate s2 turns on. In this interval when both l2 and s2 are turned off, a glitch is allowed to enter the feedback loop.

The glitch on the feedback loop is independent of the timing of the signals d and clk.

[Figure: schematic and waveforms of the bad latch]

5.2.2 Timing Analysis of Transmission-Gate Latch

The latch that we now examine is more realistic than the simple multiplexer-based latch. We replace the multiplexer with a transmission gate.


5.2.2.1 Structure and Behaviour of a Transmission Gate

[Figure: transmission gate. Panels: Symbol; Implementation; Open; Closed; Transmit '1'; Transmit '0'; Transmission gate as switch]

5.2.2.2 Structure and Behaviour of Transmission-Gate Latch

(Smith 2.5.1)

[Figure: transmission-gate latch (inputs d and clk, output q). Panels: Loading data into latch; Using stored data from latch]


5.2.2.3 Clock-to-Q Delay for Transmission-Gate Latch

[Figure: transmission-gate latch schematic (inputs d and clk, output q)]

5.2.2.4 Setup and Hold Times for Transmission-Gate Latch

[Figure: Setup time for latch, with path1 and path2 marked]

Setup time = path1 − path2

[Figure: Hold time for latch, with path1 and path2 marked]

Hold time = path1 − path2

5.2.3 Falling Edge Flip Flop

(Smith 2.5.2)

We combine two active-high latches to create a falling-edge, master-slave flip flop. The analysis of the master-slave flip-flop illustrates how to do timing analysis for hierarchical storage devices. Here, we use the timing information for the active-high latch to compute the timing information of the flip-flop. We do not need to know the primitive structure of the latch in order to derive the timing information for the flip flop.


5.2.3.1 Structure and Behaviour of Flip-Flop

[Figure: master-slave flip-flop built from two latches with enable inputs (EN); signals d, m, q, clk, clk_b. Waveforms are annotated with the latch clock-to-Q, the latch setup, TInv, and Tmd]

TInv: delay through an inverter

Tmd: propagation delay from m to d


5.2.3.2 Clock-to-Q of Flip-Flop

[Figure: flip-flop schematic and waveforms showing the latch clock-to-Q, TInv, and the flop clock-to-Q]

$T_{CO}^{Flop} = T_{Inv} + T_{CO}^{Latch}$


5.2.3.3 Setup of Flip-Flop

[Figure: flip-flop schematic and waveforms showing the flop setup and the latch setup]

$T_{SUD}^{Flop} = T_{SUD}^{Latch}$

The setup time of the flip flop is the same as the setup time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.


5.2.3.4 Hold of Flip-Flop

[Figure: flip-flop schematic and waveforms showing the hold time for the latch and the hold time for the flop]

$T_{HO}^{Flop} = T_{HO}^{Latch}$

The hold time of the flip flop is the same as the hold time of the master latch. This is because, once the data is stored in the master latch, it will be held for the slave latch.

5.2.4 Timing Analysis of FPGA Cells

(Smith 5.1.5)

We can apply hierarchical analysis to structures that include both datapath and storage circuitry. We use an Actel FPGA cell to illustrate. The description of the Actel FPGA cell in the course notes is incomplete; refer to Smith's book for additional material.


5.2.4.1 Standard Timing Equations

$T_{PD}$ = delay from D-inputs to storage element

$T_{CLKD}$ = delay from clk-input to storage element

$T_{OUT}$ = delay from storage element to output

Setup time = "slowest D path" − "fastest clk path":

$T_{SUD} = T_{PD,Max} - T_{CLKD,Min}$

Hold time = "slowest clk path" − "fastest D path":

$T_{HO} = T_{CLKD,Max} - T_{PD,Min}$

Delay clk to Q = "clk path" + "output path":

$T_{CO} = T_{CLKD} + T_{OUT}$

5.2.4.2 Hierarchical Timing Equations

Add combinational logic to inputs, clock, and outputs of storage element.

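[Figure: storage element (d, q, clk) with internal parameters t'SUD, t'HO, and t'CO, surrounded by added delays t'PD on the data inputs, t'CLKD on clk, and t'OUT on the output]

$T_{SUD} = T'_{SUD} + T'_{PD,Max} - T'_{CLKD,Min}$

$T_{HO} = T'_{HO} + T'_{CLKD,Max} - T'_{PD,Min}$

$T_{CO} = T'_{CO} + T'_{CLKD,Max} + T_{OUT,Max}$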

5.2.4.3 Actel Act 2 Logic Cell

Timing analysis of Actel Act 2 logic cell (Smith 5.1.5).


Actel ACT

• Basic logic cells are called Logic Modules

• ACT 1 family: one type of Logic Module (see Figure 5.1, Smith p. 192)

• ACT 2 and ACT 3 families: use two different types of Logic Module (see Figure 5.4, Smith p. 198)

• C-Module (Combinatorial Module): combinational logic similar to the ACT 1 Logic Module but capable of implementing a five-input logic function

• S-Module (Sequential Module): C-Module + Sequential Element (SE) that can be configured as a flip-flop

Actel Timing

• ACT family: (see Figure 5.5, Smith p. 200)

• Simple. Why?

– Only logic inside the chip

– Not exact delay (as no place and route, physical layout, hence not accounting for interconnection delay)

– Non-Deterministic Actel Architecture

• All primed parameters inside the S-Module are assumed; calculate tSUD, tH, and tCO

• The combinational logic delay of 3 ns: 0.4 ns went into increasing the setup time, tSUD, and 2.6 ns went into increasing the clock-output delay, tCO. From outside we can say that the combinational logic delay is buried in the flip-flop setup time

[Figure: Simple Actel-style latch (d, clk, q)]

[Figure: Actel latch with active-low clear (d, clk, clr, q)]

[Figure: Actel flop with active-low clear (d, clk, clr, m, q)]

[Figure: Actel sequential module: a C-Module (inputs d00, d01, d10, d11, a1, b1, a0, b0) feeding an SE-Module (clk, clr, se_clk, se_clk_n, m, q)]

5.2.4.4 Timing Analysis of Actel Sequential Module

Timing parameters for Actel latch with active-low clear

Parameter   Value
TSUD        0.4 ns
THO         0.9 ns
TCO         0.4 ns

Other given timing parameters

Parameter                                    Value
C-Module delay (t'PD)                        3 ns
t'CLKD (from clk to se_clk and se_clk_n)     2.6 ns

Question: What are the setup, hold, and TCO times for the entire Actel sequential module?

Answer:

See Smith p. 199. Use Smith's eqn 5.15, 5.16, and assume t'CLKD = 2.6 ns.

Parameter   Value
TSUD        0.8 ns
THO         0.5 ns
TCO         3.0 ns
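The answer can be checked by plugging the given parameters into the hierarchical timing equations of section 5.2.4.2. The sketch below is a hedged illustration: the function and parameter names are made up for this example, a single value is used for both the Max and Min of each delay, and the output delay is assumed to be zero (which is consistent with the 3.0 ns clock-to-Q in the answer, but is not stated in the notes).

```python
def hierarchical_timing(t_sud, t_ho, t_co, t_pd, t_clkd, t_out=0.0):
    """Timing of a storage element wrapped in combinational logic.

    t_sud, t_ho, t_co : setup, hold, clock-to-Q of the inner storage element
    t_pd              : delay added on the data inputs
    t_clkd            : delay added on the clock input
    t_out             : delay added on the output (assumed 0 here)
    """
    setup = t_sud + t_pd - t_clkd
    hold = t_ho + t_clkd - t_pd
    clk_to_q = t_co + t_clkd + t_out
    return setup, hold, clk_to_q


# Values from the tables above, in ns.
setup, hold, clk_to_q = hierarchical_timing(
    t_sud=0.4, t_ho=0.9, t_co=0.4, t_pd=3.0, t_clkd=2.6
)
print(round(setup, 1), round(hold, 1), round(clk_to_q, 1))  # 0.8 0.5 3.0
```

The rounding only hides floating-point noise; the underlying arithmetic is 0.4 + 3.0 − 2.6, 0.9 + 2.6 − 3.0, and 0.4 + 2.6.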


5.2.5 Exotic Flop

As a contrast to the gate-level implementations of latches that we looked at previously, the figure below is the schematic for a state-of-the-art high-performance latch circa 2001.

[Figure: schematic of the latch, with inputs d and clk, output q, an inverter chain, two precharge nodes, and two keepers]

The inverter chain creates an evaluation window in time when clock has just risen and the p transistors are turned on.

When clock is '0', the left precharge node charges to '1' and the right precharge node discharges to '0'.

If d is '1' during the evaluation window, the left precharge node discharges to '0'. The left precharge node goes through an inverter to the second precharge node, which will charge from '0' to '1', resulting in a '0' on q.

If d is '0' during the evaluation window, the left precharge node stays at the precharge value of '1'. The left precharge node goes through an inverter to the second precharge node, which will stay at '0', resulting in a '1' on q.

The two inverter loops are keepers, which provide energy to keep the precharge nodes at their values after the evaluation window has passed and the clock is still '1'.

5.3 Critical Paths and False Paths

5.3.1 Introduction to Critical and False Paths

In this section we describe how to find the critical path through the circuit: the path that limits the maximum clock speed at which the circuit will work correctly. A complicating factor in finding the critical path is the existence of false paths: paths through the circuit that appear to be the critical path, but in fact will not limit the clock speed of the circuit. The reason that a path is false is that the behaviour of the gates prevents a transition (either 0→1 or 1→0) from travelling along the path from the source node to the destination node.

Definition critical path: The slowest path on the chip between flops or between flops and pins. The critical path limits the maximum clock speed.

Definition false path: a path along which an edge cannot travel from beginning to end.

To confirm that a path is a true critical path, and not a false path, we must find a pair of input vectors that exercise the critical path. The two input vectors usually differ only in their value for the input signal on the critical path [1]. The change on this signal (either 0→1 or 1→0) must propagate along the candidate critical path from the input to the output.

Usually the two input vectors will produce different output values. However, a critical path might produce a glitch (0→1→0 or 1→0→1) on the output, in which case the path is still the critical path, but the two input vectors both result in the same value on the output signal. Glitches should not be ignored, because they may result in setup violations. If the glitching value is inside the destination flop or latch at the end of the clock period, then the storage element will not store a stable value.

Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . .

The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.

1. Section 5.3.2: Find the longest path ignoring the possibility of false paths.

2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.

3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.

4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

[1] Section 5.3.5 discusses late-side inputs and situations where more than one input needs to change for the critical path to be exercised.


Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

gate   delay
NOT    2
AND    4
OR     4
XOR    6

5.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full-adder circuit with inputs ci, a, b, outputs co and s, and internal signals i, j, k]

Answer:

Annotate with Max Distance to Destination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: the full adder with each signal annotated with its maximum delay to an output; the largest annotations are 14]

Find Candidate Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


[Figure: the annotated full adder with the candidate critical path marked]

There are two paths of length 14: a–co and b–co. We arbitrarily choose a–co.

Test if Candidate is Critical Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

[Figure: the full adder annotated with signal values that exercise the candidate path a–co]

Yes, the candidate path is the critical path.

The assignment of ci=1, a=0, b=0 followed in the next clock cycle by ci=1, a=1, b=0 will exercise the critical path. As a shortcut, we write the pair of assignments as: ci=1, a=↑, b=0.

Question: Do the input values of ci=0, a=↓, b=1 exercise the critical path?

Answer:

[Figure: the full adder annotated with signal values for ci=0, a=↓, b=1]

The alternative does not exercise the critical path. Instead, the alternative excitation follows a shorter path, so the output stabilizes sooner.

Lesson: not all transitions on the inputs will exercise the critical path. Using timing simulation to find the maximum clock speed of a circuit might overestimate the clock speed, because the input values that you simulate might not exercise the critical path.
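The two excitations above can be compared with a small timing simulation. The sketch below is an illustration under stated assumptions: the netlist is the usual full adder (i = a XOR b, j = a AND b, k = i AND ci, co = j OR k, s = i XOR ci), which matches the signal names and delay annotations in the figures but is not spelled out in the text; the inputs switch at t = 0; and the gate delays come from the table in section 5.3.1. All function names are hypothetical.

```python
from operator import and_, or_, xor

DELAY = {"and": 4, "or": 4, "xor": 6}
FUNC = {"and": and_, "or": or_, "xor": xor}

# Assumed full-adder netlist, in topological order:
# output signal -> (gate type, first input, second input)
GATES = {
    "i": ("xor", "a", "b"),
    "j": ("and", "a", "b"),
    "k": ("and", "i", "ci"),
    "co": ("or", "j", "k"),
    "s": ("xor", "i", "ci"),
}


def steady_state(inputs):
    """Value of every signal once the circuit has settled."""
    values = dict(inputs)
    for out, (kind, x, y) in GATES.items():
        values[out] = FUNC[kind](values[x], values[y])
    return values


def settle_times(before, after, horizon=40):
    """Time of the last change on each gate output when the inputs
    switch from `before` to `after` at t = 0 (0 if it never changes)."""
    old = steady_state(before)
    wave = {name: [after[name]] * horizon for name in after}
    for out in GATES:
        wave[out] = [old[out]] * horizon

    def value_at(sig, t):
        # Before t = 0 every signal holds its old steady-state value.
        return old[sig] if t < 0 else wave[sig][t]

    for t in range(horizon):
        for out, (kind, x, y) in GATES.items():
            d = DELAY[kind]
            wave[out][t] = FUNC[kind](value_at(x, t - d), value_at(y, t - d))

    last = {}
    for out in GATES:
        changes = [t for t in range(1, horizon) if wave[out][t] != wave[out][t - 1]]
        last[out] = changes[-1] if changes else 0
    return last


# ci=1, a rising, b=0: exercises the critical path.
critical = settle_times({"ci": 1, "a": 0, "b": 0}, {"ci": 1, "a": 1, "b": 0})
# ci=0, a falling, b=1: follows a shorter path.
shorter = settle_times({"ci": 0, "a": 1, "b": 1}, {"ci": 0, "a": 0, "b": 1})
print(critical["co"], shorter["co"])  # 14 8
```

Under these assumptions co settles at t = 14 for the first excitation and at t = 8 for the second, so a simulation that used only the second excitation would suggest a faster clock than the circuit can support.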


5.3.1.2 Preliminaries for Critical Paths

There are three classes of paths on a chip:

• entry path: from an input to a flop

Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax".

In Xilinx timing reports this is reported as "Maximum Delay".

• stage path: from one flop to another flop

In Quartus timing reports, this is reported as the period associated with “Internal fmax”.

In Xilinx timing reports, this is reported as “Clock to Setup” and “Maximum Frequency”.

• exit path: from a flop to an output

Quartus does not report this by default. When Quartus reports this path, it is reported as the period associated with "System fmax".

In Xilinx timing reports this is reported as "Maximum Delay".

5.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.

Example False Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .

Question: Determine whether the longest path in the circuit below is a false path.

[Figure: circuit with inputs a and b and output y]


Answer:

For this example, we use a very naive approach simply to illustrate the phenomenon of false paths. Sections 5.3.2–5.3.5 present a better algorithm to detect false paths and find the real critical path.

In the circuit above, the longest path is from b to y.

The four possible scenarios for the inputs are:

• a = 0, b = 0→1

• a = 0, b = 1→0

• a = 1, b = 0→1

• a = 1, b = 1→0

[Figure: the circuit annotated with signal values for each of the four scenarios]

In each of the four scenarios, the edge is blocked at either the AND gate or the OR gate. None of the four scenarios results in an edge on the output y, so the path from b to y is a false path.

Question: How can we determine analytically that this is a false path?

Answer:

The value on a will always force either the AND gate to be a '0' (when a is '0') or the OR gate to be a '1' (when a is '1'). For both a='0' and a='1', a change on b will be unable to propagate to y. The algorithm to detect false paths is based upon this type of analysis.
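The naive four-scenario check can be written out directly. The circuit figure is not reproduced here, so this sketch assumes the circuit is y = (a AND b) OR a, which is consistent with the description above (the path from b passes through an AND gate and then an OR gate, and a is the side input of both). The function name `blocked_at` is illustrative only.

```python
def blocked_at(a, b_before, b_after):
    """Return the gate at which an edge on b is blocked, or None if
    the edge reaches y. Assumed circuit: y = (a AND b) OR a."""
    and_before, and_after = a & b_before, a & b_after
    if and_before == and_after:
        return "AND"  # a = 0 is the controlling value of the AND gate
    y_before, y_after = and_before | a, and_after | a
    if y_before == y_after:
        return "OR"  # a = 1 is the controlling value of the OR gate
    return None


# The four scenarios: (a, b before the edge, b after the edge).
scenarios = [(a, b, 1 - b) for a in (0, 1) for b in (0, 1)]
results = {s: blocked_at(*s) for s in scenarios}
is_false_path = all(gate is not None for gate in results.values())

for scenario, gate in results.items():
    print(scenario, "blocked at", gate)
print("false path:", is_false_path)
```

For this assumed circuit, both scenarios with a = 0 are blocked at the AND gate and both scenarios with a = 1 are blocked at the OR gate, so the path from b to y is reported as false.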


Preview of Complete Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . .

This example illustrates all of the concepts in analysing critical paths. Here, we explore the circuit informally. In section 5.3.5, we will revisit this circuit and analyse it according to the complete, correct, and complex algorithm.

Question: Find the critical path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, and output g]

Answer:

Even though the equation for this circuit reduces to false, the output signal (g) is not a constant '0'. Instead, glitches can occur on g. To explore the behaviour of the circuit, we will stimulate the circuit first with a falling edge, then a rising edge.

Stimulate the circuit with a falling edge and see which path the edge follows.

[Figure: the circuit annotated with arrival times for a falling edge]

The longest path through the circuit is the middle path.

At g, the side input (a) has a controlling value before the falling edge arrives on the path input (e). Thus, a falling edge is unable to excite the longest path through the circuit.

Stimulate the circuit with a rising edge and see which path the edge follows.

[Figure: the circuit annotated with arrival times for a rising edge]

At f, the side input c has a controlling value before the falling edge arrives on the path input (e). Thus, a rising edge is unable to excite the longest path through the circuit.

Of the two scenarios, the falling edge follows a longer path through the circuit than the rising edge. The critical path is the lower path through the circuit.

When we develop our first algorithm to detect false paths (section 5.3.3), we will assume that at each gate, the input that is on the critical path will arrive after the other inputs. Not all circuits satisfy the assumption. At f, when a is a falling edge, the path input (c) arrives before the side input e. This assumption is removed in section 5.3.5, where we present the complete algorithm by dealing with late-arriving side inputs.

5.3.1.4 Timing Simulation vs Static Timing Analysis

The delay through a component is usually dependent upon the values on signals. This is because different paths in the circuit have different delays and some input values will prevent some paths from being exercised. Here are two simple examples:

• In a ripple-carry adder, if a carry out of the MSB is generated from the least significant bit, then it will take longer for the output to stabilize than if no carries are generated at all.

• In a state machine using a one-hot state encoding, false paths might exist when more than one state bit is a '1'.

Because of these effects, static timing analysis might be overly conservative and predict a delay that is greater than you will experience in practice. Conversely, a timing simulation may not demonstrate the actual slowest behaviour of your circuit: if you don't ever generate a carry from LSB to MSB, then you'll never exercise the critical path in your adder. The most accurate delay analysis requires looking at the complete set of actual data values that will occur in practice.

5.3.2 Longest Path

The following is an algorithm to find the longest path from a set of source signals to a set of destination signals. We first provide a high-level, intuitive description, and then present the actual algorithm.

Outline of Algorithm to Find Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The basic idea is to annotate each signal with the maximum delay from it to an output.

• Start at destination signals and traverse through fanin to source signals.

– Destination signals have a delay of 0

– At each gate, annotate the inputs by the delay through the gate plus the delay of the output.

– When a signal fans out to multiple gates, annotate the output of the source (driving) gate with the maximum delay of the destination signals.

• The primary input signal with the maximum delay is the start of the longest path. The delay annotation of this signal is the delay of the longest path.

• The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.

Algorithm to Find Longest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . .

1. Set current time to 0

2. Start at destination signals

3. For each input to a gate that drives a destination signal, annotate the input with the current time plus the delay through the gate

4. For each gate that has times on all of its fanout but not a time for itself,

(a) annotate each input to the gate with the maximum time on the fanout plus the delay through the gate

(b) go to step 4

5. To find the longest path, start at the source node that has the maximum delay. Work forward through the fanout. For signals that fan out to multiple signals, choose the fanout signal with the maximum delay.
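The steps above can be sketched in Python as a backward annotation pass followed by a forward trace. This is a minimal illustration, not the notes' own implementation. It is applied to the full adder of section 5.3.1.1, whose netlist is assumed here (i = a XOR b, j = a AND b, k = i AND ci, co = j OR k, s = i XOR ci), and it relies on the gates being listed in topological order. The names `annotate` and `longest_path` are hypothetical.

```python
DELAY = {"not": 2, "and": 4, "or": 4, "xor": 6}

# Assumed full-adder netlist, in topological order:
# output signal -> (gate type, input signals)
GATES = {
    "i": ("xor", ["a", "b"]),
    "j": ("and", ["a", "b"]),
    "k": ("and", ["i", "ci"]),
    "co": ("or", ["j", "k"]),
    "s": ("xor", ["i", "ci"]),
}


def annotate(gates, destinations):
    """Maximum delay from each signal to any destination signal."""
    delay = {d: 0 for d in destinations}
    # Reverse topological order: every gate is visited after all of
    # its fanout, so its own annotation is final when it is used.
    for out in reversed(list(gates)):
        if out not in delay:
            continue  # this gate does not reach a destination
        kind, inputs = gates[out]
        for sig in inputs:
            delay[sig] = max(delay.get(sig, 0), delay[out] + DELAY[kind])
    return delay


def longest_path(gates, sources, destinations):
    """Return (list of signals on the longest path, its delay)."""
    delay = annotate(gates, destinations)
    fanout = {}
    for out, (kind, inputs) in gates.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(out)
    sig = max(sources, key=lambda s: delay[s])
    path = [sig]
    while sig in fanout:
        # Follow the fanout gate that accounts for this signal's annotation.
        sig = max(fanout[sig], key=lambda g: delay[g] + DELAY[gates[g][0]])
        path.append(sig)
    return path, delay[path[0]]


path, total = longest_path(GATES, ["a", "b", "ci"], ["co", "s"])
print(path, total)  # ['a', 'i', 'k', 'co'] 14
```

When two sources tie (a and b both have a delay of 14 here), `max` returns the first one listed, which mirrors the arbitrary choice of a–co in section 5.3.1.1.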

Longest Path Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . .

Question: Find the longest path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, g, h, i, j, k, l, m]

Answer:

Annotate signals with the maximum delay to an output:

[Figure: the circuit with each signal annotated with its maximum delay to an output; the largest annotation is 16]

Find longest path:

[Figure: the annotated circuit with the longest path marked]

The path from a to y has a delay of 16.

5.3.3 Detecting a False Path

In this section, we will explore a simple and almost correct algorithm to determine if a path is a false path. The simple algorithm in this section sometimes gives incorrect results if the candidate path intersects false paths. For all of the example circuits in this section, the algorithm gives the correct result. The purpose of presenting this almost-correct algorithm is that it is relatively easy to understand and introduces one of the key concepts used in the complicated, correct, and complete algorithm for finding the critical path in section 5.3.5.

5.3.3.1 Preliminaries for Detecting a False Path

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs.

For an AND gate, the controlling value is '0', because when one of the inputs is a '0', we know that the output will be '0' regardless of the values of the other inputs.

The controlled output value is the value produced by the controlling input value.

Gate   Controlling Value   Controlled Output
AND    '0'                 '0'
OR     '1'                 '1'
NAND   '0'                 '1'
NOR    '1'                 '0'
XOR    none                none


Path Input, Side Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . .

Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.

The key idea behind the almost-correct algorithm is that, for an edge to propagate along a path, the side inputs to each gate on the path must have non-controlling values. The complete, correct, and complicated algorithm generalizes this constraint to handle circuits where the side inputs are on false paths.

Reconvergent Fanout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . .

Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.

Most of the difficulties both with critical paths and with testing circuits for manufacturing faults (Chapter 7) are caused by reconvergent fanout.

[Figure: circuit with signals a, b, c, d, e, f, g, h and outputs y and z]

There are two sets of reconvergent paths in the circuit above. One set of reconvergent paths goes from a to y and one set goes from d to z.

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable '0' or '1'.

To support reconvergent fanout, we extend the rule for side inputs having non-controlling values to say that side inputs must have either non-controlling values or have edges that stabilize in non-controlling values.


Rules for Propagating an Edge Along a Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

These rules assume that side inputs arrive before path inputs. Section 5.3.5 relaxes this constraint.

[Figure: rules for propagating rising and falling edges through NOT, AND, OR, and XOR gates]

Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?

Answer:

[Figure: gate with inputs a and b and output c, with waveforms for a, b, and c]

For an AND gate, a falling edge on a side input will force the output to change and prevent the path input from affecting the output. This is because the final value of a falling edge is the controlling value for an AND gate. Similarly, for an OR gate, the final value of a rising edge is the controlling value for the gate.


Analyzing Rules for Propagating Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The pictures below show all combinations of output edge (rising or falling) and input values (constant 1, constant 0, rising edge, falling edge) for AND and OR gates. These pictures assume that the side input arrives before the path input. The pictures that are crossed out illustrate situations that prevent the path input from affecting the output. In these situations the inputs cause either a constant value on the output or the side input affects the output but the path input does not. The pictures that are not crossed out correspond to the rules above for pushing edges through AND and OR gates.

[Figure: edge-propagation cases for AND and OR gates. Crossed-out AND cases are labelled "constant 0 output" or "0 is controlling"; crossed-out OR cases are labelled "constant 1 output" or "1 is controlling"]

Viability Condition of a Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . .

Definition viability condition: For a path (p) through a circuit, the viability condition (sometimes called the viability constraint) is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path. Equivalently: the cases where a transition on the primary input to the path will excite the path.

Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete viability condition.


5.3.3.2 Almost-Correct Algorithm to Detect a False Path

The rules above for propagating an edge along a candidate path assume that the values on side inputs always arrive before the value on the path input. This is always true when the candidate path is the longest path in the circuit. However, if the longest path is a false path, then when we are testing subsequent candidate paths, there is the possibility that a side input will be on a false path and the side input value will arrive later than the value from the path input.

This almost-correct algorithm assumes that values on side inputs always arrive before values on path inputs. The correct, complex, and complete critical path algorithm in section 5.3.5 extends the almost-correct algorithm to remove this assumption.

To determine if a path through a circuit is a false path:

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.

2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.

3. If there is a contradiction amongst the constraints, then the candidate path is a false path.

4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.
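The four steps above can be sketched in Python. This is a minimal illustration, not the notes' own implementation: it assumes steps 1 and 2 have already been done by hand, so each side-input constraint is written as a predicate over the primary inputs, and it detects a contradiction by brute-force enumeration of input values. The function names and the two constraint tables are hypothetical.

```python
from itertools import product


def satisfying_assignments(inputs, constraints):
    """Return every assignment of the primary inputs that meets all constraints."""
    found = []
    for values in product((0, 1), repeat=len(inputs)):
        assignment = dict(zip(inputs, values))
        if all(check(assignment) for check in constraints.values()):
            found.append(assignment)
    return found


def is_false_path(inputs, constraints):
    """Step 3: a contradiction means no assignment satisfies every constraint."""
    return not satisfying_assignments(inputs, constraints)


# Hypothetical constraint tables, one predicate per side input.
contradictory = {
    "g[b]": lambda v: v["b"] == 1,  # side input needs b = 1
    "i[e]": lambda v: v["c"] == 1,  # side input needs c = 1
    "k[h]": lambda v: v["b"] == 0,  # side input needs b = 0
}
consistent = {
    "e[a]": lambda v: v["a"] == 0,
    "g[b]": lambda v: v["b"] == 0,
}

print(is_false_path(["a", "b", "c"], contradictory))   # True
print(satisfying_assignments(["a", "b"], consistent))  # [{'a': 0, 'b': 0}]
```

Enumeration is exponential in the number of inputs, so it only suits hand-sized examples; it is used here because it makes step 3 (contradiction) and step 4 (the surviving input conditions) directly visible.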

5.3.3.3 Examples of Detecting False Paths

False-Path Example 1

Question: Determine if the longest path in the circuit below is a false path.

[Figure: circuit with primary inputs a, b, c and signals d through k, annotated with delay values.]

Answer:

Compute constraints for side inputs to have non-controlling values:


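[Figure: the circuit (with additional labelled signals l and m), with the side inputs on the longest path annotated with their non-controlling values; the annotations are marked as contradictory values.]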

side input   non-controlling value   constraint
g[b]         1                       b
i[e]         0                       c
k[h]         1                       ¬b

Found contradiction between g[b] needing b and k[h] needing ¬b, therefore the candidate path is a false path.

Analyze cause of contradiction:

[Figure: the circuit with the two contradictory side inputs highlighted.]

These side inputs will always have opposite values. Both side inputs feed the same type of gate (AND), so it will always be the case that one of the side inputs has a controlling value (0).


Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with primary inputs a, b, c and signals d through h.]


Answer:

[Figure: the circuit with the side inputs on the longest path annotated with their non-controlling values.]

side input   non-controlling value   constraint
e[a]         1                       a
g[b]         0                       ¬b
h[f]         1                       ¬(a+b)

The complete constraint is the conjunction of the constraints: a¬b¬(a+b), which reduces to false. Therefore, the candidate path is a false path.


This example illustrates a candidate path that is a true path.

Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with primary inputs a, b, c and signals d through h.]

Answer:

Find longest path; label side inputs with non-controlling values:

[Figure: the circuit with the side inputs on the longest path annotated with their non-controlling values.]


Table of side inputs, non-controlling values, and constraints on primary inputs:

side input   non-controlling value   constraint
e[a]         0                       ¬a
g[b]         0                       ¬b
h[b]         1                       ¬(a+b)

The complete constraint is ¬a¬b¬(a+b), which reduces to ¬a¬b. Thus, for an edge to propagate along the path, a must be ’0’ and b must be ’0’.

The primary input to the path (c) does not appear in the constraint, thus both rising and falling edges will propagate along the path. If the primary input to the path appears with a positive polarity (e.g., c) in the constraint, then only a rising edge will propagate. Conversely, if the primary input appears negated (e.g., ¬c), then only a falling edge will propagate.

Critical path   c, e, g, h
Delay           14
Input vector    a=0, b=0, c=rising edge
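The polarity rule can be checked mechanically. The sketch below assumes the complete constraint is available as a predicate over the final values of the primary inputs; the function name `propagating_edges` and the two sample constraints are hypothetical.

```python
from itertools import product


def propagating_edges(inputs, path_input, constraint):
    """Return which edges on path_input can satisfy the constraint.

    The constraint is evaluated on the final value of the path input:
    a rising edge ends at 1 and a falling edge ends at 0.
    """
    edges = set()
    others = [name for name in inputs if name != path_input]
    for values in product((0, 1), repeat=len(others)):
        env = dict(zip(others, values))
        if constraint({**env, path_input: 1}):
            edges.add("rising")
        if constraint({**env, path_input: 0}):
            edges.add("falling")
    return edges


# Path input c is absent from the constraint (not a) and (not b): both edges.
both = propagating_edges(
    ["a", "b", "c"], "c", lambda v: v["a"] == 0 and v["b"] == 0
)
# Path input a appears with positive polarity in a and b: rising edge only.
rising_only = propagating_edges(
    ["a", "b"], "a", lambda v: v["a"] == 1 and v["b"] == 1
)

print(sorted(both))         # ['falling', 'rising']
print(sorted(rising_only))  # ['rising']
```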

Illustration of rising edge propagating along path:

[Figure: the circuit with signal values annotated as a rising edge on c propagates along the path c, e, g, h.]

Illustration of falling edge propagating along path:

[Figure: the circuit with signal values annotated as a falling edge on c propagates along the path c, e, g, h.]


This example illustrates reconvergent fanout.


Question: Determine if the longest path through the circuit below is a critical path. If the longest path is a critical path, find a pair of input vectors that will exercise the path.

[Figure: circuit with primary inputs a, b and signals c through g.]

Answer:

[Figure: the circuit with the side inputs e[b] and g[d] annotated with the non-controlling value ’1’.]

side input   non-controlling value   constraint
e[b]         1                       b
g[d]         1                       a

The complete constraint is ab.

The constraint includes the input to the path (a), which indicates that not all edges will propagate along the path. The polarity of the path input indicates the final value of the edge. In this case, the constraint of a means that we need a rising edge.

Critical path   a, c, e, f, g
Delay           12
Input vector    a=rising edge, b=1

Illustration of rising edge propagating along path:

[Figure: the circuit with a rising edge on a propagating along the path a, c, e, f, g.]

If we try to propagate a falling edge along the path, the falling edge on the side input d forces the output g to fall before the arrival of the falling edge on the path input f. Thus, the edge does not propagate along the candidate path.


[Figure: the circuit with a falling edge on a; the edge on side input d reaches g before the edge on path input f.]

Patterns in False Paths

After analyzing these examples, you might have begun to observe some patterns in how false paths arise. There are several patterns in the types of reconvergent fanout that lead to false paths. For example, if the candidate path has an OR gate and an AND gate that are both controlled by the same signal and the candidate has an even number of inverters between these gates, then the candidate path is almost certainly a false path. The reason is the same as illustrated in the first example of a false path. The side input will always have a controlling value for either the OR gate or the AND gate.

5.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.

5.3.4.1 Algorithm to Find Next Candidate Path

To find the next candidate path, we use a path table, which keeps track of the partial paths that we have explored, their maximum potential delay, and the signals that we can follow to extend a partial path toward the outputs. We keep the path table sorted by the maximum potential delay of the paths. We delete a path from the table if we discover that it is a false path.

The key to the path table is how to update the potential delay of the partial paths after we discover a false path. All partial paths that are prefixes of the false path will need to have their potential delay values recomputed. The updated delay is found by following the unexplored signals in the fanout of the end of the partial path.

1. Initialize the path table with the primary inputs, their potential delay, and fanout.

2. Sort the path table by potential delay (path with greatest potential delay at bottom of table).

3. If the partial path with the maximum potential delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:

(a) Create a new entry in the path table for the partial path extended by the unused fanout signal with the maximum potential delay.

(b) Delete this fanout signal from the list of unused fanout signals for the partial path.

4. Compute the constraint that the side input of the new signal does not have a controlling value, and update the constraint table.

5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:

(a) Mark this partial path as false.

(b) For each partial path that is a prefix of the false path: reduce the potential delay of the path by the difference between the potential delay of the fanout that was followed and the unused fanout with the next greatest delay value.

(c) Return to step 2.
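The ordering that the path table produces (candidates in decreasing order of delay) can be sketched with a priority queue. This is a simplified stand-in, not the path-table algorithm itself: it replaces the sorted table and the delay-recomputation step with a best-first search, and it leaves the false-path test of each candidate to the caller. The circuit (`graph`, `delay`) is hypothetical, and the sketch assumes each gate's delay is attributed to the gate's output signal.

```python
import heapq


def paths_by_decreasing_delay(graph, delay, primary_inputs):
    """Yield (delay, path) for complete paths, longest first.

    graph maps each signal to the signals in its fanout; delay maps each
    gate output to its gate delay. Outputs have an empty fanout.
    """
    remaining = {}

    def longest_from(sig):
        # Largest delay still to be accumulated after reaching sig.
        if sig not in remaining:
            remaining[sig] = max(
                (delay[n] + longest_from(n) for n in graph.get(sig, [])),
                default=0,
            )
        return remaining[sig]

    # Heap entries: (-potential delay, partial path, delay accumulated so far).
    heap = [(-longest_from(pi), [pi], 0) for pi in primary_inputs]
    heapq.heapify(heap)
    while heap:
        neg_potential, path, so_far = heapq.heappop(heap)
        fanout = graph.get(path[-1], [])
        if not fanout:
            yield -neg_potential, path  # reached an output
            continue
        for nxt in fanout:
            d = so_far + delay[nxt]
            heapq.heappush(heap, (-(d + longest_from(nxt)), path + [nxt], d))


# Hypothetical circuit: x = gate(a, b), y = gate(a), z = gate(x, y).
graph = {"a": ["x", "y"], "b": ["x"], "x": ["z"], "y": ["z"], "z": []}
delay = {"x": 2, "y": 4, "z": 2}

candidates = list(paths_by_decreasing_delay(graph, delay, ["a", "b"]))
print(candidates[0])  # (6, ['a', 'y', 'z'])
```

Each yielded path is a candidate critical path; if the constraint check finds it false, the caller simply asks the generator for the next one.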

5.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: circuit with primary inputs a, b, c and signals d through k, annotated with delay values.]

Answer:

Initial state of path table:

potential delay   unused fanout   path
10                e               c
12                h, g            b
16                d               a

Extend the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple signals in the fanout.


Path table and constraint table after detecting that the longest path is a false path:

potential delay   unused fanout   path
10                e               c
12                h, g            b
16                j, i            a, d, f, g
false                             a, d, f, g, i, k

side input   non-controlling value   constraint
g[b]         1                       b
i[e]         0                       c
k[h]         1                       ¬b

The longest path is a false path. Recompute the potential delay of all paths in the path table that are prefixes of the false path.

The one path that is a prefix of the false path is: 〈a,d,f,g〉. The remaining unused fanout of this path is j, which has a potential delay on its input of 2. The previous potential delay of g was 8, thus the potential delay of the prefix reduces by 8−2 = 6, giving the path a potential delay of 16−6 = 10.
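The update in step 5(b) is a single subtraction. The helper below is a hypothetical illustration of that arithmetic, using the numbers from this example.

```python
def updated_potential_delay(prefix_delay, followed_fanout_delay, next_fanout_delay):
    """Reduce a prefix's potential delay after the fanout it followed proved false.

    The reduction is the gap between the fanout that was followed and the
    unused fanout with the next greatest potential delay.
    """
    return prefix_delay - (followed_fanout_delay - next_fanout_delay)


# Prefix <a,d,f,g>: followed fanout had potential delay 8, remaining fanout has 2.
print(updated_potential_delay(16, 8, 2))  # 10
```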

Path table after updating with new potential delays:

potential delay   unused fanout   path
false                             a, d, f, g, i, k
10                e               c
10                i               a, d, f, g
12                h, g            b

Extend b through g, because g has greater potential delay than the other fanout signal (h).

potential delay   unused fanout   path
false                             a, d, f, g, i, k
10                e               c
10                i               a, d, f, g
12                h, g            b
12                i, j            b, g

side input   non-controlling value   constraint
g[a]         1                       a

From g, we will follow i, because it has greater potential delay than j.


potential delay   unused fanout   path
false                             a, d, f, g, i, k
10                e               c
10                i               a, d, f, g
12                h, g            b
12                i, j            b, g
12                                b, g, i, k

side input   non-controlling value   constraint
g[a]         1                       a
i[e]         0                       c
k[h]         1                       ¬b

We have reached an output without encountering a contradiction in our constraints. The complete constraint is a¬bc.

Critical path   b, g, i, k
Delay           12
Input vector    a=1, b=falling edge, c=1

Illustrate the propagation of a falling edge:

[Figure: the circuit with signal values annotated as a falling edge on b propagates along the path b, g, i, k.]

At k, the rising edge on the side input (h) arrives before the falling edge on the path input (i). For a brief moment in time, both the side input and path input are ’1’, which produces a glitch on k.

Next-Path Example 2

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a through e and signals f through m.]


Answer:

Find the longest path:

[Figure: the circuit annotated with delay values; the largest potential delay is 22.]


potential delay   unused fanout   path
4                 k               e
10                j, l            a
14                i               b
20                g               c
22                f               d

Extend the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.

potential delay   unused fanout   path
4                 k               e
10                j, l            a
14                i               b
20                g               c
22                j, k            d, f, g, h, i
false                             d, f, g, h, i, j, l

side input   non-controlling value   constraint
g[c]         1                       c
i[b]         0                       ¬b
j[a]         0                       ¬a
l[a]         1                       a

Contradiction between j[a] and l[a], therefore the path 〈d,f,g,h,i,j,l〉 is a false path. Any path that extends this path is also false.

To find the next candidate, begin by recomputing delays along the candidate path. The second gate in the contradiction is l. The last intermediate path before l with unused fanout is i. Cut the candidate path at this signal. The remaining initial part of the candidate path is: d, f, g, h, i. The only unused fanout of this path is k.

We now calculate the new maximum potential delay of 〈d, f, g, h, i〉, taking into account the false path that we just discovered. The delay from i along the candidate path 〈j, l, m〉 is 10 and the maximum potential delay along the remaining unused fanout (k) is 4. The difference is: 10−4 = 6, and so the potential delay of 〈d, f, g, h, i〉 is reduced to 22−6 = 16.

After updating the partial delay of 〈d, f, g, h, i〉, the partial path with the maximum potential delay is c. The new critical path candidate will be: c, g, h, i, j, l, m.

Update the path table with a delay of 16 for the previous candidate path. Extend c along the path with maximum potential delay until we find a contradiction or reach the end of the path. Add an entry in the path table for each intermediate path with multiple fanout signals.

potential delay   unused fanout   path
false                             d, f, g, h, i, j, l
4                 k               e
10                j, l            a
14                i               b
16                k               d, f, g, h, i
20                k               c, f, g, h, i
false                             c, f, g, h, i, j, l

We encounter the same contradiction as with the previous candidate, and so we have another false path. We could have detected this false path without working through the path table, if we had recognized that our current candidate path overlaps with the section (j, l) of the previous candidate that caused the false path.

As with the previous candidate, we reduce the potential delay of the current candidate path up through i by 6, giving us a potential delay of 20−6 = 14 for 〈c, f, g, h, i〉. The next candidate path is 〈d, f, g, h, i, k〉 with a delay of 16.

potential delay   unused fanout   path
false                             d, f, g, h, i, j, l
false                             c, f, g, h, i, j, l
4                 k               e
10                j, l            a
14                i               b
14                k               c, f, g, h, i
16                k               d, f, g, h, i


We extend the path through k and compute the constraint table.

side input   non-controlling value   constraint
g[c]         1                       c
i[b]         0                       ¬b
k[e]         0                       ¬e

The complete constraint is ¬bc¬e. There is no constraint on a, and d may be either a rising edge or a falling edge.

Critical path   d, f, g, h, i, k
Delay           16
Input vector    a=0, b=0, c=1, d=rising edge, e=0

Next-Path Example 3

Question: Find the critical path in the circuit below.

[Figure: circuit with primary inputs a, b, c, d and signals e through p.]

Answer:

[Figure: the circuit annotated with delay values; the largest potential delay is 16.]



potential delay   unused fanout   path
8                 n, o            d
12                j, k            a
14                e               b
16                f               c

Extend c through f:

potential delay   unused fanout   path
8                 n, o            d
12                j, k            a
14                e               b
16                m, n            c, f, g, h, i
false                             c, f, g, h, i, n, p

side input   non-controlling value   constraint
n[d]         1                       d
p[o]         1                       ¬d

The first candidate is a false path. Recompute the potential delay of c, f, g, h, i, which reduces it from 16 to 12.

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
8                 n, o            d
12                j, k            a
12                m               c, f, g, h, i
14                e               b

Extend b through e:

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
8                 n, o            d
12                j, k            a
12                m               c, f, g, h, i
false                             b, e, k, l

side input   non-controlling value   constraint
k[a]         1                       a
l[j]         1                       ¬a


The second candidate is a false path. There is no unused fanout signal from l for the path b, e, k, l, so this partial path is a false path and there is no new delay information to compute.

There are two paths with a potential delay of 12. Choose 〈c,f,g,h,i〉, because the end of the path is closer to an output, so there will be less work to do in analyzing the path.

potential delay   unused fanout   path
false                             c, f, g, h, i, n, p
false                             b, e, k, l
8                 n, o            d
12                j, k            a
12                                c, f, g, h, i, m

side input   non-controlling value   constraint
m[l]         0                       ¬(¬a ∗ (ab)) = true

Critical path   c, f, g, h, i, m
Delay           12
Input vector    a=0, b=1, c=rising edge, d=0

5.3.5 Correct Algorithm to Find Critical Path

In this section, we remove the assumption that values on side inputs always arrive earlier than the value on the path input. We now deal with late arriving side inputs, or simply “late side inputs”.

The presentation of late side inputs is as follows:

Section 5.3.5.1 rules for how late side inputs can allow path inputs to exercise gates

Section 5.3.5.2 idea of monotone speedup, which underlies some of the rules

Section 5.3.5.3 one of the potentially confusing situations in detail.

Section 5.3.5.4 complete, correct, and complex algorithm.

Section 5.3.5.5 examples

5.3.5.1 Rules for Late Side Inputs

For each gate, there are eight situations: the side input is controlling or non-controlling, the path input is controlling or non-controlling, and the side input arrives early or arrives late.

[Figure: waveforms of the eight situations, shown once for an AND gate and once for an OR gate.]

situation                       early side                  late side
path=CTRL, side=non-ctrl        path input causes glitch    monotone speedup
path=non-ctrl, side=non-ctrl    path input propagates       monotone speedup
path=CTRL, side=CTRL            side input propagates       path input propagates
path=non-ctrl, side=CTRL        neither input propagates    side input causes glitch

Late side inputs give us three more situations for each of AND and OR gates where the path input will/might excite the gate. In the two cases labeled monotone speedup, the path input does not excite the gate with the current timing, but if our timing estimates for the side input are too slow, or the timing of the side speeds up due to voltage or temperature variations, then the late input might become an early side input.

The five situations where the path input excites the gate are:

side is early

side=non-ctrl, path=non-ctrl: The path input is the later of the two inputs to transition to a non-controlling value, so it is the one that causes the output to transition.

side=non-ctrl, path=ctrl: The side input transitions to a non-controlling value while the path input is a non-controlling value; this causes the output to transition to a non-controlled value. The path input then transitions to a controlling value, causing a glitch on the output as it transitions to a controlled value.

side is late

side=non-ctrl, path=non-ctrl: If the side input arrives earlier than expected, then we will have an early arriving side input with a non-controlling value.

side=non-ctrl, path=ctrl: If the side input arrives earlier than expected, then we will have an early arriving side input with a non-controlling value.

side=ctrl, path=ctrl: The path input transitions to a controlling value before the side input; so, it is the input that causes the output to transition.


The three situations where the path input does not excite the gate are:

side is early

side=ctrl, path=ctrl: The side input transitions to a controlling value before the path input transitions to a controlling value. The edge on the path input does not propagate to the output.

side=ctrl, path=non-ctrl: It is always the case that at least one of the inputs is a controlling value, so the output of the gate is a constant controlled value.

side is late

side=ctrl, path=non-ctrl: The path input transitions to a non-controlling value while the side input is still non-controlling. This causes the output to transition to a non-controlled value. The side input then transitions to a controlling value, which causes the glitch as the output transitions to a controlled value. The second edge of the glitch is caused by the side input, so the side input determines the timing of the gate.

Combining together the five situations where the path input excites the gate gives us our complete and correct rule: a path input excites the gate if the side-input is non-controlling or the side-input arrives late and the path input is controlling.
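The rule is a two-term Boolean expression, so the eight situations can be enumerated directly. The function name `path_input_excites` is hypothetical; the expression is the rule stated above.

```python
from itertools import product


def path_input_excites(side_controlling, side_late, path_controlling):
    """Complete rule: side is non-controlling, or side is late and path is controlling."""
    return (not side_controlling) or (side_late and path_controlling)


# Enumerate all eight situations for one gate.
situations = list(product((False, True), repeat=3))
exciting = [s for s in situations if path_input_excites(*s)]

for side_controlling, side_late, path_controlling in situations:
    print(
        "side=%s %s, path=%s -> %s"
        % (
            "ctrl" if side_controlling else "non-ctrl",
            "late" if side_late else "early",
            "ctrl" if path_controlling else "non-ctrl",
            path_input_excites(side_controlling, side_late, path_controlling),
        )
    )
```

The enumeration yields five exciting situations and three non-exciting ones, matching the two lists above.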

Section 5.3.5.2 discusses monotone speedup in more detail, then section 5.3.5.3 demonstrates that a late-arriving side input that causes a glitch cannot result in a true path. After these two tangents, we finally present the correct, complete, and complex algorithm for critical path analysis.

5.3.5.2 Monotone Speedup

When we have a late side input with a non-controlling value, the path input does not excite the gate, but the rules state that we should consider this to be a true path. The reason that we report this as a true path, even though the path input does not excite the gate, is due to the idea of monotone speedup.

Definition monotonic: A function (f) is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: x < y ⟹ f(x) ≤ f(y).
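As a small illustration of the definition, the check below tests x < y ⟹ f(x) ≤ f(y) over a finite set of sample points. The helper name `is_monotonic` is hypothetical, and passing the check on samples does not prove that a function is monotonic everywhere.

```python
def is_monotonic(f, xs):
    """Check x < y => f(x) <= f(y) on the given sample points."""
    ordered = sorted(xs)
    # Comparing neighbours is enough because <= is transitive.
    return all(f(x) <= f(y) for x, y in zip(ordered, ordered[1:]))


print(is_monotonic(lambda x: min(x, 3), range(6)))  # True: rises, then stays flat
print(is_monotonic(lambda x: -x, range(6)))         # False: decreasing
```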

Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.

Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.

Definition monotonous speedup: A lecture has monotonous speedup if increasing the pace of the lecture increases the number of people who are awake.

In the monotone speedup situations, if we were to report the candidate path as false and the side input arrives sooner than expected, the path might generate an edge. Thus, a path that we initially thought was a false path becomes a real path. Speeding up a part of the circuit turned a false path into a real path, and thereby actually reduced the maximum clock speed of the circuit.

Monotone speedup is desirable, because if we claim that a circuit has a certain minimum delay and then speed up some of the gates in the circuit (because of resizing gates, process variations, temperature or voltage fluctuations), we would be quite distraught to discover that we have in fact increased the minimum delay.

We can see the rationale behind the monotone speedup rules by observing that if we have a late side input that transitions to a non-controlling value, and the circuitry that drives the late side input speeds up, the late side input might become an early side input. For each of the two monotone speedup situations, the corresponding early side input situation has a true path.

5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation

In the following paragraphs we analyze the rule for a late side input where the side input is controlling and the path input is non-controlling. The excitation rules say that in this situation the path input cannot excite the gate. We might be tempted to think that we could construct a circuit where the first edge of the glitch (which is caused by the path input) propagates and the second edge (which is caused by the late side input) does not propagate. Here we demonstrate why we cannot create such a circuit. Readers who are willing to accept that the Earth is round without personally circumnavigating the globe may wish to skip to section 5.3.5.4.

In the picture below, c is the gate that produces a glitching output because of a late-arriving side input. We know that 〈a,c〉 is part of a false path and will demonstrate that in the current situation, 〈b,c〉 must also be part of a false path.

[Figure: gate c with inputs a and b.]

For 〈a,c〉 to be a part of a false path, there must be a gate that appears later in the circuit that prevents the second edge of the glitch from propagating. In the figure below, this later gate is f, with e being the path input (from c) and d being the side input.

[Figure: gate f with path input e and side input d in six situations: very early side ctrl, middling early side ctrl, late side ctrl, very early side non-ctrl, middling early side non-ctrl, late side non-ctrl.]


For the first edge on e to propagate, the side input (d) must have a non-controlling value at the time of the first edge. To prevent the second edge of the glitch from propagating from e to f, d must be a controlling value. That is, d must transition from a non-controlling value to a controlling value in the middle of the glitch on e. This corresponds to the “middling early side ctrl” situation in the figure. From the perspective of the first edge of the glitch, this is identical to the situation with the first gate (c), in that a late-arriving side input transitions to a controlling value.

In this case of “middling early side ctrl”, the edge on d arrives later than the first edge on e, which means that 〈d,f〉 is a slower path than 〈b,c, ...,e,f〉, which means that 〈d,f〉 is part of a false path. Thus, there is a gate later in the circuit that prevents the second edge of the glitch on f from propagating. We wrap up the argument that the situation illustrated with a, b, c cannot lead to a critical path through 〈b,c〉 in two ways: intuitively and mathematically.

Intuitively, for 〈b,c〉 to be part of a critical path, c must be followed by f, which itself must be followed by another gate with a middling-early side input. All of the other cases that prevent the second edge of the glitch from propagating will prevent both edges of the glitch from propagating. This other gate with the middling-early side input produces a glitch and so must itself be followed by yet another gate with a middling-early side input. This process continues ad infinitum: we cannot construct a finite circuit that allows the first edge of the glitch on c to propagate and prevents the second edge of the glitch from propagating.

Mathematically, we construct a simple inductive proof based on the number of later gates in the candidate path. In the base case, f is the last gate in the path, and so it must be the gate that propagates the first edge of the glitch and does not generate a glitch. There is no situation in which this happens, thus the last gate in the path cannot have a middling-early input. In the inductive case we assume that there are n gates later in the path and none of them have middling-early side inputs. We can then prove that the gate just prior to the nth gate cannot have a middling-early side input, because for it to have a middling-early side input, one of the n later gates would need to have a middling-early side input that would allow the first edge of the glitch to propagate and prevent the second edge of the glitch from propagating. From the inductive hypothesis, we know that none of the n gates have a middling-early input, and so we have completed the proof by contradiction.

5.3.5.4 Complete Algorithm

The possibility of late-arriving side inputs caused us to modify our rules for when a path input will excite a gate. The complete rule (section 5.3.5.1) is: the side-input is non-controlling or the side-input arrives late and the path input is controlling. Because we explore candidate critical paths beginning with the slowest and working through faster and faster paths, a late-arriving side input must be part of a previously discovered false path.

In the previous sections, when we did not have late-arriving side inputs, we could exercise the critical path with a change on just one input signal. With late-arriving side inputs, both the primary input to the critical path and the late-arriving side inputs might need to change.


When using the late-arriving side input portion of our excitation rule, we must ensure that the side input does in fact arrive later than the path input. If we do not, we would fall into the situation where both inputs are controlling and the side input arrives early. In this situation, the side input excites the gate.

For the side input to arrive late, the late path to the side input must be viable. Stated more precisely, the prefix of the previously discovered false path that ends at the side input must be viable. The entire previously discovered false path is clearly not viable; it is only the prefix up to the side input that must be viable. The viability condition for the prefix uses the same rule as we use for normal path analysis: for every gate along the prefix the side-input is non-controlling or the prefix’s side input arrives late and the prefix’s path input is controlling.

The complete, correct, and complex algorithm is:

• If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.

• If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input.

• For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input).

• To the row of the late arriving side input in the constraint table, add as a disjunction the constraint that: the path input has a controlling value and at least one of the prefixes is viable.
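The last bullet can be sketched as a function that widens one row of the constraint table. The names below are hypothetical, and the sample values are assumptions chosen for illustration: a single primary input a, an early constraint that needs a = 0, a path input that is controlling when a = 1, and one late prefix whose viability condition is true.

```python
def with_late_side(early, path_controlling, prefix_viabilities):
    """Widen a constraint-table row for a late-arriving side input.

    New row = early constraint
              OR (path input is controlling AND at least one prefix is viable).
    """
    def row(env):
        late = path_controlling(env) and any(v(env) for v in prefix_viabilities)
        return early(env) or late
    return row


# Assumed sample values (see the lead-in).
f_e = with_late_side(
    early=lambda v: v["a"] == 0,
    path_controlling=lambda v: v["a"] == 1,
    prefix_viabilities=[lambda v: True],
)
g_a = lambda v: v["a"] == 1  # an ordinary early side-input constraint

solutions = [a for a in (0, 1) if f_e({"a": a}) and g_a({"a": a})]
print(solutions)  # [1]
```

With these values the widened row is satisfied for every value of a, so the only remaining constraint is a = 1, which is satisfied by a rising edge on a.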

5.3.5.5 Complete Examples

Complete Example 1


[Figure: circuit with primary input a and signals b through g.]

Answer:

[Figure: the circuit annotated with delay values; the largest potential delay is 14.]


potential delay   unused fanout   path
14                g, b, c         a
false                             a, b, d, e, f, g

side input   non-controlling value   constraint
f[c]         1                       ¬a
g[a]         1                       a

First false path, pursue next candidate.

potential delay   unused fanout   path
false                             a, b, d, e, f, g
10                g, c            a
10                                a, c, f, g

side input   non-controlling value   constraint
f[e]         1                       ¬a
g[a]         1                       a

At first, this path appears to be false, but the side input f[e] is on the prefix of the false path a,b,d,e,f,g. Thus, f[e] is a late arriving side input.

The candidate path will be a true path if the side input arrives late and the path input is a controlling value. The viability condition for the path a,b,d,e is true. The constraint for the path input (c) to have a controlling value for f is a. Together, the viability constraint of true and the controlling value constraint of a give us a late-side constraint of a.

Updating the constraint table with the late arriving side input constraint gives us:

side input   non-controlling value   constraint
f[e]         1                       ¬a + a = true
g[a]         1                       a

The constraint reduces to a. A rising edge will exercise the path.

Critical path   a, c, f, g
Delay           10
Input vector    a=rising edge

Illustration of rising edge exercising the critical path:

[Figure: the circuit annotated with edge arrival times for a rising edge on a; the edge reaches the output g at time 10.]


Complete Example 2


[Figure: circuit for this example; only the labels a, b, c, d, i survive from the original figure.]

Answer:

Find longest path:

[Figure: the circuit with primary inputs a, b, c and signals d through j, annotated with delay values; the largest potential delay is 18.]

Explore longest path:

potential delay   unused fanout   path
8                 f               a
12                h               c
18                f, g            b, d, e
18                h, i            b, d, e, g
false                             b, d, e, g, h, i, j

side input   non-controlling value   constraint
h[c]         0                       ¬c
i[g]         0                       b
j[f]         0                       ¬a¬b

Contradiction.

[Figure: the circuit annotated with the signal values that produce the contradiction.]

First false path, find next candidate.


Changes in potential delays:

Signal / path         old   new
g on 〈b,d,e,g〉        12    8
〈b,d,e,g〉             18    14
g[e] on 〈b,d,e〉       14    10
e on 〈b,d,e〉          14    10
〈b,d,e〉               18    14

potential delay   unused fanout   path
false                             b, d, e, g, h, i, j
8                 f               a
12                h               c
14                f, g            b, d, e
14                                b, d, e, g, i, j

[Figure: the circuit annotated with the updated delay values.]

side input   non-controlling value   constraint
h[c]         0                       ¬c
i[h]         0                       ¬cb
j[f]         0                       ¬a¬b

Initially, we found a contradiction, but 〈b,d,e,g,h〉 is a prefix of a false path, and i[h] is a side input to the candidate path. We have a late side input.

Note that at the time that we passed through i, we could not yet determine that we would need to use i[h] as a late side input. The lesson is that when a contradiction is discovered, we must look back along the entire candidate path covered so far to see if we have any late side inputs.

Our late-arriving constraint for i[h] is:

• late side path (〈b,d,e,g,h〉) is viable: ¬c.

• path input (i[g]) has a controlling value of ’1’: ¬b.

Combining these constraints together gives us ¬b¬c.

Adding the constraint of the late side input to the condition table gives us:

side input   non-controlling value   constraint
h[c]         0                       ¬c
i[h]         0                       b¬c + ¬b¬c = ¬c
j[f]         0                       ¬a¬b

The constraints reduce to ¬a¬b¬c.

Critical path   〈b,d,e,g,i,j〉
Delay           14
Input vector    a=0, b=falling edge, c=0
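The reduction of the final constraint table can be double-checked by enumeration. The predicates below are one reading of the three rows, chosen to agree with the stated input vector (a=0, b falling to 0, c=0); they are assumptions for illustration, with the i[h] row written as the disjunction of its early and late constraints.

```python
from itertools import product


def h_c(a, b, c):
    return c == 0  # h[c] must be 0


def i_h(a, b, c):
    early = b == 1 and c == 0  # early constraint for i[h]
    late = b == 0 and c == 0   # late side path viable and path input controlling
    return early or late


def j_f(a, b, c):
    return a == 0 and b == 0  # j[f] must be 0


solutions = [
    (a, b, c)
    for a, b, c in product((0, 1), repeat=3)
    if h_c(a, b, c) and i_h(a, b, c) and j_f(a, b, c)
]
print(solutions)  # [(0, 0, 0)]
```

The single solution gives the final values a=0, b=0, c=0; because b is the path input, a final value of 0 corresponds to a falling edge on b.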

Illustration of falling edge exercising the critical path:

[Figure: the circuit annotated with edge arrival times for a falling edge on b; the edge reaches the output j at time 14.]

Complete Example 3

This example illustrates the benefits of the principle of monotone speedup when analyzing critical paths.

[Figure: circuit with primary input a and signals b through f.]

Critical-path analysis says that the critical path is 〈a,c,e,f〉, with a late side input of e[d] and a total delay of 10. The required excitation is a rising edge on a. However, with the given delays, this excitation does not produce an edge on the output.

[Figure: the circuit annotated with edge arrival times for a rising edge on a; no edge appears on the output.]

For a more complete analysis of the behaviour, we also try a falling edge. The falling edge exercises the path 〈a,f〉 with a delay of 4.

a b

ef

c

d0 0 2 4

0 2

04

6

Monotone speedup says that if we reduce the delay of any gate,we must not increase the delay ofthe overall circuit. We reduce the delays ofb andd from 2 to 0.5 and produce an edge at time 10via the path〈a,c,e,f〉.


a b

ef

c

d0 0 0.5 1

0 2

0

610

The critical path analysis said that the critical path was〈a,c,e,f〉 with a delay of 10. With theoriginal circuit, the slowest path appeared to have a delay of 4. But, by reducing the delays of twogates, we were able to produce an edge with a delay of 10. Thus,the critical path algorithm didindeed satisfy the principle of monotone speedup.

Complete Example 4

This example illustrates that we sometimes need to allow edges on the inputs to late side paths.

[Figure: circuit for Complete Example 4, with signals a to k]

Answer:

The purpose of this example is to illustrate a situation where we need the primary input of a late-side path to toggle. To focus on the behaviour of the circuit, we show pictures of different situations and do not include the path and constraint tables.

Longest path in the circuit, showing a contradiction between e[b] and j[h].

[Figure: circuit annotated with the side-input values for the longest path]

Second longest path 〈b,f,g,h,i,j,k〉, using only early side inputs, showing a contradiction between k[e] and i[e].

[Figure: circuit annotated with the side-input values for the second longest path]

Second longest path using late side input i[e], which has a controlling value of 1 (rising edge) on i[h]. However, we neglect to put a rising edge on a. The late-side path is not exercised and our candidate path is also not exercised.

[Figure: circuit annotated with signal values and edge times when a is held constant]

We now put a rising edge on a, which causes our late side input (i[e]) to be a non-controlling value when our path input (i[h]) arrives.

[Figure: circuit annotated with edge times when a has a rising edge]

In looking at the behaviour of i, we might be concerned about the precise timing of the glitch on e and the rising edge on h. The figure below shows normal, slow, and fast timing of e. With slow timing, the first edge of the glitch on e arrives after the rising edge on h. The timing of the second edge of the glitch remains unchanged. The value of i remains constant, which could lead us to believe (incorrectly!) that our critical path analysis needs to take into account the first edge of the glitch. However, this is in fact an illustration of monotone speedup. The fast timing scenario moves the glitch earlier, such that the edge on h does in fact determine the timing of the circuit, in that h produces the last edge on i. In summary, with the glitch on e and the rising edge on h, either h causes the last edge on i or there is no edge on i.

[Figure: waveforms of e, h, and i in three panels: normal timing, slow timing on e, and fast timing on e]

Complete Example 5

This example demonstrates that a late side path must be viable to be helpful in making a true path.

[Figure: circuit for Complete Example 5, with signals a to k]

Answer:

Find that the two longest paths are false paths, because of a contradiction between g[d] and i[c].

[Figure: circuit annotated with the conflicting side-input values for the two longest paths]

Try the third longest path 〈d,f,h,j,k〉 using early side inputs. Find a contradiction between k[i] and j[c].

[Figure: circuit annotated with the side-input values for the third longest path]

Try using late side paths 〈a,e,g,i,k〉 or 〈b,e,g,i,k〉. Find that neither path is viable by itself, because of the contradiction between g[d] and i[c]. Also, neither path is viable in conjunction with the candidate path, because of the contradiction between i[c] on the late side path and j[c] on the candidate path. Either one of these contradictions by itself is sufficient to prevent the late side path from helping to make the candidate path a true path.

[Figure: circuit annotated with the side-input values for the late side paths]

5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton’s paper includes two extensions to the critical path algorithm presented here that we will not cover.


• gates with more than two inputs

• finding all input values that will exercise the critical path

• multiple paths with the same delay to the same gate

5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.

5.4 Elmore Timing Model

There are many different models used to describe the timing of circuits. In the section on critical paths, we used a timing model that was based on the size of the gate. The timing model ignored interconnect delays and treated all gates as if they had the same fanout. For example, the delay through an AND gate was 4, independent of how many gates were in its immediate fanout.

In this section and the next (section 5.4) we discuss two timing models. In this section, we discuss the detailed analog timing model, which reflects quite accurately the actual voltages on different nodes. The SPICE simulation program uses very detailed analog models of transistors (dozens of parameters to describe a single transistor). In the next section, we describe the Elmore delay model, which achieves greater simplicity than the analog model, but at a loss of accuracy.

5.4.1 RC-Networks for Timing Analysis

[Figure: a P-transistor and an N-transistor, each shown at the transistor level, the mask level (poly, diffusion, contact), as a cross-section of the fabricated transistor (poly, diffusion, contact, substrate), and at the switch level; every view labels the gate, source, and drain]


Different Levels of Abstraction for Inverter

[Figure: an inverter with input a and output b shown at the gate level, the transistor level (VDD, GND), and the mask level (VDD, GND, poly, n-diff, p-diff, metal, contact)]

From the electrical characteristics of fabricated transistors, VLSI and device engineers derive models of how transistors behave based on mask-level descriptions. For our purposes, we will use the very simple resistor-capacitor model shown below.

Each of the P- and N-transistor models contains a resistor (“pullup” for the P-transistor and “pulldown” for the N-transistor) and a parasitic capacitor.

When we combine a P-transistor and an N-transistor to create an inverter, we combine the capacitors into a single parasitic capacitor that is the sum of the two individual capacitors.

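RC-Network models of P- and N-transistors

[Figure: P-transistor modelled as pullup resistor Rpu with parasitic capacitor Cp; N-transistor modelled as pulldown resistor Rpd with parasitic capacitor Cp; each with gate, source, and drain]

RC-Network for Timing Analysis

[Figure: inverter from a to b modelled with Rpu to VDD, Rpd to GND, parasitic capacitor Cp, and load capacitor CL]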

• Contacts (vias) have resistance (RV)

• Metal areas (wires) have resistance (RW) and capacitance (CW).

– The resistance is dependent upon the geometry of the wire.

– The capacitance is dependent upon the geometry of the wire and the other wires adjacent to it.

• For most circuits, the via resistance is much greater than the wire resistance (RV ≫ RW)

To reduce area, modern wires tend to have tall and narrow cross sections. When wires are packed close together (e.g. a signal that is an array or vector), the wires act like capacitors.


A Pair of Inverters

[Figure: two inverters in series (a to b to c) shown at the gate level, the transistor level (VDD, GND), and the mask level]

A Pair of Inverters (Cont’d)

[Figure: mask-level layout of the two inverters and the corresponding RC-network; each inverter has Rpu, Rpd, Cp, and CL, and the connection from the first inverter to the second has via resistance RV, wire resistance RW, and wire capacitance CW]

To analyze the delay from one inverter to the next, we analyze how long it takes the capacitive load of the second (destination) inverter to charge up from ground to VDD, or to discharge from VDD to ground. In doing this analysis, the gate side of the driving inverter is irrelevant and can be removed (trimmed). Similarly, the pullup resistor, pulldown resistor, and parasitic capacitance of the destination inverter can also be removed.

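RC-Network for Timing Analysis (trimmed)

[Figure: trimmed RC-network: Rpu to VDD, Rpd to GND, and Cp at the driver output b, followed by RV, RW, and CW, ending at the load capacitor CL]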

A Circuit with Fanout

We will look at one more example of inverters and their RC-network before beginning the timing analysis of these networks.

[Figure: an inverter from a to b that fans out to two inverters (outputs c and d), shown at the gate level, as a gate-level physical layout, at the transistor level (VDD, GND), and at the mask level]

[Figure: RC-network of the fanout circuit: each of the three inverters has Rpu, Rpd, Cp, and CL; the interconnect from b has wire segments (RW1, CW1), (RW2, CW2), and (RW3, CW3) and two vias RV]

RC-Network for Timing Analysis (trimmed)

[Figure: trimmed RC-network of the fanout circuit: driver Rpu, Rpd, and Cp; wire segments (RW1, CW1) and (RW2, CW2); two vias RV; two load capacitors CL]

We will use this circuit as our primary example for the analog and Elmore timing models, so we draw a simplified version of the trimmed RC-network before proceeding.

RC-Network for Timing Analysis (cleaned up)

[Figure: cleaned-up RC-network of the fanout circuit: driver Rpu, Rpd, and Cp; wire segments (RW1, CW1) and (RW2, CW2); two vias RV, each leading to a load capacitor CL]

5.4.2 Derivation of Analog Timing Model

The primary purpose of our timing model is to provide a mechanism to calculate the approximate delay of a circuit, for example, to say that a gate has a delay of 100 ps. The actual gate behaviour is a complicated function of the input signal behaviour.

The waveforms below are all possible behaviours of the same circuit. From these various waveforms, it would be very difficult to claim that the circuit has a specific delay value.

[Figure: input voltage and output voltage versus time, for a slow input and for a fast input]

Steps Toward Approximation

We begin with two simplifications as steps toward calculating a single delay value for a circuit.

1. Look at the circuit’s response to a step-function input.

2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.

These values of 65% VDD and 35% VDD are “trip points”.


Definition Trip Points: A high or ’1’ trip point is the voltage level where an upwards transition means the signal represents a ’1’.

A low or ’0’ trip point is the voltage level where a downwards transition means the signal represents a ’0’.

In the figure below the gray line represents the actual voltage on a signal. The black line is the digital discretization of the analog signal.

[Figure: analog voltage (gray) and its digital discretization (black) for signals a and b]
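The discretization described by the two trip points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `digitize`, the sample waveform, and the choice to hold the previous digital value while the voltage is between the two trip points are ours, not part of the notes.

```python
def digitize(samples, vdd=1.0, high_trip=0.65, low_trip=0.35, initial=0):
    """Convert analog samples to 0/1 using a high and a low trip point.

    The digital value changes to 1 only when the voltage rises to the high
    trip point, and changes to 0 only when it falls to the low trip point.
    Between the trip points the previous digital value is held (assumption).
    """
    level = initial
    digital = []
    for v in samples:
        if level == 0 and v >= high_trip * vdd:
            level = 1
        elif level == 1 and v <= low_trip * vdd:
            level = 0
        digital.append(level)
    return digital


# Hypothetical waveform, as fractions of VDD.
wave = [0.0, 0.3, 0.6, 0.7, 0.9, 0.5, 0.4, 0.3, 0.1]
print(digitize(wave))
```

With this made-up waveform, the digital signal rises at the sample that first reaches 0.65 VDD and falls at the sample that first drops to 0.35 VDD.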

Node Numbering, Initial Conditions

To motivate our derivation of the analog timing model, we will use the inverter that fans out to two other inverters as our example circuit.

• The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor.

• All nodes except the source start at GND.

• We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).

The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, and we calculate the voltage when we turn on the N-transistor (connect it to GND).

[Figure: cleaned-up RC-network of the fanout circuit with the nodes numbered 0 (source) to 5 and the lumped resistors labelled R1 to R5]

Define: Path and Downstream

We still have a few more preliminaries to get through. To discuss the structure of a network, we introduce two terms: path and downstream.

Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}

Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}
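The two definitions can be checked with a small Python sketch. The tree structure in `PARENT` is an assumption: it is inferred from the examples above (path(3), down(2), down(3)) and from the Elmore resistances used later for this circuit, since the figure itself is not reproduced here. The function names are ours.

```python
# Parent of each node in the example RC tree (node 0 is the source).
# Resistor Rk is the resistor immediately upstream of node k.
# Assumed topology: 0 - 1 - 2 - 3 - 4, with node 5 branching off node 2.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(node):
    """Set of resistors between the source and `node`."""
    resistors = set()
    while node != 0:
        resistors.add(f"R{node}")
        node = PARENT[node]
    return resistors


def down(node):
    """Set of capacitors whose charging current flows through `node`."""
    return {f"C{k}" for k in PARENT if f"R{node}" in path(k)}


print(sorted(path(3)))
print(sorted(down(2)))
print(sorted(down(3)))
```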

5.4.2.1 Example Derivation: Equation for Voltage at Node 3

As a concrete example of deriving the analog timing model, we derive the equation for the voltage at Node 3 in our example circuit. After this concrete example, we do the general derivation.

\[ V_3(t) = V_0(t) - (\text{voltage drop from Node 0 to Node 3}) \]

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3:

\[ V_3(t) = V_0(t) - \sum_{r \in path(3)} R_r I_r(t) = V_0(t) - \bigl( R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t) \bigr) \]

The current through a resistor is the sum of the currents through all of the downstream capacitors:

\[ I_r(t) = \sum_{c \in down(r)} I_c \]

\[ I_1(t) = I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_2(t) = I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_3(t) = I_{c3} + I_{c4} \]

Substitute $I_r$ into the equation for $V_3$:

\[ V_3(t) = V_0(t) - \bigl[ R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \bigr] \]

Use associativity to group terms by currents.

\[ V_3(t) = V_0(t) - \bigl[ I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \bigr] \]


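Current through a capacitor:

\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]

Substitute $I_c$ into the equation for $V_3$:

\[ V_3(t) = V_0(t) - \Bigl[ R_1 C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + (R_1 + R_2) C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + (R_1 + R_2) C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr] \]

In each of the resistance-capacitance terms (e.g., $(R_1 + R_2) C_{c2}$), the resistors are the set of resistors on the path to the capacitor that are also on the path to Node 3. We capture this observation by defining the Elmore resistance $R_{i,k}$ for a pair of nodes i and k to be the resistors on the path to Node i that are also on the path to Node k.

\[ R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \]

\[ R_{3,1} = R_1 \]
\[ R_{3,2} = R_1 + R_2 \]
\[ R_{3,3} = R_1 + R_2 + R_3 \]
\[ R_{3,4} = R_1 + R_2 + R_3 \]
\[ R_{3,5} = R_1 + R_2 \]

Substitute $R_{i,k}$ into $V_3$:

\[ V_3(t) = V_0(t) - \Bigl[ R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr] \]

We are left with a system of dependent equations, in that $V_3$ is dependent upon all of the voltages in the circuit. In the general derivation that follows next, we repeat the steps we just did, and then show how the Elmore delay is an approximation of this system of dependent differential equations.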
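As a cross-check on the Elmore resistances for Node 3, the shared-path rule can be evaluated mechanically. The sketch below is illustrative only: the `PARENT` table encodes the tree topology that we assume from the path and down examples, and `elmore_resistance` is a name we introduce here.

```python
# Assumed topology of the example RC tree: 0 - 1 - 2 - 3 - 4, node 5 off node 2.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(node):
    """Indices of the resistors between the source (node 0) and `node`."""
    indices = set()
    while node != 0:
        indices.add(node)
        node = PARENT[node]
    return indices


def elmore_resistance(i, k):
    """Indices of the resistors that lie on both path(i) and path(k)."""
    return sorted(path(i) & path(k))


for k in range(1, 6):
    terms = " + ".join(f"R{r}" for r in elmore_resistance(3, k))
    print(f"R3,{k} = {terms}")
```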

5.4.2.2 General Derivation

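We derive the equation for the voltage at Node i as a function of the voltage at Node 0.

\[ V_i(t) = V_0(t) - (\text{voltage drop from Node 0 to Node } i) \]

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node i:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r I_r(t) \]

The current through a resistor is the sum of the currents through all of the downstream capacitors:

\[ I_r(t) = \sum_{c \in down(r)} I_c \]

Substitute $I_r$ into the equation for $V_i$:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \Bigl( R_r \sum_{c \in down(r)} I_c \Bigr) \]

Use associativity to push $R_r$ into the summation over c:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \; \sum_{c \in down(r)} R_r I_c \]

Current through a capacitor:

\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]

Substitute $I_c$ into the equation for $V_i$:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \; \sum_{c \in down(r)} R_r C_c \frac{\partial V_c(t)}{\partial t} \]

A little bit of handwaving to prepare for Elmore resistance:

\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} \Bigl( \sum_{r \in path(i) \cap path(k)} R_r \Bigr) C_k \frac{\partial V_k(t)}{\partial t} \]

Define Elmore resistance $R_{i,k}$:

\[ R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \]

Substitute $R_{i,k}$ into $V_i$:

\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} R_{i,k} C_k \frac{\partial V_k(t)}{\partial t} \]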


The final equation above is an exact description of the behaviour of the RC-network model of a circuit. More accurate models would result in more complicated equations, but even this equation is more complicated than we want for calculating a simple number for the delay through a circuit.

The equation is actually a system of dependent equations, in that each voltage $V_i$ is dependent upon all of the voltages $V_c$ in the circuit. Spice and other analog simulators use numerical methods to calculate the behaviour of these systems. Elmore’s contribution was to find a simple approximation of the behaviour of such systems.
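To give a feel for what "numerical methods" means here, the following Python sketch integrates the RC-tree equations with forward Euler steps. It is a toy illustration, not how Spice works internally: the function name, the dictionary-based tree encoding, the step size, and the unit component values are all assumptions made for this example.

```python
def simulate_rc_tree(parent, R, C, t_end, dt=1e-3, v_source=1.0):
    """Forward-Euler simulation of an RC tree driven by a step at node 0.

    parent[k] is the node upstream of node k, R[k] is the resistor between
    parent[k] and k, and C[k] is the capacitor from node k to ground.
    All nodes except the source start at 0 V.
    """
    nodes = sorted(C)
    v = {k: 0.0 for k in nodes}
    v[0] = v_source
    for _ in range(int(round(t_end / dt))):
        # Current flowing into each node through its upstream resistor.
        i_in = {k: (v[parent[k]] - v[k]) / R[k] for k in nodes}
        new_v = dict(v)
        for k in nodes:
            # Current leaving node k toward its children.
            i_out = sum(i_in[c] for c in nodes if parent[c] == k)
            new_v[k] = v[k] + dt * (i_in[k] - i_out) / C[k]
        v = new_v
    return v


# Single RC stage with R = C = 1, so the time constant is 1.
single = simulate_rc_tree({1: 0}, {1: 1.0}, {1: 1.0}, t_end=1.0)
print(round(single[1], 3))
```

For the single RC stage, the simulated voltage after one time constant should be close to 1 − 1/e, which is the behaviour that the Elmore model captures with a single exponential.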

5.4.3 Elmore Timing Model

• Assume thatV0(t) is a step function from 0 to 1 at time 0.

• Derive upper and lower bounds forVi(t).

• Find RC time constants for upper and lower bounds.

• Elmore delay is guaranteed to be between upper and lower bounds.

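[Figure: step response of the RC-network model, the Elmore model, and the upper and lower bounds, with TDi − TRi, TP − TRi, TRi, TDi, and TP marked on the time axis]

Equations for Curves

Upper bound:

\[ V_i(t) \le \begin{cases} 1 + \dfrac{t - T_{Di}}{T_P} & 0 \le t \le T_{Di} - T_{Ri} \\[2ex] 1 - \dfrac{T_{Ri}}{T_P} \, e^{(T_{Di} - T_{Ri} - t)/T_{Ri}} & T_{Di} - T_{Ri} \le t < \infty \end{cases} \]

Elmore model:

\[ V_i(t) \approx 1 - e^{-t/T_{Di}} \]

Lower bound:

\[ V_i(t) \ge \begin{cases} 0 & 0 \le t \le T_{Di} - T_{Ri} \\[1ex] 1 - \dfrac{T_{Di}}{t + T_{Ri}} & T_{Di} - T_{Ri} \le t \le T_P - T_{Ri} \\[2ex] 1 - \dfrac{T_{Di}}{T_P} \, e^{(T_P - T_{Ri} - t)/T_P} & T_P - T_{Ri} \le t < \infty \end{cases} \]

Fact: $0 \le T_{Ri} \le T_{Di} \le T_P$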


Definitions of Time Constants

\[ T_{Ri} = \frac{\sum_{k \in Nodes} R_{k,i}^2 \, C_k}{R_{i,i}} \qquad \text{(mathematical artifact, no intuitive meaning)} \]

\[ T_{Di} = \sum_{k \in Nodes} R_{k,i} \, C_k \qquad \text{(Elmore delay)} \]

\[ T_P = \sum_{k \in Nodes} R_{k,k} \, C_k \qquad \text{(RC-time constant for lumped network)} \]
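The three time constants are easy to evaluate once the shared-path resistances are known. The sketch below does this for the five-node example tree. The tree topology is the one we assume from the earlier path and down examples, and the unit resistor and capacitor values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Assumed topology of the example RC tree: 0 - 1 - 2 - 3 - 4, node 5 off node 2.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
R = {k: 1.0 for k in PARENT}  # hypothetical values: every resistor is 1
C = {k: 1.0 for k in PARENT}  # hypothetical values: every capacitor is 1


def path(node):
    """Indices of the resistors between the source (node 0) and `node`."""
    indices = set()
    while node != 0:
        indices.add(node)
        node = PARENT[node]
    return indices


def shared_resistance(i, k):
    """Elmore resistance: total resistance shared by path(i) and path(k)."""
    return sum(R[r] for r in path(i) & path(k))


def time_constants(i):
    """Return (T_Ri, T_Di, T_P) for node i."""
    t_p = sum(shared_resistance(k, k) * C[k] for k in PARENT)
    t_d = sum(shared_resistance(k, i) * C[k] for k in PARENT)
    t_r = sum(shared_resistance(k, i) ** 2 * C[k] for k in PARENT) / shared_resistance(i, i)
    return t_r, t_d, t_p


t_r3, t_d3, t_p = time_constants(3)
print(t_r3, t_d3, t_p)
```

With unit values, node 3 gives T_R3 = 9, T_D3 = 11, and T_P = 13, which is consistent with the stated ordering of the three constants.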

Picking the Trip Point

\[ V_i(t) = V_{DD} \left( 1 - e^{-t/T_{Di}} \right) \]

Pick a trip point of $V_i(t) = 0.65 V_{DD}$, then solve for t:

\[ 0.65 V_{DD} = V_{DD} \left( 1 - e^{-t/T_{Di}} \right) \]

\[ 0.35 = e^{-t/T_{Di}} \]

Take ln of both sides:

\[ \ln 0.35 = \ln \left( e^{-t/T_{Di}} \right) = -t/T_{Di} \]

\[ \ln 0.35 = -1.05 \approx -1.0 \]

\[ -1.0 = -t/T_{Di} \]

\[ t = T_{Di} \]

By picking a trip point of 0.65 VDD, the time for $V_i$ to reach the trip point is approximately the Elmore delay.
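The approximation in the last step can be checked numerically. This short sketch solves the single-exponential model for the time at which a given trip fraction is reached; the function name `time_to_trip` is ours.

```python
import math


def time_to_trip(t_d, trip_fraction=0.65):
    """Time for V(t) = VDD * (1 - exp(-t / t_d)) to reach trip_fraction * VDD."""
    return -t_d * math.log(1.0 - trip_fraction)


# With t_d = 1, the result is the trip time in units of the Elmore delay.
print(time_to_trip(1.0))
print(time_to_trip(1.0, trip_fraction=1.0 - math.exp(-1.0)))
```

The first value is about 1.05 Elmore delays, which is the 5% rounding made above. The second call shows that a trip point of about 63% of VDD would make the trip time equal to the Elmore delay exactly.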


5.4.4 Examples of Using Elmore Delay

5.4.4.1 Interconnect with Single Fanout

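[Figure: gate G1 connected to gate G2 through antifuses Ra1 to Ra4 and wire segments Rw1 to Rw3, with wire capacitances C1 to C3; below it, the corresponding RC tree with driver Rpu, Rpd, and Cp at G1, node capacitances C1, C2, C3, and the gate capacitance CG2 at G2]

Legend:
G*   gate
C*   capacitance on wire
Ra*  resistance through antifuse
Rw*  resistance through wire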

Question: Calculate delay from gate 1 to gate 2

Answer:

Gate 2 represents node 4 on the RC tree.


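\[ \tau_{D4} = \sum_{k=1}^{4} ER_{k,4} C_k = ER_{1,4} C_1 + ER_{2,4} C_2 + ER_{3,4} C_3 + ER_{4,4} C_4 \]

Here $C_4$ is the gate capacitance $C_{G2}$:

\[ \tau_{D4} = (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3} + R_{a4}) C_{G2} + (R_{a1} + R_{w1} + R_{a2} + R_{w2} + R_{a3} + R_{w3}) C_3 + (R_{a1} + R_{w1} + R_{a2} + R_{w2}) C_2 + (R_{a1} + R_{w1}) C_1 \]

Approximate $R_a \gg R_w$:

\[ \tau_{D4} = R_{a1} C_1 + (R_{a1} + R_{a2}) C_2 + (R_{a1} + R_{a2} + R_{a3}) C_3 + (R_{a1} + R_{a2} + R_{a3} + R_{a4}) C_{G2} \]

Approximate $R_{ai} = R_{aj}$:

\[ \tau_{D4} = 4 R_a C_{G2} + 3 R_a C_3 + 2 R_a C_2 + R_a C_1 \]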
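For a chain with no branches, the Elmore delay to the last node can be computed with a running sum of the upstream resistance. The sketch below illustrates this; the function name and the numeric values are hypothetical and are not taken from any device data.

```python
def chain_elmore_delay(resistances, capacitances):
    """Elmore delay to the last node of an unbranched RC chain.

    resistances[k] is the resistor just upstream of node k+1 and
    capacitances[k] is the capacitor at node k+1.
    """
    total = 0.0
    upstream = 0.0
    for r, c in zip(resistances, capacitances):
        upstream += r          # resistance shared with the path to the last node
        total += upstream * c
    return total


# Hypothetical values: four equal antifuses (wire resistance neglected) and
# capacitances C1, C2, C3, CG2 of 1, 2, 3, and 4 units.
ra = 1.0
print(chain_elmore_delay([ra] * 4, [1.0, 2.0, 3.0, 4.0]))
```

With these made-up values the result is 1·1 + 2·2 + 3·3 + 4·4 = 30, matching the form 4(Ra)CG2 + 3(Ra)C3 + 2(Ra)C2 + (Ra)C1.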

Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?

Answer:

\[ \tau_{Di} = \sum_{k=1}^{n} ER_{k,i} C_k \]

Assume all resistances and capacitances are the same values (R and C), and assume that all intermediate nodes are along the path between the two gates of interest.

\[ ER_{k,i} = k \times R \]

\[ \tau_{Di} = \Bigl( \sum_{k=1}^{n} k \Bigr) RC \]

Using the mathematical theorem:

\[ \sum_{i=1}^{n} i = \frac{(n+1)n}{2} \approx \frac{n^2}{2} \]

We simplify the delay equation:

\[ \tau_{Di} = \Bigl( \sum_{k=1}^{n} k \Bigr) RC \approx \frac{n^2}{2} RC \]

We see that the delay is proportional to the square of the number of antifuses along the path, so doubling the number of antifuses and wires roughly quadruples the wire delay.
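A quick numerical check of the quadratic growth, under the same simplifying assumptions (all resistances equal to R, all capacitances equal to C). The function name and the chosen values of n are ours.

```python
def uniform_chain_delay(n, r=1.0, c=1.0):
    """Elmore delay to the end of a chain of n identical RC stages."""
    return sum(k * r * c for k in range(1, n + 1))


ratio_small = uniform_chain_delay(8) / uniform_chain_delay(4)
ratio_large = uniform_chain_delay(200) / uniform_chain_delay(100)
print(ratio_small, ratio_large)
```

For short chains the ratio is somewhat below 4 because of the linear term in n(n+1)/2; for long chains it approaches 4.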

5.4.4.2 Interconnect with Multiple Gates in Fanout

[Figure: gate G1 driving gates G2 and G3, shown as a schematic and as a layout]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2

Answer:

1. There are a total of 7 nodes in the circuit (n = 7).

2. Label interconnect with resistance and capacitance identifiers.

[Figure: layout of G1, G2, and G3 with antifuse resistances R1 to R6 and capacitances C1 to C7 labelled]

3. Draw RC tree

[Figure: RC tree driven by G1 (Rpu, Rpd, Cp) with nodes n1 to n7, resistors R1 to R6, and capacitors C1 to C7; G2 is at n5 and G3 is at n7]

4. G2 is node 5 in the circuit (i = 5).

5. Elmore delay equations

\[ \tau_{D5} = \sum_{k=1}^{7} ER_{k,5} C_k = ER_{1,5} C_1 + ER_{2,5} C_2 + ER_{3,5} C_3 + ER_{4,5} C_4 + ER_{5,5} C_5 + ER_{6,5} C_6 + ER_{7,5} C_7 \]

6. Elmore resistances

\[ ER_{1,5} = R_1 = R \]
\[ ER_{2,5} = R_1 + R_2 = 2R \]
\[ ER_{3,5} = R_1 + R_2 = 2R \]
\[ ER_{4,5} = R_1 + R_2 + R_3 = 3R \]
\[ ER_{5,5} = R_1 + R_2 + R_3 + R_4 = 4R \]
\[ ER_{6,5} = R_1 + R_2 = 2R \]
\[ ER_{7,5} = R_1 + R_2 = 2R \]

7. Plug resistances into delay equations

\[ \tau_{D5} = (R) C_1 + (2R) C_2 + (2R) C_3 + (3R) C_4 + (4R) C_5 + (2R) C_6 + (2R) C_7 \]

Delay from G1 to G3

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G3

Answer:

1. G3 is node 7 in the circuit (i = 7).

2. Elmore delay equations

\[ \tau_{Di} = \sum_{k=1}^{n} ER_{k,i} C_k \]

\[ \tau_{D7} = \sum_{k=1}^{7} ER_{k,7} C_k = ER_{1,7} C_1 + ER_{2,7} C_2 + ER_{3,7} C_3 + ER_{4,7} C_4 + ER_{5,7} C_5 + ER_{6,7} C_6 + ER_{7,7} C_7 \]

3. Elmore resistances

\[ ER_{1,7} = R_1 = R \]
\[ ER_{2,7} = R_1 + R_2 = 2R \]
\[ ER_{3,7} = R_1 + R_2 = 2R \]
\[ ER_{4,7} = R_1 + R_2 = 2R \]
\[ ER_{5,7} = R_1 + R_2 = 2R \]
\[ ER_{6,7} = R_1 + R_2 + R_5 = 3R \]
\[ ER_{7,7} = R_1 + R_2 + R_5 + R_6 = 4R \]

4. Plug resistances into delay equations

\[ \tau_{D7} = (R) C_1 + (2R) C_2 + (2R) C_3 + (2R) C_4 + (2R) C_5 + (3R) C_6 + (4R) C_7 \]


Delay to G2 vs G3

Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?

Answer:

1. Equations for delay to G2 ($\tau_{D5}$) and G3 ($\tau_{D7}$)

\[ \tau_{D5} = (R) C_1 + (2R) C_2 + (2R) C_3 + (3R) C_4 + (4R) C_5 + (2R) C_6 + (2R) C_7 \]

\[ \tau_{D7} = (R) C_1 + (2R) C_2 + (2R) C_3 + (2R) C_4 + (2R) C_5 + (3R) C_6 + (4R) C_7 \]

2. Difference in delays

\[ \tau_{D5} - \tau_{D7} = R C_4 + 2 R C_5 - R C_6 - 2 R C_7 \]

3. Compare capacitances

\[ C_4 \approx C_6 \]
\[ C_5 \approx C_7 \]

4. Conclusion: delays are approximately equal.
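The Elmore resistances for both destinations can be generated from the tree structure rather than by inspection. The sketch below is a hedged reconstruction: the `TREE` table is inferred from the resistances listed in the two answers (for example, ER3,5 = R1 + R2 implies that C3 hangs off node 2 with no antifuse of its own, which we model as a zero-resistance edge), and the capacitance values are hypothetical.

```python
# node -> (parent node, upstream resistance in units of R); node 0 is driver G1.
# Upstream resistors: n1:R1, n2:R2, n3:none (wire only), n4:R3, n5:R4, n6:R5, n7:R6.
TREE = {1: (0, 1), 2: (1, 1), 3: (2, 0), 4: (2, 1), 5: (4, 1), 6: (2, 1), 7: (6, 1)}


def path(node):
    """Nodes whose upstream edges lie between the driver and `node`."""
    nodes = set()
    while node != 0:
        nodes.add(node)
        node = TREE[node][0]
    return nodes


def coefficients(i):
    """Elmore resistances ER(k, i) for k = 1..7, in units of R."""
    return [sum(TREE[n][1] for n in path(i) & path(k)) for k in sorted(TREE)]


def delay(i, caps):
    """Elmore delay to node i, in units of R times the capacitance unit."""
    return sum(er * caps[k] for er, k in zip(coefficients(i), sorted(TREE)))


# Hypothetical capacitances with C4 = C6 and C5 = C7.
caps = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 2, 7: 3}
print(coefficients(5), coefficients(7))
print(delay(5, caps), delay(7, caps))
```

When the capacitances on the two branches match, the two delays come out identical, which agrees with the conclusion above.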

5.5 Practical Usage of Timing Analysis

Speed Grading

• Fabs sort chips according to their speed (sorting is known as speed grading or speed binning)

• Faster chips are more expensive

• In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.

• Propagation delay is the average of the rising and falling propagation delays.

• Typical speed grades for FPGAs:

Std   standard speed grade
1     15% faster than Std
2     25% faster than Std
3     35% faster than Std

Worst-Case Timing

• Maximum delay in CMOS. When?

– Minimum voltage

– Maximum temperature

– Slow-slow conditions (the process variation/corner that results in a slow p-channel and a slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.

• Increasing temperature increases delay

– ⇑ Temp ⇒ ⇑ atomic vibration

– ⇑ atomic vibration ⇒ ⇑ collisions with the electrons carrying current

– ⇑ collisions with the electrons carrying current ⇒ ⇑ resistivity

– ⇑ resistivity ⇒ ⇑ delay

• Increasing supply voltage decreases delay

– ⇑ supply voltage ⇒ ⇑ current

– ⇑ current ⇒ ⇓ load capacitor charge time

– ⇓ load capacitor charge time ⇒ ⇓ total delay

• A derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions

• ASIC manufacturers’ classes, based on variety of environments:

Class        VDD         Temperature (TA ambient temp or TC case temp)
Commercial   5V ± 5%     0 to +70C
Industrial   5V ± 10%    -40 to +85C
Military     5V ± 10%    -55 to +125C

• What is important is the transistor temperature inside the chip, TJ (junction temperature)

5.5.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clockspeed at which it will run reliably.

Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.

A “speed bin” is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).

5.5.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect.

When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

5.5.2 Worst Case Timing

5.5.2.1 Fanout delay

In Smith’s book, Table 5.2 (Fanout delay) combines two separate parameters:

• capacitive load delay

• interconnect delay

into a single parameter (fanout). This is common, and fine.

But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.

5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.

⇑ Temp ⇒ ⇑ Delay
⇑ Supply voltage ⇒ ⇓ Delay

Temperature

• ⇑ Temp ⇒ ⇑ Delay

– ⇑ Temp ⇒ ⇑ Resistivity of wires

– As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.

• ⇑ Supply voltage ⇒ ⇓ Delay

– ⇑ Supply voltage ⇒ ⇑ current (V = IR)

– ⇑ current ⇒ ⇓ time to charge load capacitors to threshold voltage


Derating Factor Definition

A “derating factor” is a number to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from table 5.3 in Smith’s book (Actel Act 3 derating factors):

Derating factor   Temp    Vdd
1.17              125C    4.5V
1.00              70C     5.0V
0.63              -55C    5.5V
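Applying a derating factor is a single multiplication. The sketch below uses only the three table entries excerpted above; the 10 ns nominal delay is a made-up number, and treating the factor 1.00 row as the nominal condition is our reading of the table.

```python
# (temperature in C, Vdd in V) -> derating factor, from the excerpt above.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}


def derated_delay(nominal_ns, temp_c, vdd):
    """Scale a delay quoted at the factor-1.00 condition to another condition."""
    return nominal_ns * DERATING[(temp_c, vdd)]


nominal = 10.0  # hypothetical delay in ns at 70C and 5.0V
print(derated_delay(nominal, 125, 4.5))
print(derated_delay(nominal, -55, 5.5))
```

Conditions that are not in the table would need the full table from the data book (or interpolation, if the manufacturer permits it); this sketch deliberately raises a `KeyError` for them.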


5.6 Timing Analysis Problems

P5.1 Terminology

For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer which time periods (one or more of t1 – t9 or NONE) are examples of the term.

NOTES:

1. The timing diagram shows the limits of the allowed times (either minimum or maximum).

2. All timing parameters are non-negative.

3. The signal “a” is the input to a rising-edge flop and “b” is the output. The clock is “clk1”.

[Figure: timing diagram of clk1, clk2, a, and b with marked intervals t1 to t11; the legend distinguishes "signal may change" from "signal is stable"]

clock skew
clock period
setup time
hold time

P5.2 Hold Time Violations

P5.2.1 Cause

What is the cause of a hold time violation?


P5.2.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?

P5.2.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?

P5.3 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q,setup, and hold times; and answer whether it is active-high or active-low.

Gate Delays
AND   4
OR    2
NOT   1

[Figure: circuit with inputs en and d and output q]


P5.4 Critical Path and False Path

Find the critical path through the following circuit:

[Figure: circuit with signals a to m]


P5.5 Critical Path

[Figure: circuit with signals a to m]


Assume all delay and timing factors other than combinational logic delay are negligible.

P5.5.1 Longest Path

List the signals in the longest path through this circuit.

P5.5.2 Delay

What is the combinational delay along the longest path?

P5.5.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?

P5.5.4 Critical Path or False Path?

Is the longest path that you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.


P5.6 YACP: Yet Another Critical Path

Find the critical path in the circuit below.

[Figure: circuit with signals a to h]


P5.7 Timing Models

In your next job, you have been told to use a “fanout” timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that.

For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.

G1

G2

G3

G4

G5

G1

G2 G3 G4 G5

Gate

Interconnect level 2


Symbol Description Capacitance

Cg

Cx

Cy

Resistance

Antifuse R

0

0

0

0

Assumptions:

• The capacitance of a node on a wire is independent of where the node is located on the wire.


P5.8 Short Answer

P5.8.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?

P5.8.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?

P5.8.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?

P5.9 Worst Case Conditions and Derating Factor

Assume that we have a ’Std’ speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:

P5.9.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

P5.9.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).

P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction temperature is 105C).

Chapter 6

Power Analysis and Power-Aware Design

6.1 Overview

6.1.1 Importance of Power and Energy

• Laptops, PDA, cell-phones, etc — obvious!

• For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost

• Approx 25% of operating expense of server farm goes to energy bills

• (Dis)Comfort of Unix labs in E2

• Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros)

• High-speed microprocessors today can run so hot that they will damage themselves — Athlonreliability problems, Pentium 4 processor thermal throttling

• In 2000, information technology consumed 8% of total power in US.

• Future power viruses: cell phone viruses that cause the cell phone to run in full power mode and consume the battery very quickly; PC viruses that cause the CPU to melt down batteries

6.1.2 Industrial Names and Products

All of the articles and papers below are linked to from the Documentation page on the E&CE 327 web site.

Overview white paper by Intel:

PC Energy-Efficiency Trends and Technologies
An 8-page overview of energy and power trends, written in 2002. Available from the web at an intolerably long URL.


AMD’s Athlon PowerNow!
Reduce power consumption in laptops when running on battery by allowing software to reduce clock speed and supply voltage when performance is less important than battery life.

Intel Speedstep
Reduce power consumption in laptops when running on battery by reducing clock speed to 70-80% of normal.

Intel X-Scale
An ARM5-compatible microprocessor for low-power systems:

http://developer.intel.com/design/intelxscale/

Synopsys PowerMill
A simulator that estimates power consumption of the circuit as it is simulated:

http://www.synopsys.com/products/etg/powermillds.html

DEC / Compaq / HP Itsy
A tiny but powerful PDA-style computer running linux and X-windows. Itsy was created in 1998 by DEC’s Western Research Laboratory to be an experimental platform in low-power, energy-efficient computing. Itsy led to the iPAQ PocketPC.

www.hpl.hp.com/techreports/Compaq-DEC/WRL-2000-6.html

www.hpl.hp.com/research/papers/2003/handheld.html

Satellites
Satellites run on solar power and batteries. They travel great distances doing very little, then have a brief period of very intense activity as they pass by an astronomical object of interest. Satellites need efficient means to gather and store energy while they are flying through space. Satellites need powerful, but energy efficient, computing and communication devices to gather, process, and transmit data. Designing computing devices for satellites is an active area of research and business.

6.1.3 Power vs Energy

Most people talk about “power” reduction, but sometimes they mean “power” and sometimes “energy.”

• Power minimization is usually about heat removal

• Energy minimization is usually about battery life or energy costs

Type     Units    Equivalent Types   Equations
Energy   Joules   Work               $= \text{Volts} \times \text{Coulombs} = \tfrac{1}{2} \times C \times \text{Volts}^2$
Power    Watts    Energy / Time      $= \text{Volts} \times I = \text{Joules}/\text{sec}$


6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?

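\[ \text{Energy} = \text{Volts} \times \text{Coulombs} \]

\[ \text{Power} = \frac{\text{Energy}}{\text{Time}} \]

Batteries are rated in Amp-hours at a voltage.

\[
\begin{aligned}
\text{battery} &= \text{Amps} \times \text{Seconds} \times \text{Volts} \\
&= \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} \\
&= \text{Coulombs} \times \text{Volts} \\
&= \text{Energy}
\end{aligned}
\]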

Batteries store energy.

6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease energy consumed.

Work and energy are the same units, therefore to extend battery life, we truly want to improve efficiency.

“Power efficiency” of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

\[
\frac{\text{MIPS}}{\text{Watts}}
= \frac{\text{millions of instructions}}{\text{Seconds}} \times \frac{\text{Seconds}}{\text{Energy}}
= \frac{\text{millions of instructions}}{\text{Energy}}
\]

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.

(This assumes that all instructions perform the same amount of work!)


6.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700 MHz, has a CPI of 1.0, and burns 70 W of power. My battery is rated at 10 V and 2.5 AH. Assuming all of my computer’s clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?

Answer:

Outline of approach:

1. Unify the units
2. Calculate amount of energy stored in battery
3. Calculate energy consumed by each simulation step
4. Calculate number of simulation steps that can be run

Unify the units:

Amp   (current)                                  Coulomb/sec
Volt  (potential difference, energy per charge)  Joule/Coulomb
Watt  (power)                                    Joule/sec

Energy stored in battery:

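Check the equation by checking the units:

\[
\begin{aligned}
E_{\text{batt}} &= \text{AmpHours} \times V_{\text{batt}} \\
&= \text{Amp} \times \text{hour} \times \frac{\text{sec}}{\text{hour}} \times \text{Volt} \\
&= \frac{\text{Coulomb}}{\text{sec}} \times \text{hour} \times \frac{\text{sec}}{\text{hour}} \times \frac{\text{Joule}}{\text{Coulomb}} \\
&= \text{Joule}
\end{aligned}
\]

The units match, so do the math:

\[
\begin{aligned}
E_{\text{batt}} &= 2.5\,\text{AH} \times 3600\,\text{sec/hour} \times 10\,\text{V} \\
&= 90\,000\ \text{Joules}
\end{aligned}
\]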

Energy per simulation step:


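Check the units:

\[
\begin{aligned}
E_{\text{step}} &= \text{Watts} \times \dots \\
&= \frac{\text{Joule}}{\text{sec}} \times \frac{\text{sec}}{\text{cyc}} \times \frac{\text{cyc}}{\text{instr}} \times \frac{\text{instr}}{\text{step}} \\
&= \frac{\text{Joule}}{\text{step}}
\end{aligned}
\]

The units check, so do the math:

\[
\begin{aligned}
E_{\text{step}} &= 70\,\text{Watts} \times \frac{1}{700 \times 10^{6}\,\text{cyc/sec}} \times 1.0\,\text{cyc/instr} \times 10^{6}\,\text{instr/step} \\
&= 0.1\ \text{Joule/step}
\end{aligned}
\]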

Number of steps:

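\[
\begin{aligned}
\text{NumSteps} &= \frac{E_{\text{batt}}}{E_{\text{step}}} \\
&= \frac{90\,000}{0.1} \\
&= 900\,000\ \text{steps}
\end{aligned}
\]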
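The calculation above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and variable names are illustrative and are not part of the course material.

```python
# Sketch of the battery / simulation-step arithmetic from the worked example.
# All names here are illustrative assumptions.
SECONDS_PER_HOUR = 3600

def battery_energy_joules(amp_hours, volts):
    """Energy stored in a battery rated in Amp-hours at a given voltage."""
    return amp_hours * SECONDS_PER_HOUR * volts

def energy_per_step_joules(power_watts, clock_hz, cpi, instr_per_step):
    """Energy consumed by one simulation step."""
    seconds_per_step = instr_per_step * cpi / clock_hz
    return power_watts * seconds_per_step

e_batt = battery_energy_joules(amp_hours=2.5, volts=10.0)
e_step = energy_per_step_joules(power_watts=70.0, clock_hz=700e6,
                                cpi=1.0, instr_per_step=1e6)
num_steps = e_batt / e_step
print(e_batt, e_step, num_steps)
```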

Question: If I use the SpeedStep feature of my computer, my computer runs at 600 MHz with 60 W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?

Answer:

Approach:

1. Calculate uptime with SpeedStep turned off (high power)
2. Calculate uptime with SpeedStep turned on (low power)
3. Calculate difference in uptimes

High-power uptime:


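\[
\begin{aligned}
T_H &= \frac{E_{\text{batt}}}{P_H} \\
&= \frac{90\,000\ \text{Watt-Secs}}{70\ \text{Watt}} \\
&= 1285\ \text{Secs} \\
&= 21\ \text{minutes}
\end{aligned}
\]

Low-power uptime:

\[
\begin{aligned}
T_L &= \frac{E_{\text{batt}}}{P_L} \\
&= \frac{90\,000\ \text{Watt-Secs}}{60\ \text{Watt}} \\
&= 1500\ \text{Secs} \\
&= 25\ \text{minutes}
\end{aligned}
\]

Difference in uptimes:

\[
\begin{aligned}
T_{\text{diff}} &= T_L - T_H \\
&= 25 - 21 \\
&= 4\ \text{minutes}
\end{aligned}
\]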
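As a rough check of the uptime figures, the sketch below repeats the division in Python (names are illustrative). Note that the text rounds each uptime to whole minutes before subtracting; without that rounding the difference is closer to 3.6 minutes.

```python
# Sketch: uptime on one battery charge at two power levels.
def uptime_seconds(battery_joules, power_watts):
    """Time for which a battery can supply a constant power draw."""
    return battery_joules / power_watts

t_high = uptime_seconds(90_000, 70)   # SpeedStep off
t_low = uptime_seconds(90_000, 60)    # SpeedStep on
diff_minutes = (t_low - t_high) / 60
print(t_high, t_low, diff_minutes)
```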

Analysis:

This question is based on data from a typical laptop. So, why are the predicted uptimes so much shorter than those experienced in reality?

Answer: The power consumption figures are the maximum peak power consumption of the laptop: disk spinning, fan blowing, bus active, all peripherals active, all modules on CPU turned on. In reality, laptops almost never experience their maximum power consumption.

Question: With SpeedStep activated, how many more simulation steps can I run onone battery?


Answer:

Clock speed is proportional to power consumption. In both high-power and low-power modes, the system runs the same number of clock cycles on the energy stored in the battery. So, we run the same number of simulation steps both with and without SpeedStep activated.

Analysis:

In reality, with SpeedStep activated, I am able to run more simulation steps.Why does the theoretical calculation disagree with reality?

Answer: In reality, the processor does not use 100% of the clock cycles for running the simulator. Many clock cycles are “wasted” while waiting for I/O from the disk, user, etc. When reducing the clock speed, a smaller number of clock cycles are wasted as idle clock cycles.

6.2 Power Equations

\[
\text{Power} = \underbrace{\text{SwitchPower} + \text{ShortPower}}_{\text{DynamicPower}} + \underbrace{\text{LeakagePower}}_{\text{StaticPower}}
\]

Dynamic Power dependent upon clock speed

Switching Power useful — charges up transistors

Short Circuit Power not useful — both N and P transistors are on

Static Power independent of clock speed

Leakage Power not useful — leaks around transistor

Dynamic power is proportional to how often signals change their value (switch).

• Roughly 20% of signals switch during a clock cycle.

• Need to take glitches into account when calculating activity factor. Glitches increase the activity factor.

• Equations for dynamic power contain clock speed and activity factor.


6.2.1 Switching Power

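[Figure: Charging a capacitor. Inverter input goes 1->0, output goes 0->1, driving CapLoad.]

[Figure: Discharging a capacitor. Inverter input goes 0->1, output goes 1->0, driving CapLoad.]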

\[ \text{energy to (dis)charge capacitor} = \tfrac{1}{2} \times \text{CapLoad} \times \text{VoltSup}^2 \]

When a capacitor $C$ is charged to a voltage $V$, the energy stored in the capacitor is $\tfrac{1}{2}CV^2$.

The energy required to charge the capacitor from 0 to $V$ is $CV^2$. Half of the energy ($\tfrac{1}{2}CV^2$) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.

When the capacitor discharges from $V$ to 0, the energy stored in the capacitor ($\tfrac{1}{2}CV^2$) is dissipated as heat through the pulldown resistance.

$f'$: frequency at which the inverter goes through a complete charge-discharge cycle. (eqn 15.4 in Smith)

\[ \text{average switching power} = f' \times \text{CapLoad} \times \text{VoltSup}^2 \]

ClockSpeed: clock speed
ActFact: average number of times that a signal switches from 0→1 or from 1→0 during a clock cycle

\[ \text{average switching power} = \tfrac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2 \]
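A minimal Python sketch of the switching-power formula follows. The numeric values (activity factor 0.2, 100 MHz clock, 10 fF load, 1.2 V supply) are made-up assumptions chosen only to exercise the formula; they do not come from the notes.

```python
# Sketch: average switching power for one signal.
def switching_power(act_fact, clock_speed_hz, cap_load_farads, volt_sup):
    """0.5 * ActFact * ClockSpeed * CapLoad * VoltSup^2, in Watts."""
    return 0.5 * act_fact * clock_speed_hz * cap_load_farads * volt_sup ** 2

# Hypothetical operating point (not from the notes).
p_full = switching_power(0.2, 100e6, 10e-15, 1.2)
# Halving the supply voltage cuts switching power to one quarter.
p_half = switching_power(0.2, 100e6, 10e-15, 0.6)
print(p_full, p_half, p_half / p_full)
```

The quadratic dependence on VoltSup is what makes supply-voltage reduction so attractive for power reduction.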


6.2.2 Short-Circuited Power

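[Figure: CMOS inverter with input Vi and output Vo, showing the short-circuit current IShort between VoltSup and GND. Gate-voltage waveform marking VoltThresh and VoltSup − VoltThresh, the regions where the P-transistor and N-transistor are on, and the interval TimeShort.]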

PwrShort = ActFact×ClockSpeed×TimeShort× IShort×VoltSup

6.2.3 Leakage Power

[Figure: Cross section of inverter (Vi, Vo, N-substrate) showing parasitic diode.]

[Figure: I-V curve showing leakage current ILeak through the parasitic diode.]

PwrLk = ILeak×VoltSup

\[ \text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)} \]


6.2.4 Glossary

ClockSpeed
  def: clock speed
  aka: $f$

ActFact
  def: activity factor
  aka: $A$
  $= \frac{\text{NumTransitions}}{\text{NumSignals} \times \text{NumClockCycles}}$
  = Per signal: percentage of clock cycles when signal changes value.
  = Per clock cycle: percentage of signals that change value per clock cycle.
  Note: When measuring per circuit, sometimes approximate by looking only at flops, rather than every single signal.

TimeShort
  def: short circuit time
  aka: $\tau$
  = Time that both N and P transistors are turned on when signal changes value.

MaxClockSpeed
  def: maximum clock speed that an implementation technology can support
  aka: $f_{max}$
  $\propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$

VoltSup
  def: supply voltage
  aka: $V$

VoltThresh
  def: threshold voltage
  aka: $V_{th}$
  = voltage at which P transistors turn on

ILeak
  def: leakage current
  aka: $I_S$ (reverse bias saturation current)
  $\propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)}$

IShort
  def: short circuit current
  aka: $I_{short}$
  = Current that goes through transistor network while both N and P transistors are turned on.

CapLoad
  def: load capacitance
  aka: $C_L$

PwrSw
  def: switching power (dynamic)
  $= \tfrac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$

PwrShort
  def: short-circuit power (dynamic)
  $= \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$

PwrLk
  def: leakage power (static)
  $= \text{ILeak} \times \text{VoltSup}$

Power
  def: total power
  $= \text{PwrSw} + \text{PwrShort} + \text{PwrLk}$

q
  def: electron charge
  $= 1.60218 \times 10^{-19}$ C

k
  def: Boltzmann’s constant
  $= 1.38066 \times 10^{-23}$ J/K

T
  def: temperature in Kelvin

6.2.5 Note on Power Equations

The power equation:

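\[
\begin{aligned}
\text{Power} &= \text{DynamicPower} + \text{StaticPower} \\
&= \text{PwrSw} + \text{PwrShort} + \text{PwrLk} \\
&= (\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\text{CapLoad} \times \text{VoltSup}^2) \\
&\quad + (\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}) \\
&\quad + (\text{ILeak} \times \text{VoltSup})
\end{aligned}
\]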

is for an individual signal.

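To calculate dynamic power for $n$ signals with different CapLoad, TimeShort, and IShort:

\[
\begin{aligned}
\text{DynamicPower} &= \left(\sum_{i=1}^{n} \text{ActFact}_i \times \tfrac{1}{2}\text{CapLoad}_i \times \text{ClockSpeed} \times \text{VoltSup}^2\right) \\
&\quad + \left(\sum_{i=1}^{n} \text{ActFact}_i \times \text{ClockSpeed} \times \text{TimeShort}_i \times \text{IShort}_i \times \text{VoltSup}\right)
\end{aligned}
\]

If we know the average CapLoad, TimeShort, and IShort for a collection of $n$ signals, then the above formula simplifies to:

\[
\begin{aligned}
\text{DynamicPower} &= (n \times \text{ActFact}_{AVG} \times \tfrac{1}{2}\text{CapLoad}_{AVG} \times \text{ClockSpeed} \times \text{VoltSup}^2) \\
&\quad + (n \times \text{ActFact}_{AVG} \times \text{ClockSpeed} \times \text{TimeShort}_{AVG} \times \text{IShort}_{AVG} \times \text{VoltSup})
\end{aligned}
\]

If capacitances and short-circuit parameters don’t have an even distribution, then don’t average them. If high-capacitance signals have high activity factors, then averaging the equations will result in erroneously low predictions for power.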
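The warning about averaging can be illustrated with a small Python sketch. It covers only the switching term, and the two signals are hypothetical values picked so that the high-capacitance signal also has the high activity factor.

```python
# Sketch: per-signal sum versus averaged formula (switching term only).
def exact_switching_power(signals, clock_speed, volt_sup):
    """signals: list of (act_fact, cap_load) pairs, one pair per signal."""
    return sum(0.5 * a * c for a, c in signals) * clock_speed * volt_sup ** 2

def averaged_switching_power(signals, clock_speed, volt_sup):
    """Same quantity estimated from the average ActFact and CapLoad."""
    n = len(signals)
    act_avg = sum(a for a, _ in signals) / n
    cap_avg = sum(c for _, c in signals) / n
    return n * act_avg * 0.5 * cap_avg * clock_speed * volt_sup ** 2

# Hypothetical data: one busy, heavily loaded signal and one quiet, light one.
signals = [(0.9, 10.0), (0.1, 1.0)]
exact = exact_switching_power(signals, clock_speed=1.0, volt_sup=1.0)
averaged = averaged_switching_power(signals, clock_speed=1.0, volt_sup=1.0)
print(exact, averaged)
```

With these numbers the averaged formula predicts noticeably less power than the per-signal sum, which is the error the text warns about.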


6.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.

analog

Parameters to work with:

capacitance: for example, Silicon on Insulator (SOI)
resistance: for example, copper wires
voltage: low-voltage circuits

Techniques:

dual-VDD: Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.

dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of design (can switch more quickly, but more leakage power), transistors with high threshold voltage for remainder of circuit (switches more slowly, but reduces leakage power).

exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power

adiabatic circuits: Special circuitry that consumes power on 0→1 transitions, but not 1→0 transitions. These sacrifice performance for reduced power.

clock trees: Up to 30% of total power can be consumed in clock generation and clock tree

digital

Parameters to work with:

capacitance (number of gates)
activity factor
clock frequency

Techniques:

multiple clocks: Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit

clock gating: Turn off clock to portions of a chip when it’s not being used

data encoding: Gray coding vs one-hot vs fully encoded vs ...

glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.

asynchronous circuits: Get rid of clocks altogether.

...

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/~celiac/lowpower.html


6.4 Voltage Reduction for Power Reduction

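If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

\[
\begin{aligned}
\text{Power} &= (\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\text{CapLoad} \times \text{VoltSup}^2) \\
&\quad + (\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}) \\
&\quad + (\text{ILeak} \times \text{VoltSup})
\end{aligned}
\]

we observe:

\[ \text{Power} \propto \text{VoltSup}^2 \]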

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit.

In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From $V = IR$, increasing $V$ causes an increase in $I$, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.

\[ \text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}} \]

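Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.

Answer:

$d$     20 ns   current delay along critical path
$d'$    ??      new delay along critical path
$V$     2.8 V   current supply voltage
$V'$    2.2 V   new supply voltage
$V_t$   0.7 V   threshold voltage

\[ \text{MaxClockSpeed} \propto \frac{1}{d} \]

\[ d \propto \frac{V}{(V - V_t)^2} \]

\[
\begin{aligned}
\frac{d'}{d} &= \frac{(V - V_t)^2}{V} \times \frac{V'}{(V' - V_t)^2} \\
d' &= d \times \frac{(V - V_t)^2}{V} \times \frac{V'}{(V' - V_t)^2} \\
&= 20\,\text{ns} \times \frac{(2.8\,\text{V} - 0.7\,\text{V})^2}{2.8\,\text{V}} \times \frac{2.2\,\text{V}}{(2.2\,\text{V} - 0.7\,\text{V})^2} \\
&= 31\,\text{ns}
\end{aligned}
\]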
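The delay scaling can be sketched in Python as follows; the function name is illustrative. It simply applies the proportionality between delay and V / (V − Vt)² to the old and new supply voltages.

```python
# Sketch: scale a critical-path delay when the supply voltage changes,
# using delay proportional to V / (V - Vt)^2.
def scaled_delay(delay, v_old, v_new, v_thresh):
    """Return the new delay after changing the supply from v_old to v_new."""
    old_factor = v_old / (v_old - v_thresh) ** 2
    new_factor = v_new / (v_new - v_thresh) ** 2
    return delay * new_factor / old_factor

new_delay_ns = scaled_delay(20.0, v_old=2.8, v_new=2.2, v_thresh=0.7)
print(round(new_delay_ns, 1))
```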

Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:

\[ \text{ILeak} \propto e^{\left(\frac{-q \times \text{VoltThresh}}{k \times T}\right)} \]

And increasing the leakage current increases the power:

\[ \text{Power} \propto \text{ILeak} \]

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.

6.5 Data Encoding for Power Reduction

6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor.

The most common example is “Gray coding” where exactly one bit changes value each clock cycle when counting.


Decimal  Gray  Binary
0        0000  0000
1        0001  0001
2        0011  0010
3        0010  0011
4        0110  0100
5        0111  0101
6        0101  0110
7        0100  0111
8        1100  1000
9        1101  1001
10       1111  1010
11       1110  1011
12       1010  1100
13       1011  1101
14       1001  1110
15       1000  1111

Two ways to understand the pattern for Gray-code counting. Both methods are based on noting when a bit in the Gray code toggles from 0 to 1 or 1 to 0.

• To convert from binary to Gray, a bit in the Gray code toggles whenever the corresponding bit in the binary code goes from 0 to 1. (US Patent 4618849 issued in 1984).

• To implement a Gray code counter from scratch, number the bits from 1 to n, with a special less-than-least-significant bit q_0. The output of the counter will be q_n ... q_1.

1. Create a flop that toggles in each clock cycle: q_0 <= not q_0

2. Bit 1 toggles whenever q_0 is 1.

3. For each bit i ∈ 2..n, the counter bit q_i toggles whenever q_(i−1) is 1 and all of the bits q_(i−2) ... q_0 are 0.

4. This behaviour can be implemented in a ripple-carry style by introducing carry (c_i) and toggle (qt_i) signals for each bit.

q_0  <= not(q_0)                 reg asn
c_0  <= not(q_0)                 comb asn
c_i  <= c_(i−1) and not(q_i)     comb asn
qt_i <= q_(i−1) and c_(i−2)      comb asn

We create a toggle flip-flop by xoring the output of a D-flop with its toggle signal:

q_i  <= q_i xor qt_i             reg asn
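The Gray column of the table can be reproduced in software with the standard binary-reflected conversion g = b xor (b >> 1). The Python sketch below uses that conversion (it is not a model of the ripple-carry counter described above) and checks that consecutive codes, including the wrap from 15 back to 0, differ in exactly one bit.

```python
# Sketch: binary-reflected Gray code via g = b xor (b >> 1).
def binary_to_gray(b):
    """Convert a non-negative integer to its Gray-code equivalent."""
    return b ^ (b >> 1)

def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

codes = [binary_to_gray(i) for i in range(16)]
for i, g in enumerate(codes):
    print(f"{i:2d}  {g:04b}  {i:04b}")

# One bit toggles per count, including the wrap-around from 15 to 0.
single_bit_steps = all(
    bits_changed(codes[i], codes[(i + 1) % 16]) == 1 for i in range(16)
)
```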

Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?

Answer:

Power consumption is dependent on area and activity factor. The original purpose of this problem was to focus on activity factor. The problem was created under the mistaken assumption that a Gray code counter and a binary code counter will both use the same area (1 FPGA cell per bit) and so the power difference comes from the difference in activity factors. This mistake is addressed at the end of the solution.

For Gray coding, exactly one bit toggles in each clock cycle. Thus, the activity factor for an n-bit Gray counter will be $\frac{1}{n}$.

For binary coding, the least significant bit toggles in every clock cycle, so it has an activity factor of 1. The 2nd least-significant bit toggles in every other clock cycle, so it has an activity factor of $\frac{1}{2}$. We study the other bits and try to find a pattern based on the bit position, $i$, where $i = 0$ for the least-significant bit and $n-1$ for the most significant bit of an n-bit counter. We see that for bit $i$, the activity factor is $\frac{1}{2^i}$.

For an n-bit binary counter, the average activity factor is the sum of the activity factors for the signals over the number of signals:

\[
\begin{aligned}
\text{BinaryActFact} &= \frac{\frac{1}{2^0} + \frac{1}{2^1} + \frac{1}{2^2} + \cdots + \frac{1}{2^{n-1}}}{n} \\
&= \frac{1}{n} \times \sum_{i=0}^{n-1} 2^{-i}
\end{aligned}
\]

The limit of the summation term as n goes to infinity is 2. We can see this as an instance of Zeno’s paradox, in that with each step we halve the distance to 2.

\[
\begin{aligned}
\text{BinaryActFact} &\approx \frac{1}{n} \times 2 \\
&\approx \frac{2}{n}
\end{aligned}
\]

Find the ratio of the binary activity factor to the Gray-code activity factor.

\[
\begin{aligned}
\frac{\text{BinaryActFact}}{\text{GrayActFact}} &= \frac{2}{n} \times \frac{n}{1} \\
&= 2
\end{aligned}
\]

In reality, the ripple-carry Gray code counter will always have two transitions per clock cycle: one for the q0 toggle flop and one for the actual signal in the counter that toggles. Thus the Gray code counter will consume more power than the binary counter. The overall power reduction comes from the circuit that uses the Gray code.
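The approximation used in the answer can be checked numerically. The Python sketch below evaluates the per-bit activity factors for an eight-bit counter without taking the limit; it counts only the counter output bits (not the extra q0 toggle flop), and the function names are illustrative.

```python
# Sketch: average activity factor of n-bit binary and Gray counters,
# counting only the counter output bits.
def binary_counter_act_fact(n):
    """Bit i of a binary counter toggles once every 2**i clock cycles."""
    return sum(2.0 ** -i for i in range(n)) / n

def gray_counter_act_fact(n):
    """Exactly one of the n bits toggles in each clock cycle."""
    return 1.0 / n

n = 8
ratio = binary_counter_act_fact(n) / gray_counter_act_fact(n)
print(binary_counter_act_fact(n), gray_counter_act_fact(n), ratio)
```

For n = 8 the ratio is slightly below 2, consistent with the limit argument in the text.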


Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

Answer:

If the data is completely random, then the Gray code loses its feature that consecutive data will differ in only one bit position. In fact, the activity factor for Gray code and binary code will be the same. There will not be any power saving by using Gray code. A binary circuit will consume the same power as a Gray-code circuit.

On average, half of the bits will be 1 and half will be 0. For each bit, there are four possible transitions: 0→0, 0→1, 1→0, and 1→1. Of these four transitions, two cause a change in value and two do not. Half of the transitions result in a change in value, therefore for random data the activity factor will be 0.5, independent of data encoding or the number of bits.
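The claim that random data gives an activity factor of 0.5 for either encoding can be confirmed by brute force. The Python sketch below enumerates every ordered pair of eight-bit values, treating each pair as equally likely consecutive data; the Gray encoding used is the standard b xor (b >> 1) conversion, which is an assumption about the encoding rather than something specified in the question.

```python
# Sketch: activity factor for uniformly random consecutive data values.
def identity(b):
    """Plain binary encoding."""
    return b

def binary_to_gray(b):
    """Standard binary-reflected Gray encoding."""
    return b ^ (b >> 1)

def random_data_act_fact(encode, n_bits=8):
    """Average fraction of bits that toggle between two random values."""
    values = [encode(v) for v in range(2 ** n_bits)]
    toggles = sum(bin(a ^ b).count("1") for a in values for b in values)
    return toggles / (len(values) ** 2 * n_bits)

binary_af = random_data_act_fact(identity)
gray_af = random_data_act_fact(binary_to_gray)
print(binary_af, gray_af)
```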

6.5.2 Example Problem: Sixteen Pulser

6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is ’0’ for 15 clock cycles, then ’1’ for one cycle, then repeat with 15 cycles of ’0’ followed by a ’1’, etc.)

[Figure: Required behaviour. Waveforms of clk and done over clock cycles 1 to 33.]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?


6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: FPGA cell containing a PLA.]

1. You may neglect power associated with clocks.

2. You may assume that all counters:

(a) are implemented on the same fabrication process

(b) run at the same clock speed

(c) have negligible leakage and short-circuit currents

6.5.2.3 Answer

Outline of Thinking

Factors to consider that distinguish the options: capacitance and activity factor.

Capacitance is dependent upon the number of signals, and whether a signal is combinational or a flop.

Sketch out the circuitry to evaluate capacitance.

Sketch the Circuitry

Name the output “done” and the count digits “d()”.


[Figure: Block diagram for Gray and Binary counters. Four PLA cells produce d(0), d(1), d(2), d(3), and one PLA produces done.]

[Figure: Block diagram for One-Hot. A chain of cells d(0), d(1), ..., d(15), with done taken from the chain.]

Observation:

The Gray and Binary counters have the same design, and the Gray counter will have the lower activity factor. Therefore, the Gray counter will have lower power than the Binary counter.

However, we don’t know how much lower the power of the Gray counter will be, and we don’t know how much power the One-Hot counter will consume.

Capacitance

                        cap   number   subtotal cap
Gray    d()   PLAs       2      4          8
              Flops      1      4          4
        done  PLAs       2      1          2
              Flops      1      0          0
1-Hot   d()   PLAs       2      0          0
              Flops      1     16         16
        done  PLAs       2      0          0
              Flops      1      0          0
Binary  d()   PLAs       2      4          8
              Flops      1      4          4
        done  PLAs       2      1          2
              Flops      1      0          0

Activity Factors

[Figure: waveforms of clk, d(), and done for each encoding, annotated with the number of transitions per 16 clock cycles.]

Gray coding:     d(0) 8/16,  d(1) 4/16, d(2) 2/16, d(3) 2/16, done 2/16
One-hot coding:  each d(i) 2/16, done 2/16
Binary coding:   d(0) 16/16, d(1) 8/16, d(2) 4/16, d(3) 2/16, done 2/16


                        act fact
Gray    d()   PLAs      1/4 signals in each clock cycle
              Flops     1/4 signals in each clock cycle
        done  PLAs      2 transitions / 16 clock cycles
              Flops     -
1-Hot   d()   PLAs      -
              Flops     2 transitions / 16 clock cycles
        done  PLAs      -
              Flops     -
Binary  d()   PLAs      (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
              Flops     (16 + 8 + 4 + 2 transitions) / (4 signals × 16 clock cycles) = 0.47
        done  PLAs      2 transitions / 16 clock cycles
              Flops     -

Note: Activity factor for One-Hot counter. Because all signals have the same capacitance, and all clock cycles have the same number of transitions for the One-Hot counter, we could have calculated the activity factor as two transitions per sixteen signals.


Putting it all Together

                        subtotal cap   act fact   power
Gray    d()   PLAs           8           1/4       2
              Flops          4           1/4       1
        done  PLAs           2           2/16      4/16
              Flops          0           -         0
        Total                                      3.25
1-Hot   d()   PLAs           0           -         0
              Flops         16           2/16      2
        done  PLAs           0           -         0
              Flops          0           -         0
        Total                                      2
Binary  d()   PLAs           8           0.47      3.76
              Flops          4           0.47      1.88
        done  PLAs           2           2/16      0.25
              Flops          0           -         0
        Total                                      5.87

If we choose Binary counting as the baseline, then the relative amounts of power (from the totals 3.25, 2, and 5.87) are:

Gray      55%
One-Hot   34%
Binary   100%

If we choose One-Hot counting as the baseline, then the relative amounts of power are:

Gray     163%
One-Hot  100%
Binary   294%
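The totals in the table can be recomputed with a short Python sketch. Each design is described by (subtotal capacitance, activity factor) pairs taken from the table above, with the binary activity factor kept as the unrounded 30/64; the dictionary layout and names are illustrative. Relative power is then each total divided by the total of whichever design is chosen as the baseline.

```python
# Sketch: relative power of the three counters, as the sum of
# capacitance * activity factor over each design's rows.
designs = {
    "Gray":    [(8, 1 / 4), (4, 1 / 4), (2, 2 / 16)],      # d() PLAs, d() flops, done PLA
    "One-Hot": [(16, 2 / 16)],                              # d() flops only
    "Binary":  [(8, 30 / 64), (4, 30 / 64), (2, 2 / 16)],   # d() PLAs, d() flops, done PLA
}

totals = {name: sum(cap * act for cap, act in rows)
          for name, rows in designs.items()}

def relative_to(baseline):
    """Power of every design as a fraction of the baseline design."""
    return {name: total / totals[baseline] for name, total in totals.items()}

print(totals)
print(relative_to("Binary"))
print(relative_to("One-Hot"))
```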

6.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn’t needed. This reduces the activity factor.

6.6.1 Introduction to Clock Gating

Examples of Clock Gating

Condition                                            Circuitry turned off
O/S in standby mode                                  Everything except “core” state (PC, registers, caches, etc)
No floating point instructions for k clock cycles    Floating point circuitry
Instruction cache miss                               Instruction decode circuitry
No instruction in pipe stage i                       Pipe stage i


Design Tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . .

+ Can significantly reduce activity factor (Synopsys PowerCompiler claims that it can cut power to 50–80% of the ungated level)

− Increases design complexity

• design effort

• bugs!

− Increases area

− Increases clock skew

Functional Validation and Clock Gating

It’s a functional bug to turn a clock off when it’s needed for valid data.

It’s functionally ok, but wasteful to turn a clock on when it’s not needed.

(About 5% of the bugs caught on Willamette (Intel Pentium 4 Processor) were related to clock gating.) Nicolas Mokhoff. EE Times. June 27, 2001. http://www.edtn.com/story/OEG20010621S0080

6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn’t needed.

[Figure: Without clock gating. Circuit with i_data, i_valid, clk, o_data, and o_valid.]

[Figure: With clock gating. A Clock Enable State Machine, with signals clk, i_wakeup, clk_en, and cool_clk, added to the circuit with i_data, i_valid, o_data, and o_valid.]


The total power of a circuit with clock gating is the sum of the power of the main circuit with a reduced activity factor and the power of the clock gating state machine with its activity factor. The clock-gating state machine must always be on, so that it will detect the wakeup signal. Do not make the mistake of gating the clock to your clock gating circuit!

6.6.3 Design Process

Design Decisions

• What level of granularity for gated clocks?

– entire module?

– individual pipe stages?

– something in between?

• When should the clocks turn off?

• When should the clocks turn on?

• Protocol for incoming wakeup signal?

• Protocol for outgoing wakeup signal?

Wakeup Protocol

Designers negotiate incoming and outgoing wakeup protocol with environment.

An example wakeup protocol:

• wakeup_in will arrive 1 clock cycle before valid data

• wakeup_in will stay high until have at least 3 cycles of invalid data

Design Strategy

When designing clock gating circuitry, consider the two extreme cases:

• a constant stream of valid data

• circuit is turned off and receives a single parcel of valid data

For a constant stream of valid data, the key is to not incur a large overhead in design complexity,area, or clock period when clocks will always be toggling.

For a single parcel of valid data, the key is to make sure that the clocks are toggling so that data can percolate through circuit. Also, we want to turn off the clock as soon as possible after data leaves.


6.6.4 Effectiveness of Clock Gating

We can measure the effectiveness of clock gating by comparing the percentage of clock cycles when the clock is not toggling to the percentage of clock cycles that the circuit does not have valid data (i.e. the clock does not need to toggle).

The most ineffective clock gating scheme is to never turn off the clock (let the clock always toggle). The most effective clock gating scheme is to turn off the clock whenever the circuit is not processing valid data.

Parameters to characterize effectiveness of clock gating:

Eff      = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
PctClk   = percentage of clock cycles that clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off. Equation for effectiveness of clock gating:

\[
\begin{aligned}
\text{Eff} &= \frac{\text{PctClkOff}}{\text{PctInvalid}} \\
&= \frac{1 - \text{PctClk}}{1 - \text{PctValid}}
\end{aligned}
\]

Question: What is the effectiveness if the clock toggles only when there is valid data?

Answer:

PctClk = PctValid and the effectiveness should be 1:

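\[
\begin{aligned}
\text{Eff} &= \frac{1 - \text{PctClk}}{1 - \text{PctValid}} \\
&= \frac{1 - \text{PctValid}}{1 - \text{PctValid}} \\
&= 1
\end{aligned}
\]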

Question: What is the effectiveness of a clock that always toggles?


Answer:

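If the clock is always toggling, then PctClk = 100% and the effectiveness should be 0.

\[
\begin{aligned}
\text{Eff} &= \frac{1 - \text{PctClk}}{1 - \text{PctValid}} \\
&= \frac{1 - 1}{1 - \text{PctValid}} \\
&= 0
\end{aligned}
\]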

Question: What does it mean for a clock gating scheme to be 75% effective?

Answer: 75% of the time that there is invalid data, the clock is off.

Question: What happens if PctClk < PctValid?

Answer: If PctClk < PctValid, then:

\[ 1 - \text{PctClk} > 1 - \text{PctValid} \]

so, effectiveness will be greater than 100%.

In some sense, it makes sense that the answer would be nonsense, because a clock gating scheme that is more than 100% effective is too effective: it is turning off the clock sometimes when it shouldn’t!

We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Figure: plot of the new activity factor A′ against Eff, falling linearly from A at Eff = 0 to PctValid × A at Eff = 1.]

When the effectiveness is zero, the new activity factor is the same as the original activity factor. For a 100% effective clock gating scheme, the activity factor is A × PctValid. Between 0% and 100% effectiveness, the activity factor decreases linearly.

The new activity factor with a clock gating scheme is:

A′ = A− (1−PctValid)×Eff ×A
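The two formulas in this section can be written as small Python functions and evaluated at the endpoints discussed above. Percentages are expressed as fractions between 0 and 1, and the function names are illustrative.

```python
# Sketch: effectiveness of clock gating and the resulting activity factor.
def effectiveness(pct_clk, pct_valid):
    """Eff = (1 - PctClk) / (1 - PctValid), with percentages as fractions."""
    return (1 - pct_clk) / (1 - pct_valid)

def gated_activity_factor(act_fact, pct_valid, eff):
    """A' = A - (1 - PctValid) * Eff * A."""
    return act_fact - (1 - pct_valid) * eff * act_fact

# Clock toggles only when there is valid data: fully effective.
print(effectiveness(pct_clk=0.7, pct_valid=0.7))
# Clock always toggles: completely ineffective.
print(effectiveness(pct_clk=1.0, pct_valid=0.7))
# Endpoints of the linear relationship between Eff and A'.
print(gated_activity_factor(1.0, 0.7, eff=0.0),
      gated_activity_factor(1.0, 0.7, eff=1.0))
```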


6.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?

• 70% of the time the main circuit has valid data

• clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)

• clock gating circuit has 10% of the area of the main circuit

• clock gating circuit has same activity factor as main circuit

• neglect short-circuiting and leakage power

Answer:

1. Set up main equations


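PwrMain   = power for main circuit without clock gating
Pwr′Main  = power for main circuit with clock gating
PwrClkFsm = power for clock enable state machine

\[ \text{PwrTot} = \text{PwrMain} + \text{PwrClkFsm} \]

\[ \text{Pwr} = \text{PwrSw} + \text{PwrLk} + \text{PwrShort} \]

\[ \text{PwrSw} = \tfrac{1}{2} \times A \times C \times V^2 \]

PwrLk    = negligible
PwrShort = negligible

\[ \text{Pwr} = \tfrac{1}{2} \times A \times C \times V^2 \]

\[ \text{PwrTot} = \left(\tfrac{1}{2} \times A_{Main} \times C_{Main} \times V^2\right) + \left(\tfrac{1}{2} \times A_{ClkFsm} \times C_{ClkFsm} \times V^2\right) \]

\[
\begin{aligned}
A_{Main} &= A & C_{Main} &= C \\
A_{ClkFsm} &= A & C_{ClkFsm} &= 0.1C \\
A'_{Main} &= A' & A'_{ClkFsm} &= A
\end{aligned}
\]

\[
\begin{aligned}
\frac{\text{Pwr}'_{Tot}}{\text{PwrTot}} &= \frac{\left(\tfrac{1}{2} \times A' \times C \times V^2\right) + \left(\tfrac{1}{2} \times A \times 0.1C \times V^2\right)}{\tfrac{1}{2} \times A \times C \times V^2} \\
&= \frac{A' + 0.1A}{A}
\end{aligned}
\]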

2. Find new activity factor for main circuit (A′):

\[
\begin{aligned}
A' &= (1 - \text{Eff}(1 - \text{PctValid})) \times A \\
&= (1 - 0.9(1 - 0.7)) \times A \\
&= 0.73A
\end{aligned}
\]

3. Find ratio of new total power to previous total power:

\[
\begin{aligned}
\frac{\text{Pwr}'_{Tot}}{\text{PwrTot}} &= \frac{A' + 0.1A}{A} \\
&= \frac{0.73A + 0.1A}{A} \\
&= 0.83
\end{aligned}
\]

4. Final answer: new power is 83% of original power
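The same result can be reproduced with a few lines of Python. This sketch assumes, as the question states, that only switching power matters and that the clock-enable state machine has the same activity factor as the ungated main circuit; the function name is illustrative.

```python
# Sketch: ratio of gated power to ungated power for the example above.
def gated_power_ratio(pct_valid, eff, fsm_cap_fraction):
    """(A' + fsm_cap_fraction * A) / A, with A' = (1 - Eff * (1 - PctValid)) * A."""
    main_ratio = 1 - eff * (1 - pct_valid)   # A' / A for the main circuit
    return main_ratio + fsm_cap_fraction     # plus the always-on state machine

ratio = gated_power_ratio(pct_valid=0.7, eff=0.9, fsm_cap_fraction=0.1)
print(round(ratio, 2))
```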

6.6.6 Clock Gating with Valid-Bit Protocol

A common technique to determine when a circuit has valid data is to use a valid-bit protocol. In section 6.6.6.1 we review the valid-bit protocol and then in section 6.6.6.3 we add clock-gating circuitry to a circuit that uses the valid-bit protocol.

6.6.6.1 Valid-Bit Protocol

Need a mechanism to tell the circuit when to pay attention to data inputs, e.g. when is it supposed to decode and execute an instruction, or write data to a memory array?

[Figure: circuit with inputs clk, i_valid, i_data and outputs o_data, o_valid.]

[Figure: waveforms of clk, i_valid, i_data, o_data, and o_valid, with parcels α, β, γ appearing on i_data and later on o_data.]

i_valid: high when i_data has valid data; signifies whether circuit should pay attention to or ignore data.

o_valid: high when o_data has valid data; signifies whether environment should pay attention to output of circuit.

For more on circuit protocols, see section 2.12.


Microscopic Analysis

Which clock edges are needed?

[Figure: two waveform diagrams of clk, i_valid, and o_valid.]

6.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?

[Figure: waveforms of i_valid, o_valid, and clk_en, marking Latency, NumPcls, and NumClkEn.]

t_i1     time of first i_valid
t_o1     time of first o_valid
t_ik     time of last i_valid
t_ok     time of last o_valid
t_first  first clock cycle with clock enabled
t_last   last clock cycle with clock enabled

Initial equations to describe relationships between different points in time:

\[
\begin{aligned}
t_{o1} &= t_{i1} + \text{Lat} \\
t_{ok} &= t_{o1} + \text{NumPcls} - 1 \\
t_{first} &= t_{i1} + 1 \\
t_{last} &= t_{ok} + 1
\end{aligned}
\]

To understand the $-1$ in the equation for $t_{ok}$, examine the situation when NumPcls = 1. With just one parcel going through the system $t_{o1} = t_{i1} + \text{Lat}$, so we have: $t_{ok} = t_{o1} + 1 - 1$.

In the equation for $t_{last}$, we need the $+1$ to clear the last valid bit.

Solve for the length of time that the clock must be enabled. The $+1$ at the end of this equation is because if $t_{last} = t_{first}$, we would have the clock enabled for 1 clock cycle.


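\[
\begin{aligned}
\text{ClkEnLen} &= t_{last} - t_{first} + 1 \\
&= t_{ok} + 1 - (t_{i1} + 1) + 1 \\
&= t_{ok} - t_{i1} + 1 \\
&= t_{o1} + \text{NumPcls} - 1 - t_{i1} + 1 \\
&= t_{o1} + \text{NumPcls} - t_{i1} \\
&= t_{i1} + \text{Lat} + \text{NumPcls} - t_{i1} \\
&= \text{Lat} + \text{NumPcls}
\end{aligned}
\]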

We are left with the formula that the number of clock cycles that the module’s clock must beenabled is the latency through the module plus the number of consecutive parcels.
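The derivation can be sanity-checked by computing the time points directly. The Python sketch below follows the four timing equations step by step; the function name and the optional starting time are illustrative.

```python
# Sketch: number of cycles the clock must be enabled, computed from the
# individual time points rather than from the closed-form result.
def clk_en_len(lat, num_pcls, t_i1=0):
    """Cycles with the clock enabled for num_pcls consecutive valid parcels."""
    t_o1 = t_i1 + lat              # first valid output
    t_ok = t_o1 + num_pcls - 1     # last valid output
    t_first = t_i1 + 1             # first cycle with clock enabled
    t_last = t_ok + 1              # extra cycle clears the last valid bit
    return t_last - t_first + 1

for lat, num_pcls in [(5, 1), (10, 1), (10, 80)]:
    print(lat, num_pcls, clk_en_len(lat, num_pcls))
```

In every case the result equals Lat + NumPcls, independent of when the first parcel arrives.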

6.6.6.3 Adding Clock-Gating Circuitry

Before Clock Gating

[Figure: circuit with data_in, valid_in, clk, data_out, and valid_out.]

[Figure: waveforms of clk, data_in (α, β, γ, δ), valid_in, data_out (α, β, γ), and valid_out, with don’t-care and uninitialized regions marked.]

After Clock Gating: Circuitry

[Figure: circuit with a Clock Enable State Machine. Signals: data_in, valid_in, wakeup_in, hot_clk, clk_en, cool_clk, data_out, valid_out, wakeup_out.]

• hot_clk: clock that always toggles

• cool_clk: gated clock; sometimes toggles, sometimes stays low

• wakeup: alerts circuit that valid data will be arriving soon

• clk_en: turns on cool_clk


After Clock Gating: New Signals

[Figure: waveform of hot_clk, wakeup_in, data_in (parcels α, β, γ, δ), valid_in, clk_en, cool_clk, wakeup_out, data_out (parcels α, β, γ), and valid_out]

6.6.7 Example: Pipelined Circuit with Clock-Gating

Design a “clock enable state machine” for the pipelined component described below.

• capacitance of pipelined component = 200

• latency varies from 5 to 10 clock cycles, even distribution of latencies

• contains a maximum of 6 instructions (parcels of data).

• 60% of incoming parcels are valid

• average length of continuous sequence of valid parcels is 80

• use input and output valid bits for wakeup

• leakage current is negligible

• short-circuit current is negligible

• LUTs have a capacitance of 1, flops have a capacitance of 2

The two factors affecting power are activity factor and capacitance.

1. Scenario: turned off and get one parcel.

(a) Need to turn on and stay on until parcel departs

(b) idea #1 (parcel count):
• count number of parcels inside module
• keep clocks toggling if have non-zero parcels

(c) idea #2 (cycle count):
• count number of clock cycles since last valid parcel entered module
• once hit 10 clock cycles without any valid parcels entering, know that all parcels have exited
• keep clocks toggling if counter is less than 10

2. Scenario: constant stream of parcels

(a) parcel count would require looking at input and output stream and conditionally incrementing or decrementing counter

(b) cycle count would keep resetting counter

Waveforms

[Figure: waveforms over clock cycles 1 to 24 of i_valid, o_valid, parcel_count, and parcel_clk_en (parcel-count design) and of i_valid, o_valid, cycle_count, and cycle_clk_en (cycle-count design)]

Outline:

1. sketch out circuitry for parcel count and cycle count state machine

2. estimate capacitance of each state machine

3. estimate activity factor of main circuit, based on behaviour

Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter.

Counter must be able to increment and decrement.

Equations for counter action (increment/decrement/no-change):

i_valid  o_valid  action
0        0        no change
0        1        decrement
1        0        increment
1        1        no change
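The counter action table can be exercised with a small behavioural model. This is only a sketch: it assumes a fixed latency so that o_valid is simply i_valid delayed, whereas the component in this example has a latency that varies from 5 to 10 cycles. The names `next_parcel_count` and `simulate` are hypothetical.

```python
def next_parcel_count(count, i_valid, o_valid):
    """Next value of the parcel counter, following the action table."""
    if i_valid and not o_valid:
        return count + 1      # a parcel entered and none left
    if o_valid and not i_valid:
        return count - 1      # a parcel left and none entered
    return count              # both or neither: no change


def simulate(i_valid_stream, latency):
    """Parcel count after each cycle, assuming a fixed latency (an assumption)."""
    padded = list(i_valid_stream) + [0] * latency   # idle cycles to drain
    counts, count = [], 0
    for t, i_valid in enumerate(padded):
        o_valid = padded[t - latency] if t >= latency else 0
        count = next_parcel_count(count, i_valid, o_valid)
        counts.append(count)
    return counts


counts = simulate([1, 1, 0, 1], latency=5)
clk_en = [c != 0 for c in counts]   # keep clocks toggling while parcels remain
```

In this model the clock enable drops only once the count returns to zero, which matches the parcel-count idea of keeping the clock toggling while the module holds any parcels.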

6.7 Power Problems

P6.1 Short Answers

P6.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?

P6.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?

P6.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?

P6.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?

P6.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1 ns. With their fabrication tweaks, they can decrease this to 0.85 ns.

P6.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)


P6.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

P6.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their products. This wide variety of different metrics has confused them.

Explain whether each metric is a reasonable metric for customers to use when choosing a system.

If the metric is reasonable, say whether “bigger is better” (e.g. 500 MIPs/mm² is better than 20 MIPs/mm²) or “smaller is better” (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which one type of product (cell phone, desktop computer, or compute server) the metric is most relevant to.

• MIPs/mm²

• MIPs/Watt

• Watts/cm³

P6.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases.

The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:


$$\mathit{MaxClockSpeed} \propto \frac{(\mathit{VoltSup} - \mathit{VoltThresh})^2}{\mathit{VoltSup}}$$

Where VoltSup is supply voltage and VoltThresh is threshold voltage.

With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?

P6.5 Clock Speed Increase Without Power Increase

The following are given:

• You need to increase the clock speed of a chip by 10%

• You must not increase its dynamic power consumption

• The only design parameter you can change is supply voltage

• Assume that short-circuiting current is negligible

P6.5.1 Supply Voltage

How much do you need to decrease the supply voltage by to achieve this goal?


What problems will you encounter if you continue to decrease the supply voltage?

P6.6 Power Reduction Strategies

In each low power approach described below, identify which component(s) of the power equation is (are) being minimized and/or maximized:


Designers scaled down the supply voltage of their ASIC

P6.6.2 Transistor Sizing

The transistors were made larger.

P6.6.3 Adding Registers to Inputs

All inputs to functional units are registered

P6.6.4 Gray Coding

Gray coding of signals is used for address signals.


P6.7 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip.

The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted.

The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip.

The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.

P6.7.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.

P6.7.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.

P6.7.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.

Chapter 7

Fault Testing and Testability

7.1 Faults and Testing

7.1.1 Overview of Faults and Testing

7.1.1.1 Faults

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn’t.

[Figure: good wires, shorted wires, and an open wire]

7.1.1.2 Causes of Faults

• Fabrication process (initial construction is bad)
  – chemical mix
  – impurities
  – dust

• Manufacturing process (damage during construction)
  – handling
    ∗ probing
    ∗ cutting
    ∗ mounting
  – materials
    ∗ corrosion
    ∗ adhesion failure
    ∗ cracking
    ∗ peeling

7.1.1.3 Testing

Definition Testing: the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.

7.1.1.4 Burn In

Some chips that come off the manufacturing line will work for a short period of time and then fail.

Definition Burn-in: The process of subjecting chips to extreme conditions (high and low temperatures, high and low voltages, high and low clock speeds) before and during testing.

The purpose is to cause (and catch) failures in chips that would pass a normal test, but fail in early use by customers.

[Figure: a wire that is soon to break]

The hope is that the extreme conditions will cause chips to break that would otherwise have broken in the customer’s system soon after arrival.

The trick is to create conditions that are extreme enough that bad chips will break, but not so extreme as to cause good chips to break.

7.1.1.5 Bin Sorting

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labelled (binned) by the maximum clock frequency at which they will work reliably.

For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.

Overclocking is taking a chip rated at n MHz and running it at 1.x × n MHz. (Sure your computer often crashes and loses your assignment, but just think how much more productive you are when it is working...)

7.1.1.6 Testing Techniques

Scan Testing or Boundary Scan Testing (BST, JTAG)

• Load test vector from tester into chip

• Run chip on test data

• Unload result data from chip to tester

• Compare results from chip against those produced by simulation

• If results are different, then chip was not manufactured correctly

Built In Self Test (BIST)

• Build circuitry on chip that generates tests and compares actual and expected results

IDDQ Testing

• Measure the quiescent current between VDD and GND.

• Variations from expected values indicate faults.

Challenges

The challenges in testing:

• test circuitry consumes chip area

• test circuitry reduces performance

• decrease fault escapee rate of product that ships while having minimal impact on production cost and chip performance

• external tester can only look at I/O pins

• ratio of internal signals to I/O pins is increasing

• some faults will only manifest themselves at high-clock frequencies

“The crux of testing is to use yesterday’s technology to find faults in tomorrow’s chips.” Agilent engineer at ARVLSI 2001.

7.1.1.7 Design for Testability (DFT)

Scan testing and self-testing require adding extra circuitry to chips.

Design for test is the process of adding this circuitry in a disciplined and correct manner.

A hot area of research, which is becoming mainstream practice, is developing synthesis tools to automatically add the testing circuitry.


7.1.2 Example Problem: Economics of Testing

Given information:

• The ACHIP costs $10 without any testing

• Each board uses one ACHIP (plus lots of other chips that we don’t care about)

• 68% of the manufactured ACHIPS do not have any faults

• For the ACHIP, it costs $1 per chip to catch half of the faults

• Each 50% reduction in fault escapees doubles cost of testing (intuition: doubles number of tests that are run)

• If board-level testing detects a bad ACHIP, it costs $200 to replace the ACHIP

• Board-level testing will detect 100% of the faults in an ACHIP

Question: What escapee fault rate will minimize cost of the ACHIP?

Answer:

TotCost = NoTestCost + TestCost + EscapeeProb × ReplaceCost

NoTestCost  TestCost  EscapeeProb  EscapeeProb × ReplaceCost  TotCost
$10         $0        32%          200 × 0.32 = $64           $74
$10         $1        16%          200 × 0.16 = $32           $43
$10         $2        8%           200 × 0.08 = $16           $28
$10         $4        4%           200 × 0.04 = $8            $22
$10         $8        2%           200 × 0.02 = $4            $22
$10         $16       1%           200 × 0.01 = $2            $28
$10         $32       0.5%         200 × 0.005 = $1           $43

The lowest total cost is $22. There are two options with a total cost of $22: $4 of testing and $8 of testing. Economically, we can choose either option.
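The table can be regenerated with a few lines of code. The sketch below uses exact fractions to avoid floating-point rounding; the variable names and the choice to index rows by the number of times the escapee rate has been halved are assumptions made for illustration.

```python
from fractions import Fraction

NO_TEST_COST = 10                  # cost of the ACHIP without testing ($)
REPLACE_COST = 200                 # cost to replace a bad ACHIP on a board ($)
FAULT_PROB = Fraction(32, 100)     # 68% of chips are fault free


def total_cost(halvings):
    """Total cost per chip after halving the escapee rate `halvings` times."""
    test_cost = 0 if halvings == 0 else 2 ** (halvings - 1)   # $1, $2, $4, ...
    escapee_prob = FAULT_PROB / 2 ** halvings
    return NO_TEST_COST + test_cost + escapee_prob * REPLACE_COST


totals = [total_cost(k) for k in range(7)]
best = min(totals)
best_options = [k for k, cost in enumerate(totals) if cost == best]
```

Here `best_options` holds the row indices with the minimum total cost, which correspond to $4 and $8 of testing.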

For high-volume, small-area chips, testing can consume more than 50% of the total cost.

7.1.3 Physical Faults

7.1.3.1 Types of Physical Faults

[Figure: a good circuit with signals a, b, c, d, and the following bad circuits]

• open
• wired-AND bridging short
• wired-OR bridging short
• stronger-wins bridging short (b is stronger)
• short to VDD
• short to GND

7.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc is a potential fault location.

Different segments affect different gates in the fanout.

A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.

[Figure: Three different locations for potential faults on signal b, showing which gates in the fanout are OK and which are BAD for each location]


When working with faults, we work with wire segments, not signals. In the circuit below, there are 8 different wire segments (L1–L8). Each wire segment corresponds to a logically distinct fault location. All physical faults on a segment affect the same set of signals, so they are grouped together into a “logical fault”. If a signal has a fanout of 1, then there is one wire segment. A signal with a fanout of n, where n > 1, has at least n+1 wire segments: one for the source signal and one for each gate of fanout. As shown in section 7.1.3.3, depending on the layout, the circuit can have more than n+1 segments.

[Figure: circuit with inputs a, b, c and output z, with wire segments labelled L1 to L8]

7.1.3.3 Layout Affects Locations

[Figure: schematic with signals a to i, and two layouts of signal b fanning out to e, g, and h: one with segments L1 to L4 and one with segments L1 to L5]

For the signal b in the schematic above, we can have either four or five different locations for potential faults, depending upon how the circuit is laid out.

7.1.3.4 Naming Fault Locations

Two ways to name a fault location:

pin-fault model: Faults are modelled as occurring on input and output pins of gates.

net-fault model: Faults are modelled as occurring on segments of wires.

In E&CE 327, we’ll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

7.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.

To find a test vector that will detect a fault:

1. build Boolean equation (or Karnaugh map) of correct circuit

2. build Boolean equation (or Karnaugh map) of faulty circuit

3. compare equations (or Karnaugh maps); regions of difference represent test vectors that will detect fault

7.1.4.1 Which Test Vectors will Detect a Fault?

Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?

[Figure: good circuit and faulty circuit, each with inputs a, b, c and signals d, e]

Answer:

a b c   good  faulty
0 0 0   0     0
0 0 1   1     1
0 1 0   0     0
0 1 1   1     1
1 0 0   0     0
1 0 1   1     1
1 1 0   1     0   ←
1 1 1   1     1

The only test vector that will detect the fault in the circuit is 110.
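The comparison of the two truth-table columns is easy to automate. In the sketch below, the functions `good` and `faulty` are not taken from the circuit diagram; they are assumptions inferred from the output columns of the table (the good column matches ab + c and the faulty column matches c).

```python
from itertools import product


def good(a, b, c):
    return (a & b) | c      # assumed from the "good" column of the table


def faulty(a, b, c):
    return c                # assumed from the "faulty" column of the table


def differing_vectors(good_fn, faulty_fn, n_inputs=3):
    """All input vectors on which the two circuits produce different outputs."""
    return ["".join(map(str, v))
            for v in product((0, 1), repeat=n_inputs)
            if good_fn(*v) != faulty_fn(*v)]


vectors = differing_vectors(good, faulty)
```

Any vector in the returned list is a test vector for the fault; for this pair of functions the list contains only 110.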

Sometimes multiple test vectors will catch the same fault.

Sometimes a single test vector can catch multiple faults.

[Figure: another fault in the same circuit]

a b c   good  faulty
1 1 0   1     0   ←

The test vector 110 can catch both this fault and the previous one.

With testing, we are primarily concerned with determining whether a circuit works correctly or not, that is, detecting whether there is a fault. If the circuit has a fault, we usually do not care where the fault is, which would be diagnosing the fault. To detect the two faults above, the test vector 110 is sufficient, because if either of the two faults is present, 110 will detect that the circuit does not work correctly.

Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.

If we have a higher-than-expected failure rate for a chip, we might want to investigate the cause of the failures, and so would need to diagnose the faults. In this case, we might do more exhaustive analysis to see which test vectors pass and which fail. We might also need to examine the chip physically with probes to test a few individual wires or transistors. This is done by removing the top layers of the chip and using very small and very sensitive probes, analogous to how we use a multimeter to test a circuit on a breadboard.

7.1.5 Mathematical Models of Faults

Goal: develop reliable and predictable technique for detecting faults in circuits.

Observations:

• The possible faults in a circuit are dependent upon the physical layout of the circuit.

• A very wide variety of possible faults

• A single test vector can catch many different faults

Need: a mathematical model for faults that is abstracted from complexities of circuit layout and plethora of possible faults, yet still detects most or all possible faults.

7.1.5.1 Single Stuck-At Fault Model

Although there are many different bad behaviours that faults can lead to, the simple model of single stuck-at faults has proven very capable of finding real faults in real circuits.

Two simplifying assumptions:

1. A maximum of one fault per tested circuit (hence “single”)

2. All faults are either:

(a) stuck-at 1: short to VDD

(b) stuck-at 0: short to GND

hence, “stuck at”

Example of Stuck-At Faults

[Figure: circuit with signals a, b, c, d, and i, with fault locations labelled L1 to L12]

12 fault locations × 2 types of faults = 24 possible faults.

If we restrict ourselves to the single stuck-at fault model, then we have 24 faulty circuits to consider.

If multiple faults were allowed, then the circuit above could have up to 12 different faults. How many faulty circuits would need to be considered?

Each of the 12 locations has three possible values: good, stuck-at-1, stuck-at-0. Therefore, $3^{12} = 5.3 \times 10^5$ different circuits would need to be considered!

If we allowed multiple faults of 4 different types at 12 different locations, then we would have $5^{12} - 1 = 2.4 \times 10^8$ different faulty circuits to consider!

There are $2^{2^4} = 6.6 \times 10^4$ different Boolean functions of four inputs (a K-map of four variables is a grid of $2^4$ squares; each square is either 0 or 1, which gives $2^{2^4}$ different combinations). So there are $6.6 \times 10^4$ possible equations for circuits with four inputs and one output. This is much less than the number of faulty circuit models that would be generated by the simultaneous-faults-at-every-location models. So both of the simultaneous-faults-at-every-location models are too extreme.
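The counts quoted above are quick to reproduce. This snippet only restates the arithmetic; the variable names are illustrative.

```python
locations = 12

single_stuck_at = locations * 2            # one fault, stuck-at-0 or stuck-at-1
multi_stuck_at = 3 ** locations            # good, stuck-at-1, or stuck-at-0 per location
multi_four_types = 5 ** locations - 1      # good or 4 fault types, minus the all-good case
boolean_functions_4 = 2 ** (2 ** 4)        # one output bit per K-map square

# Both multiple-fault models exceed the number of distinct 4-input functions.
assert multi_stuck_at > boolean_functions_4
assert multi_four_types > boolean_functions_4
```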

7.1.6 Generate Test Vector to Find a Mathematical Fault

Faults are detected by stimulating circuits (real, manufactured circuits, not simulations!) with test vectors and checking that the real circuit gives the correct output.

Standard practice in testing is to test circuits for single stuck-at faults. Mathematics and empirical evidence demonstrate that if a circuit appears to be free of single stuck-at faults, then it is probably also free of other types of faults. That is, testing a circuit for single stuck-at faults will also detect many other types of faults and will often detect multiple faults.

7.1.6.1 Algorithm

1. compute Karnaugh map for correct circuit

2. compute Karnaugh map for faulty circuit

3. find region of disagreement


4. any assignment in region of disagreement is a test vector that will detect fault

5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit

7.1.6.2 Example of Finding a Test Vector

[Figure: good circuit and faulty circuit with inputs a, b, c and signals d, e, each with its Karnaugh map over ab and c, followed by the Karnaugh map of the difference between the good and faulty circuits]

7.1.7 Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant then all single stuck-at faults can be detected.

A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.

2. If not trying to find all of the faults in a circuit, then a fault that you aren’t looking for can mask a fault that you are looking for.

7.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.

Timing Hazards

[Figure: waveforms of a static hazard and a dynamic hazard]

Timing hazards are often removed by adding redundant circuitry.

Redundant Circuitry

[Figure: irredundant circuit with inputs a, b, c, internal signals d, e, f, and output g, annotated with signal values during an input transition; waveform illustrating the timing hazard on signals a to g]

The glitch on g is caused because the AND gate for e turns off before f turns on.

Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates.

In this sum-of-products style circuit, each AND gate corresponds to a cube in the Karnaugh map.

[Figure: Karnaugh map over ab and c showing the cubes of the irredundant circuit]

We can prevent this transition from causing a glitch by adding a cube that covers the two squares of the transition from 111 to 101. This cube is 1-1, which is the black cube in the Karnaugh map below and the signal h in the redundant circuit below.

[Figure: Karnaugh maps with the added cube; redundant circuit with inputs a, b, c, internal signals d, e, f, h, output g, and fault location L1 on h; waveform of signals a to h showing no more timing hazards]


Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.

L1@0 is undetectable.

Correct circuit: $ab + \overline{b}c$

Faulty circuit: $ab + \overline{b}c + ac$. With L1@0, $ac \rightarrow 0$, which gives $ab + \overline{b}c + 0 = ab + \overline{b}c$. This is the same equation as the correct circuit.

A stuck-at fault in redundant circuitry will not affect the steady state behaviour of the circuit, but could allow timing glitches to occur.
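The claim that L1@0 is undetectable can be checked by exhaustive simulation. The sketch below assumes that the two original cubes are ab and b'c (the complement on b is inferred from the 111 to 101 transition and the added cube 1-1, that is, ac) and that L1 is the output of the added AND gate h. The function name and the `h_stuck_at_0` flag are hypothetical.

```python
from itertools import product


def g(a, b, c, h_stuck_at_0=False):
    """Steady-state output of the redundant circuit ab + b'c + ac."""
    e = a & b                          # cube 11-
    f = (1 - b) & c                    # cube -01 (assumed complement on b)
    h = 0 if h_stuck_at_0 else a & c   # added cube 1-1, optionally stuck at 0
    return e | f | h


detecting = [v for v in product((0, 1), repeat=3)
             if g(*v) != g(*v, h_stuck_at_0=True)]
```

An empty `detecting` list means that no input vector distinguishes the faulty circuit from the good one in steady state, so the fault can only show up as a timing glitch.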

7.1.7.2 Curious Circuitry and Fault Detection

The two circuits below have the same steady-state behaviour.

[Figure: two circuits with inputs a, b, c and output z, the first with fault locations L1, L2, and L3, and their Karnaugh map over ab and c]

Because the two circuits have the same behaviour, it might appear that the leftmost two XOR gates are redundant. However, these gates are not redundant. In the test for redundancy, when we remove a gate, we delete it; we do not replace it with other circuitry.

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

fault   eqn
L2@0    a⊕(b⊕c)
L2@1    a⊕(b⊕c)

[Figure columns: Karnaugh map of each faulty circuit and its difference from the good circuit]

7.2 Test Generation

7.2.1 A Small Example

Throughout this section we will use the circuit below:

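[Figure: circuit with inputs a, b, c and output z = ab + bc, with fault locations L2, L4, and L5 marked, and its Karnaugh map over a and bc]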

At first, we will consider only the following faults: L2@1, L4@1, L5@1.

fault     eqn     test vectors
1) L2@1   a+c     101, 001, 100
2) L4@1   a+bc    101, 100
3) L5@1   ab+c    101, 001

[Figure columns: Karnaugh map of each faulty circuit and its difference from the good circuit]

Choose Test Vector

If we choose 101, we can detect all three faults. Choosing either 001 or 100 will miss one of the three faults.
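The test vectors in the table can be reproduced by simulating the circuit with a fault injected on one net. The netlist below is an assumption reconstructed from the fault equations in these notes: L1, L2, L3 are the inputs a, b, c; L4 and L5 are the two fanout branches of b; L6 and L7 are the AND gate outputs; L8 is the output z. The function names are hypothetical.

```python
from itertools import product


def z(a, b, c, fault=None):
    """Output of z = ab + bc with an optional stuck-at fault (net, value)."""
    def net(name, driven):
        # A stuck-at net ignores its driver and holds the fault value.
        return fault[1] if fault and fault[0] == name else driven
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)   # fanout branches of b
    l6 = net("L6", l1 & l4)                 # AND gate for ab
    l7 = net("L7", l5 & l3)                 # AND gate for bc
    return net("L8", l6 | l7)               # OR gate driving z


def detecting_vectors(fault):
    """Input vectors abc on which the faulty circuit differs from the good one."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, fault)}


vectors = {name: detecting_vectors(f) for name, f in
           {"L2@1": ("L2", 1), "L4@1": ("L4", 1), "L5@1": ("L5", 1)}.items()}
common = set.intersection(*vectors.values())
```

The set `common` contains the vectors that detect all three faults at once.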


7.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest.

Test vector generation requires analyzing the faults.

We can simplify the task of fault analysis by reducing the number of faults that we have to analyze.

Smith has examples of this in Figures 14.13 and 14.14.


7.2.2.1 Fault Domination

fault     eqn     test vectors
1) L5@1   ab+c    101, 001
2) L6@1   1       101, 001, 100, 010, 000

[Figure columns: Karnaugh map of each faulty circuit and its difference from the good circuit]

Any test vector that detects L5@1 will also detect L6@1: L5@1 is detected by 101 and 001, each of which will detect L6@1. L6@1 does not dominate L5@1, because there is at least one test vector that detects L6@1 but does not detect L5@1 (e.g. each of 100, 010, 000 detects L6@1 but not L5@1).

Definition dominates: f1 dominates f2 if any test vector that detects f1 will also detect f2.

When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.

L5@1 dominates L6@1.

When choosing test vectors we can ignore L6@1 and just include L5@1.

Question: To detect both L5@1 and L6@1, can we ignore one of the faults?

Answer: We can ignore L6@1, because L5@1 dominates L6@1: each test vector that detects L5@1 also detects L6@1.

Question: What would happen if we ignored the “wrong” fault?

Answer: If we ignore L5@1, but keep L6@1, we can choose any of 5 test vectors that detect L6@1. If we chose 100, 010, or 000 as our test vector to detect L6@1, then we would not detect L5@1.
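Using the definition given in these notes, domination is a subset test on sets of detecting vectors. The sketch below assumes the same reconstructed netlist for z = ab + bc as the fault equations imply (L1, L2, L3 inputs; L4, L5 fanout branches of b; L6, L7 AND outputs; L8 output); the function names are hypothetical.

```python
from itertools import product


def z(a, b, c, fault=None):
    """Output of z = ab + bc with an optional stuck-at fault (net, value)."""
    def net(name, driven):
        return fault[1] if fault and fault[0] == name else driven
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)
    return net("L8", net("L6", l1 & l4) | net("L7", l5 & l3))


def detecting_vectors(fault):
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, fault)}


def dominates(f1, f2):
    """f1 dominates f2 if every vector that detects f1 also detects f2."""
    return detecting_vectors(f1) <= detecting_vectors(f2)


l5_at_1, l6_at_1 = ("L5", 1), ("L6", 1)
missed_if_wrong = detecting_vectors(l6_at_1) - detecting_vectors(l5_at_1)
```

`missed_if_wrong` lists the vectors that would detect L6@1 but let L5@1 escape, which is exactly the risk of keeping the dominated fault instead of the dominant one.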

7.2.2.2 Fault Equivalence

fault     eqn
1) L1@1   b
2) L3@1   b

[Figure columns: Karnaugh map of each faulty circuit and its difference from the good circuit]

The two faults above are “equivalent”.

Definition fault equivalence: f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa.

When choosing test vectors we can ignore one of the faults and just include the other.

7.2.2.3 Gate Collapsing

A controlling value on an input to a gate forces the output to be the controlled value. If a stuck-at fault on the input causes the input to have a controlling value, then that fault is equivalent to the output having a stuck-at fault of being at the controlled value.

For example, a 1 on the input to an OR gate will force the output to be 1. So, a stuck-at-1 fault on either input to an OR gate is equivalent to a stuck-at-1 fault on the output of the gate, and is equivalent to a stuck-at-1 fault on any other input to the OR gate.

A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.

Definition Gate collapsing: The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
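The OR-gate case can be confirmed by comparing the behaviour of an isolated two-input gate under each fault. This is a minimal sketch; the pin names "x", "y", "out" and the function names are assumptions chosen for illustration.

```python
from itertools import product


def or_gate(x, y, fault=None):
    """Two-input OR gate with an optional stuck-at fault (pin, value)."""
    if fault and fault[0] == "x":
        x = fault[1]
    if fault and fault[0] == "y":
        y = fault[1]
    out = x | y
    if fault and fault[0] == "out":
        out = fault[1]
    return out


def behaviour(fault):
    """Outputs of the (possibly faulty) gate for inputs 00, 01, 10, 11."""
    return tuple(or_gate(x, y, fault) for x, y in product((0, 1), repeat=2))


# Stuck-at-1 on either input behaves exactly like stuck-at-1 on the output.
collapsible = behaviour(("x", 1)) == behaviour(("y", 1)) == behaviour(("out", 1))
# Stuck-at-0 is not a controlling value for OR, so it does not collapse.
not_collapsible = behaviour(("x", 0)) != behaviour(("out", 0))
```

The same check with the AND operator and stuck-at-0 faults gives the corresponding collapsible set for an AND gate.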

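Sets of collapsible faults for common gates

[Figure: AND gate with @0 on both inputs and on the output; OR gate with @1 on both inputs and on the output]

Question: What is the set of collapsible faults for a NAND gate?

Answer: To determine the collapsible faults, treat the NAND gate as an AND gate followed by an inverter, then invert the faults on the output of the gate.

[Figure: AND + NOT with @0 on both inputs and @0 on the AND output; NAND gate with @0 on both inputs and @1 on the output]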

7.2.2.4 Node Collapsing

Note: Node collapsing is relevant only for the pin-fault model

When two segments affect the same set of gates (ignoring any gates between the two segments), then faults on the two segments can be collapsed.

With an inverter or buffer, the segment on the input affects the same gates as the output. Therefore, faults on the input and output segments are equivalent.

Sets of collapsible faults for nodes

[Figure: inverter with input @1 equivalent to output @0 (NOT-1), and inverter with input @0 equivalent to output @1 (NOT-0)]

With the net-fault model, which is the one we are using in E&CE 327, inverters and buffers are the only gates where node collapsing is relevant.

With the pin-fault model, where faults are modelled as occurring on the pins of gates, there are other instances where node collapsing can be used.

7.2.2.5 Fault Collapsing Summary

When calculating the test vectors to detect a set of faults, apply the fault collapsing techniques of:

• gate collapsing

• node collapsing (if using pin-fault model)

• general fault equivalence (intelligent collapsing)

• fault domination

to reduce the number of faults that you must examine.

Fault collapsing is an optimization. If you skip this step, you will still get the correct answer; it will just take more work, because in each step you will analyze a greater number of faults than if you do fault collapsing.

7.2.3 Fault Coverage

Definition Fault coverage: percentage of detectable faults that are detected by a set of test vectors.

$$\mathit{FaultCoverage} = \frac{\mathit{DetectedFaults}}{\mathit{DetectableFaults}}$$
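As an illustration of the formula, the sketch below computes single stuck-at coverage for the example circuit z = ab + bc used in section 7.2. The netlist (L1, L2, L3 inputs; L4, L5 fanout branches of b; L6, L7 AND outputs; L8 output) is an assumption reconstructed from the fault equations, and the function names are hypothetical.

```python
from itertools import product


def z(a, b, c, fault=None):
    """Output of z = ab + bc with an optional stuck-at fault (net, value)."""
    def net(name, driven):
        return fault[1] if fault and fault[0] == name else driven
    l1, l2, l3 = net("L1", a), net("L2", b), net("L3", c)
    l4, l5 = net("L4", l2), net("L5", l2)
    return net("L8", net("L6", l1 & l4) | net("L7", l5 & l3))


def detecting_vectors(fault):
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if z(a, b, c) != z(a, b, c, fault)}


ALL_FAULTS = [(f"L{i}", v) for i in range(1, 9) for v in (0, 1)]


def coverage(suite):
    """DetectedFaults / DetectableFaults for a suite of test vectors."""
    detectable = [f for f in ALL_FAULTS if detecting_vectors(f)]
    detected = [f for f in detectable if detecting_vectors(f) & set(suite)]
    return len(detected) / len(detectable)


single = coverage({"101"})
full = coverage({"010", "110", "011", "101"})
```

With this model every one of the 16 single stuck-at faults is detectable, so the two common definitions of the denominator give the same number for this circuit.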

Some people’s definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.

If the denominator is AllPossibleFaults, then, if a circuit has 100% single stuck-at fault coverage with a suite of test vectors, then each stuck-at fault in the circuit can be detected by one or more vectors in the suite. This also means that the circuit has no undetectable faults, and hence, no redundant circuitry.

Even if the denominator is AllPossibleFaults, it is possible that achieving 100% coverage for single stuck-at faults will allow defective chips to pass if they have faults that are not stuck-at-1 or stuck-at-0.

I think, but haven’t seen a proof, that achieving 100% single stuck-at coverage will detect all combinations of multiple stuck-at faults. But, if you do not achieve 100% coverage, then a stuck-at fault that you aren’t testing for can mask (hide) a fault that you are testing for.

NOTE: In Smith’s book, undetectable faults don’t hurt your coverage. This is not universally true.

7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing.

Both require:

• generate test vectors

• override normal datapath to send test vectors, rather than normal inputs, as inputs to flops

• compare outputs of flops to expected result

7.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day.

We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient.

The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research.

A trendy idea is to use Genetic Algorithms (inspired by how DNA works) to generate test vectors that catch the maximum number of faults.

The “classic” algorithm is the D algorithm invented by Roth in 1966 (Smith 14.5.1, 14.5.2).

An enhanced version is the Path-Oriented D Algorithm (PODEM), which supports reconvergent fanout and was developed by Goel in 1981 (Smith 14.5.3).

[Figure: circuit with inputs a, b, c, output z = ab + bc, and fault locations L1 to L8, with its Karnaugh map over a and bc]

Figure 7.1: Example Circuit with Fault Locations and Karnaugh Map

7.2.5.1 Collapse the Faults

Initial circuit with potential faults:

[Figure: example circuit with each of L1 to L8 marked with potential faults @0 and @1]

Gate collapsing:

• L1@0, L4@0, L6@0 (AND gate: @0 on inputs and output)
• L3@0, L5@0, L7@0 (AND gate: @0 on inputs and output)
• L6@1, L7@1, L8@1 (OR gate: @1 on inputs and output)

Node Collapsing

Node collapsing: none applicable (no inverters or buffers).

Remaining faults: L1@1, L2@0,1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0,1

Intelligent Collapsing

Sometimes, after the regular forms of fault collapsing have been done, there will still be some sets of equivalent faults in the circuit. It is usually beneficial to quickly look for patterns or symmetries in the circuit that will indicate a set of potentially equivalent faults.

• L2@0, L8@0: both L2@0 and L8@0 result in the equation 0.

• L1@1, L3@1: both L1@1 and L3@1 result in the equation b.

Remaining faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.


7.2.5.2 Check for Fault Domination


(K-map and difference-with-circuit columns omitted.)

      fault   eqn     dominated by
  1)  L2@1    a+c     L4@1, L5@1
  2)  L3@1    b
  3)  L4@1    a+bc
  4)  L5@1    ab+c
  5)  L6@0    bc
  6)  L7@0    ab
  7)  L8@0    0       L6@0, L7@0
  8)  L8@1    1       L2@1, L3@1, L4@1, L5@1


Remove Dominated Faults

Current faults: L2@1, L3@1, L4@1, L5@1, L6@0, L7@0, L8@0, L8@1.

Dominated faults: L2@1, L8@0, L8@1.

(K-map and difference-with-circuit columns omitted.)

      fault   eqn
  1)  L3@1    b
  2)  L4@1    a+bc
  3)  L5@1    ab+c
  4)  L6@0    bc
  5)  L7@0    ab

Remaining faults: L3@1, L4@1, L5@1, L6@0, L7@0.

7.2.5.3 Required Test Vectors

If we have any faults that are detected by just one test vector, then we must include that test vector in our suite.

Definition required test vector: A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.

Required vectors:

  fault   vector
  L3@1    010
  L6@0    110
  L7@0    011

7.2.5.4 Faults Not Covered by Required Test Vectors


(K-map and difference-with-circuit columns omitted.)

      fault   eqn
  1)  L4@1    a+bc
  2)  L5@1    ab+c

The intersection of the two difference regions is 101. Choosing 101 detects both L4@1 and L5@1. Add 101 to the suite of test vectors. The final set of test vectors is: 010, 110, 011, 101.


7.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip’s fault is detected.

The first vector to run should be the one that detects the most faults.

Build a table for which faults each test vector will detect.

(K-maps omitted. A 1 means that the test vector detects the fault.)

                  Test Vector
      fault   110  010  011  101
  1)  L1@0     1
  2)  L1@1          1
  3)  L2@0     1         1
  4)  L2@1                    1
  5)  L3@0               1
  6)  L3@1          1
  7)  L4@0     1
  8)  L4@1                    1
  9)  L5@0               1
 10)  L5@1                    1
 11)  L6@0     1
 12)  L6@1          1         1
 13)  L7@0               1
 14)  L7@1          1         1
 15)  L8@0     1         1
 16)  L8@1          1         1

 Faults detected
               5    5    5    6

101 detects the most faults, so we should run it first.

This reduces the faults found by 010 from 5 to 2 (because L6@1, L7@1, and L8@1 will be found by 101).

This leaves 110 and 011 with 5 faults each. We can run them in either order, then run 010.

We settle on a final order for our test suite of: 101, 011, 110, 010.
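The ordering argument is a greedy selection: at each step, run the vector that detects the most faults not yet detected. The sketch below illustrates it for all 16 faults. It assumes the line placement implied by the fault equations (L1 = a, L2 = b stem, L3 = c, L4/L5 = branches of b, L6 = ab, L7 = bc, L8 = z), and it breaks ties by picking the smaller bit string, which is an arbitrary choice that happens to reproduce the order chosen above.

```python
def simulate(a, b, c, fault=None):
    """z = ab + bc with an optional single stuck-at fault (line, value)."""
    def line(name, value):
        return fault[1] if fault and fault[0] == name else value

    l1, l2, l3 = line("L1", a), line("L2", b), line("L3", c)
    l4, l5 = line("L4", l2), line("L5", l2)
    l6 = line("L6", l1 & l4)
    l7 = line("L7", l5 & l3)
    return line("L8", l6 | l7)

def detects(vector, fault):
    """True if the vector 'abc' exposes the fault at output z."""
    a, b, c = (int(ch) for ch in vector)
    return simulate(a, b, c, fault) != simulate(a, b, c)

all_faults = [(f"L{n}", v) for n in range(1, 9) for v in (0, 1)]
suite = ["110", "010", "011", "101"]

def greedy_order(vectors, faults):
    """Order vectors by how many still-undetected faults each one finds."""
    pending, undetected, order = list(vectors), set(faults), []
    while pending:
        # Most newly detected faults first; ties go to the smaller bit string.
        best = min(
            pending,
            key=lambda v: (-sum(detects(v, f) for f in undetected), v),
        )
        order.append(best)
        pending.remove(best)
        undetected -= {f for f in undetected if detects(best, f)}
    return order, undetected

order, missed = greedy_order(suite, all_faults)
print(order, "undetected:", missed)
```

An empty `missed` set confirms that the four vectors give 100% coverage of the single stuck-at faults in this model.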

7.2.5.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults

2. gate collapsing

3. node collapsing

4. intelligent collapsing

5. fault domination

6. determine required test vectors

7. choose minimal set of test vectors to detect remaining faults

8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)


7.2.5.7 Complete Analysis

In case you don’t trust the fault collapsing analysis, here’s the complete analysis.


(K-map and difference-with-circuit columns omitted.)

      fault   eqn     notes
  1)  L1@0    bc
  2)  L1@1    b
  3)  L2@0    0       dominated by 1, 5
  4)  L2@1    a+c     dominated by 8, 10
  5)  L3@0    ab
  6)  L3@1    b       same as 2
  7)  L4@0    bc      same as 1
  8)  L4@1    a+bc
  9)  L5@0    ab      same as 5
 10)  L5@1    ab+c
 11)  L6@0    bc      same as 1
 12)  L6@1    1       dominated by 8, 10
 13)  L7@0    ab      same as 5
 14)  L7@1    1       same as 12
 15)  L8@0    0       same as 3
 16)  L8@1    1       same as 12


7.2.6 One Fault Hiding Another

[Circuit diagram: the example circuit with fault locations L1 to L8]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Circuit diagrams: the example circuit with only L1 and L3 marked]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.

In the presence of other faults, the set of test vectors to detect a fault will change.

(K-map and difference-with-circuit columns omitted.)

  fault(s)       eqn
  L3@0           ab
  L1@1, L3@0     b

7.3 Scan Testing in General

Scan testing is based on the techniques described in section 7.2.5. The generation of test vectors and the checking of the result are done off-chip. In comparison, built-in self test (section 7.5) does test-vector generation and result checking on chip. Scan testing has the advantage of flexibility and reduced on-chip hardware, but increases the length of time required to run a test. In scan testing, we want to individually drive and read every flop in the circuit.

Even without using any I/O pins for testing purposes, chips are already I/O bound, so scan testing must be very frugal in its use of pins. Flops are connected together in “scan chains” with one input pin and one output pin.


7.3.1 Structure and Behaviour of Scan Testing

[Figure: Normal Circuit. Blocks: circuit under test, another circuit #0, another circuit #1. Signals: data_in(0..3), zeta_in(0..3).]

[Figure: Circuit with Scan Chains Added. Blocks: circuit under test, another circuit, yet another circuit, scan chain 0 (mode0, scan_in0, scan_out0), scan chain 1 (mode1, scan_in1, scan_out1).]

7.3.2 Scan Chains

7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: Normal Mode. Signals: mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1, data_in(0..3), zeta_in(0..3); block: circuit under test.]

[Figure: Scan Mode. Signals: mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1; block: circuit under test.]

7.3.2.2 Scan in Operation

[Figure: Circuit under test with scan chains]

Sequence of load; test; unload:

1. Load test vector (1 cycle per bit)

2. Run test vector through circuit

3. Unload result (1 cycle per bit)

Unload and Load at Same Time

1. Unload previous result and load current test vector (1 cycle per bit)

2. Run current test vector through circuit

3. Unload current result and load new test vector (1 cycle per bit)

[Timing diagram: Sequence of load; run; unload. Signals: clk, mode0, scan_in0, scan_out0, scan_in1, scan_out1. The scan inputs carry the current and then the next test vector while the scan outputs carry the previous and then the current results.]


7.3.2.3 Scan in Operation with Example Circuit

[Figure sequence for an example circuit with flops a, b, c, d, z, y, two scan chains (mode0/scan_in0/scan_out0 and mode1/scan_in1/scan_out1), and test values α, β, γ, δ. One snapshot per step:]

1. Circuit under test

2. Circuit under test with scan test circuitry

3. Start Loading Test Vector (Load δ)

4. Load γ

5. Load β

6. Load α

7. Run Test Vector

8. Test Values Propagate

9. Flop-In Result, Start (Un)loading Test Vector

10. Continue (Un)loading Test Vector

11. Finish (Un)loading Test Vector

12. Run Next Test Vector

7.3.3 Summary of Scan Testing

• Adding scan circuitry

1. Registers around circuit to be tested are grouped into scan chains

2. Replace each flop with mux + flop

3. Flops and muxes wired together into scan chains

4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors


• Running test vectors

1. Put scan chain in “scan” mode

2. Load in test vector (one element of vector per clock cycle)

3. Put scan chain in “normal” mode

4. Run circuit for one clock cycle — load result of test into flops

5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)

7.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n+1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

ScanLength = number of flip flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan   = number of clock cycles to run test suite
           = NumVectors × (ScanLength + 1) + ScanLength

7.3.4.1 Example: Time to Test a Chip

An 800MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits.

500,000 test vectors are used for each scan chain.

The tests are run at 80% of full speed.

Question: Calculate the total test time.

Answer:

We can load and unload all of the scan chains at the same time, so time will be limited by the longest (22,000 bits).

For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result.

Loading the second test vector is done while unloading the first.

TimeTot = ClockPeriod × (MaxLengthVec + NumVecs × (MaxLengthVec + 1))

        = (1/(0.80 × 800 × 10^6)) × (22,000 + 500,000 × (22,000 + 1))

        = 17 secs
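The arithmetic can be checked with a few lines of Python. This is a sketch; the function and parameter names are hypothetical, and the formula is the TimeScan equation from section 7.3.4 with the longest chain setting the pace.

```python
def scan_test_cycles(scan_length, num_vectors):
    """Clock cycles to run a suite: each vector costs a shift plus one run
    cycle, and the last result needs one extra shift to unload."""
    return num_vectors * (scan_length + 1) + scan_length

def scan_test_seconds(chain_lengths, num_vectors, clock_hz, speed_fraction=1.0):
    """Wall-clock time when all chains shift in parallel.

    The longest chain limits the test time; speed_fraction scales the
    clock (0.80 means the tests run at 80% of full speed).
    """
    cycles = scan_test_cycles(max(chain_lengths), num_vectors)
    return cycles / (clock_hz * speed_fraction)

chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
cycles = scan_test_cycles(max(chains), 500_000)
seconds = scan_test_seconds(chains, 500_000, 800e6, 0.80)

print(f"{cycles:,} cycles, {seconds:.1f} s")
```

With a single vector the cycle count reduces to 2n+1, which matches the single-test case described in section 7.3.4.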

7.4 Boundary Scan and JTAG

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).

The goal was to replace “bed-of-nails” style testing with a technique that would work for high-density PCBs (lots of small wires close together).

Now used to test both boards and chip internals.

Used both on boundaries (I/O pins) and internal flops.

Boundary Scan with JTAG

Standardized by IEEE (1149) and previously by JTAG:

• 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)

• 1 optional signal (Scan Pin:TRST)

• protocol to connect circuit under test to tester and other circuits

• state machine to drive test circuitry on chip

• Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you’ll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.

7.4.1 Boundary Scan History

1985 JETAG: Joint European Test Action Group

1986 JTAG (North American companies joined)

1990 JTAG 2.0 formed basis for IEEE 1149.1 “Test access port and boundary scan architecture”


7.4.2 JTAG Scan Pins

TDI   −→  test data input: input test vector to chip

TDO   ←−  test data output: output result of test

TCK   −→  test clock: clock signal that test runs on

TMS   −→  test mode select: controls scan state machine

TRST  −→  test reset (optional): resets the scan state machine

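[Figure: High-level view. A chip containing the circuit under test, scan registers, and control, with pins TDI, TDO, TCK, TMS plus the normal input pins and normal output pins.]

[Figure: Detailed view. A chip containing the circuit under test, boundary scan cells (BSC) forming the boundary scan register (BSR), bypass register (BR), IDCODE register, instruction register (IR) built from IR cells (IRC), instruction decoder, and TAP controller, with pins TDI, TDO, TCK, TMS.]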

7.4.3 Scan Registers and Cells

Basic Building Blocks

TDR: Test data register. The boundary scan registers on a chip.

DR (Fig 14.2): Data register cell. Often used as a boundary scan cell (BSC).

JTAG Components

(Fig 14.8): Top level diagram.

BSR (Fig 14.5): Boundary scan register. A chain of boundary scan cells (BSCs).

BSC (Fig 14.2): Boundary scan cell. Connects external input and scan signal to internal circuit. Acts as wire between external input and internal circuit in normal mode.

BR (Fig 14.3): Bypass-register cell. Allows direct connection from TDI to TDO. Acts as a wire when executing BYPASS instruction.

IDCODE: Device identification register. Data register to hold manufacturer’s name and chip identifier. Used in IDCODE instruction.

IR cell (Fig 14.4): Instruction register cell. Cells are combined together as a shift register to form an instruction register (IR).

IR (Fig 14.6): Instruction register. Two or more IR cells in a row. Holds data that is shifted in on TDI, sends this data in parallel to instruction decoder.

IDecode (Table 14.4): Instruction decoder. Reads instruction stored in instruction register (IR) and sends control signals to bypass register (BR) and boundary scan register (BSR).

TAP Controller (Fig 14.7): State machine that, together with instruction decoder, controls the scan circuitry.

7.4.4 Scan Instructions

This is the set of required instructions; other instructions are optional.

EXTEST    Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.

SAMPLE    Sample result data

PRELOAD   Load test vector

BYPASS    Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.

IDCODE    Output manufacturer and part number

7.4.5 TAP Controller

The TAP controller is required to have 16 states and obey the state machine shown in Fig 14.7 of Smith.


7.4.6 Other descriptions of JTAG/IEEE 1149.1

Texas Instruments introductory seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar1.pdf

Texas Instruments intermediate seminar on IEEE 1149.1: http://www.ti.com/sc/docs/jtag/seminar2.pdf

Sun microSPARC-IIep scan-testing documentation: http://www.sun.com/microelectronics/whitepapers/wpr-0018-01/

Intellitech JTAG overview: http://www.intellitech.com/resources/technology.html

Actel’s JTAG description: http://www.actel.com/appnotes/97s05d15.pdf

Description of JTAG support on Motorola ColdFire microprocessor: http://e-www.motorola.com/collateral/MCF5307TR-JTAG.pdf


7.5 Built In Self Test

With built-in self test, the circuit tests itself. Both test vector generation and checking are done using linear feedback shift registers (LFSRs).

7.5.1 Block Diagram

[Block diagram: BIST. Blocks: test generator (test gen LFSR), circuit under test, signature analyzers 0 to 3, result checker. Signals: mode, data_in(0..3), diz(0..3), d_out(0..3), ok(0..3), all_ok.]

7.5.1.1 Components

There is one test generator per group of inputs (or internal flops) that drive the same circuit to be tested.

There is one signature analyzer per output (or internal flop).

Note: MISR. An exception to the above rule is a multiple input signature register (MISR), which can be used to analyze several outputs of the circuit under test.

The test generator and signature analyzer are both built with linear-feedback shift registers.


Test Generator

• generates a pseudo-random set of test vectors

• for n output bits, generates all vectors from 1 to 2^n − 1 in a pseudo-random order

• built with a linear-feedback shift register (shift-register portion is the input flops)

The figure below shows an LFSR that generates all possible 3-bit vectors except 000. (An n-bit LFSR that generates 2^n − 1 different vectors is called a “maximal-length LFSR”.)

Assume that reset initializes the circuit to 111. The sequence that is generated is: 111, 011, 001, 100, 010, 101, 110. This sequence is repeated, so the number after 110 is 111.

[Figure: 3-bit LFSR with flop outputs q2, q1, q0]
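The 3-bit sequence listed above can be reproduced with a short simulation. This is a sketch under an assumption: the wiring is inferred from the listed sequence rather than read from the figure. Each step shifts the vector one place to the right and inserts, on the left, the XOR of the two rightmost bits of the old vector. The function name is hypothetical.

```python
def lfsr_sequence(start="111"):
    """Return the states visited before the LFSR returns to its start state.

    Assumed wiring (inferred from the sequence in the notes): shift right,
    and the new leftmost bit is the XOR of the old middle and right bits.
    """
    state, seq = start, []
    while True:
        seq.append(state)
        new_bit = int(state[1]) ^ int(state[2])
        state = str(new_bit) + state[:2]   # shift right, insert feedback bit
        if state == start:
            return seq

print(lfsr_sequence("111"))   # seven distinct non-zero vectors
print(lfsr_sequence("000"))   # the all-zero state never leaves zero
```

Starting from 000 the feedback bit is always 0, which is why a maximal-length LFSR must be initialized to a non-zero value.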

Question: Why not just use a counter to generate 1..2^n − 1?

Answer:

• An LFSR has less area than an incrementer. Just a few XOR gates for an LFSR, compared to a half-adder per bit for an incrementer.

• There is a strong correlation between consecutive test vectors generated by an incrementer, while there is little apparent correlation between consecutive test vectors generated by an LFSR. When doing speed binning, if consecutive test vectors should generate the same output, we cannot distinguish between a slow critical path and a correctly working circuit.

Signature Analyzer

Checking is done by building one signature analyzer circuit for each signal tested. The circuit returns true if the signal generates the correct sequence of outputs for the test vectors. Doing this with complete accuracy would require storing 2^n bits of information for each output for a circuit with n inputs. This would be as expensive as the original circuit. So, BIST uses mathematics similar to error correction/detection to approximate whether the outputs are correct. This technique is called “signature analysis” and originated with Hewlett-Packard in the 1970s.

The checking is done with an LFSR, similar to the BIST generation circuit. The checking circuit is designed to output a 1 at the end of the sequence of 2^n − 1 test results if the sequence of results matches the correct circuit. We could do this with an LFSR of 2^n − 1 flops, but as said before, this would be at least as expensive as duplicating the original circuit.

The checking LFSR is designed similarly to a hashing function or parity checking circuit. If it returns 0, then we know that there is a fault in the circuit. If it returns a 1, then there is probably not a fault in the circuit, but we can’t say for sure.


There is a tradeoff between the accuracy of the analyzer and its area. The more accurate it is, the more flip flops are required.

Summary: the signature analyzer:

• checks that the output it is examining has the correct results for the complete set of tests that are run

• only has a meaningful result at the end of the entire test sequence

• built with a linear-feedback shift register

• similar to a hash function or a lossy compression function

• if there are no faults, the signature analyzer will definitely say “ok” (no false negatives)

• if there is a fault, the signature analyzer might say “ok” or might say “bad” (false positives are possible)

• design tradeoff: more accurate signature analyzers require more hardware

Result Checker

• signature analyzers output “ok”/“bad” on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors

• the result checker looks at test vector inputs to detect the end of the test suite and outputs “all ok” if all signature analyzers report “ok” at that moment

• implemented as an AND gate

7.5.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.

Design parameters:

• number of flip-flops

• external or internal XOR

• feedback taps (coefficients)

• external-input or self-contained

• reset or set

Example LFSRs

[Figures: four 3-flop LFSRs (flops d0/q0, d1/q1, d2/q2), captioned:]

• External-XOR, input, reset

• External-XOR, no input, set

• Internal-XOR, input, set

• Internal-XOR, input, reset

In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields.

External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can’t be treated as Galois fields.

7.5.1.3 Maximal-Length LFSR

Definition maximal-length linear feedback shift register: An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition pseudo random: The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.

Maximal-Length LFSR Circuits

The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.

[Figures: two maximal-length internal-XOR LFSRs, each with three flops (d0/q0, d1/q1, d2/q2) and a set input]


Question: Why do maximal-length LFSRs not generate the test vector 0...00?

Answer: If all flops had 0 as output, then the LFSR would get stuck at 0 and would generate only 0...00 in the future.

Maximal-length LFSRs:

• set to all 1s initially

• self contained (no external i input)

[Timing diagram for a 3-flop maximal-length LFSR. Signals: clk, reset, d0, q0, d1, q1, q2, val, over clock cycles 1 to 8. val starts at 7 and steps through 6, 4, 1, 2, 5, 3 before returning to 7.]

7.5.2 Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip flop is connected to the output of the previous flip flop. In normal mode, the input of each flip flop is connected to the environment.

[Figure: test generator. Three flops (d0/q0, d1/q1, d2/q2) with a set input, external inputs i_d(0), i_d(1), i_d(2), a mode select, and outputs q2 q1 q0.]


7.5.3 Signature Analyzer

The following things change between different signature analyzers:

• number of flops (⇑ flops =⇒ ⇑ area, ⇑ accuracy)

• choice of feedback taps: a good choice can improve accuracy (more isn’t necessarily better)

• bubbles on input to AND gate for “ok”: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer.

This circuit:

• Two flops; most analyzers use more. The HP boards in the 1970s used 37 flops!

• Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.

• Also contains “ok” tester (AND gate). Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of bubble on AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

[Figure: two-flop signature analyzer with input i, flops d0/q0 and d1/q1, reset, and an AND gate that produces ok.]

[Timing diagram: signals clk, reset, i, d0, q0, d1, q1. The input stream is i6, i5, i4, i3, i2, i1, i0, and both flops start at 0. Flop contents are XORs of input bits, abbreviated by their indices, for example 356 = i3⊕i5⊕i6 and 2356 = i2⊕i3⊕i5⊕i6.]

7.5.4 Result Checker

The purpose of the result checker is to check the “ok” circuit at the end of the test sequence. To do this, we need to recognize the end of the test sequence. The simplest way to do this is to notice that the first test vector is all 1s and that the test vector sequence will repeat as long as the circuit is in test mode.

We want to sample the “ok” signal one clock cycle after the sequence is over. This is the same as the first clock cycle of the second test sequence. In this clock cycle, the output of the test generator will be all 1s and reset will be 0. We need to look at reset, because otherwise we could not distinguish the first sequence (when reset is 1) from the subsequent sequences.

[Figure: result checker with inputs q0, q1, q2, reset, and ok, and output all_ok]

7.5.5 Arithmetic over Binary Fields

• Galois Fields!

• Two operations: “+” and “×”

• Two values: 0 and 1

• Bit vectors and shift-registers are written as polynomials in terms of x.

+ represents XOR

  expression   result
  0+0          0
  0+1          1
  1+0          1
  1+1          0
  x+x          0

× represents concatenating shift registers

  expression   result
  x^4 × 1      x^4
  x^2 × x^3    x^5

Example

Calculate (x^3 + x^2 + 1) × (x^2 + x)

  x^2 × (x^3 + x^2 + 1) = x^5 + x^4       + x^2
  x   × (x^3 + x^2 + 1) =       x^4 + x^3       + x
  ---------------------------------------------------
                          x^5       + x^3 + x^2 + x
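The same multiplication can be carried out on bit vectors. In this sketch, which is an illustration and not part of the original notes, a polynomial is stored as a Python integer whose bit k is the coefficient of x^k; addition is XOR and multiplying by x^k is a left shift by k. The function names are hypothetical.

```python
def gf2_mul(a, b):
    """Multiply two polynomials with coefficients in {0, 1}.

    Bit k of each integer is the coefficient of x^k.
    """
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift   # add (XOR) the partial product a * x^shift
        b >>= 1
        shift += 1
    return result

def poly_str(p):
    """Render an integer-encoded polynomial, e.g. 0b1011 -> 'x^3 + x + 1'."""
    terms = []
    for k in range(p.bit_length() - 1, -1, -1):
        if (p >> k) & 1:
            terms.append("1" if k == 0 else "x" if k == 1 else f"x^{k}")
    return " + ".join(terms) or "0"

a = 0b1101   # x^3 + x^2 + 1
b = 0b0110   # x^2 + x
print(f"({poly_str(a)}) × ({poly_str(b)}) = {poly_str(gf2_mul(a, b))}")
```

The two x^4 partial-product terms cancel under XOR, which is why no x^4 term appears in the product.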

7.5.6 Shift Registers and Characteristic Polynomials

Each linear feedback shift register has a corresponding characteristic polynomial.

The exponents in the polynomial correspond to the delay: x^0 is the input to the shift register, x^1 is the output of the first flip-flop, x^2 is the output of the second, etc. The coefficient is 1 if the output feeds back into the flip flop. Usually (internal flops or input flops with an external input), the feedback is done via an XOR gate. For input flops without an external input signal, the feedback is done directly, with a wire. The non-existent external input is equivalent to a 0, and 0 XOR a simplifies to a, which is a wire.

From polynomials to hardware:

• The maximum exponent denotes the number of flops

• The other exponents denote the flops that tap off of feedback line from last flop

• From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.
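The mapping from polynomial to hardware can be sketched as a one-cycle update function for an internal-XOR LFSR. This is an illustrative model under stated assumptions: the polynomial is an integer whose bit k is the coefficient of x^k, the state is the list [q0, q1, ...], a coefficient of 1 at x^k means the last flop is XORed into the input of flop k, and a missing external input is modelled as i = 0. The names are hypothetical.

```python
def lfsr_step(q, poly, i=0):
    """Advance an internal-XOR LFSR by one clock cycle.

    q    : list of flop outputs [q0, q1, ..., q(n-1)]
    poly : characteristic polynomial; bit k set means the feedback line
           from the last flop taps into the input of flop k
    i    : external input bit (0 for a self-contained LFSR)
    """
    fb = q[-1]                               # feedback from the last flop
    nxt = []
    for k in range(len(q)):
        prev = i if k == 0 else q[k - 1]     # value shifting into flop k
        tap = (poly >> k) & 1
        nxt.append(prev ^ (fb & tap))
    return nxt

def states_from(start, poly):
    """States visited from start until some state repeats."""
    seen, q = [], list(start)
    while q not in seen:
        seen.append(q)
        q = lfsr_step(q, poly)
    return seen

# Try every degree-3 polynomial, starting from all 1s with no external input.
maximal = [
    poly for poly in range(0b1000, 0b10000)
    if len(states_from([1, 1, 1], poly)) == 7
]
print([bin(p) for p in maximal])
```

In this model exactly two degree-3 polynomials visit all seven non-zero states, x^3 + x + 1 and x^3 + x^2 + 1.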


[Figures: example shift registers with reset and external input i, with tap positions labelled x^0, x^1, x^2, x^3 (and x^4 for the last one). Their characteristic polynomials are:]

• p(x) = x^3

• p(x) = x^3 + x

• p(x) = x^3 + 1

• p(x) = x^3 + x + 1

• p(x) = x^3 + x^2 + x + 1

• p(x) = x^4 + x^3 + x + 1

7.5.6.1 Circuit Multiplication

Redoing the multiplication example (x^2 + x) × (x^3 + x^2 + 1) as circuits:

[Figures: shift-register circuits for x^2 + x, for x^3 + x^2 + 1, and for their product]

(x^2 + x) × (x^3 + x^2 + 1)

  = x × (x^3 + x^2 + 1)

  + x^2 × (x^3 + x^2 + 1)

  = x^5 + x^3 + x^2 + x

The flop for the most-significant bit is represented by a coefficient of 1 for the maximum exponent in the polynomial. Hence, the MSB of the first partial product cancels the x^4 of the second partial product, resulting in a coefficient of 0 for x^4 in the answer.

7.5.7 Bit Streams and Characteristic Polynomials

A bit stream, or bit sequence, can be represented as a polynomial.

The oldest (first) bit in a sequence of n bits is represented by x^(n−1) and the youngest (last) bit is x^0.

The bit sequence 1010011 can be represented as x^6 + x^4 + x + 1:

  1 0 1 0 0 1 1 = 1x^6 + 0x^5 + 1x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0

                = x^6 + x^4 + x + 1

7.5.8 Division

With rules for multiplication and addition, we can define division.

A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:

m(x) = q(x)× p(x)+ r(x)


In Galois fields, we do division just as with long division in elementary school.

Given:

m(x) = x^6 + x^4 + x^3

p(x) = x^4 + x

Calculate the quotient q(x) and remainder r(x) for m(x) ÷ p(x):

                      x^2        + 1
            -------------------------------------------------
  x^4 + x ) x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
            x^6               + 1x^3
            ------------------------
                         1x^4
                         1x^4                + x
                         -----------------------
                                               x

Quotient q(x) = x^2 + 1

Remainder r(x) = x

Check result:

  m(x) = q(x) × p(x) + r(x)
       = (x^2 + 1) × (x^4 + x) + x
       = x^6 + x^3 + x^4 + x + x
       = x^6 + x^4 + x^3
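Long division over binary fields is short to write down when polynomials are stored as integers (bit k is the coefficient of x^k) and subtraction is XOR. The sketch below is an illustration with hypothetical function names; it repeats the worked example and checks the result against m(x) = q(x) × p(x) + r(x).

```python
def gf2_mul(a, b):
    """Multiply two integer-encoded polynomials over GF(2)."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result

def gf2_divmod(m, p):
    """Return (quotient, remainder) of m(x) divided by p(x) over GF(2)."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        shift = (m.bit_length() - 1) - deg_p   # align leading terms
        q ^= 1 << shift                        # record the quotient term
        m ^= p << shift                        # subtract (XOR) p * x^shift
    return q, m

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b0010010   # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))
print(gf2_mul(q, p) ^ r == m)
```

Each loop iteration corresponds to one row of the long-division layout: pick the quotient term that cancels the leading term, then XOR the shifted divisor into the running remainder.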

7.5.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a “message”, m(x), which is a sequence of n bits represented as a polynomial.

After n shifts through an LFSR with l flops:

• The sequence of output bits forms a quotient, q(x), of length n − l

• The flops in the analyzer form a remainder, r(x), of length l

m(x) = q(x) × p(x) + r(x)

The remainder is the signature.

The mathematics for an LFSR without an input i:

• same polynomial as if the circuit had an input

• input sequence is all 0s

An input stream with an error can be represented as m(x) + e(x)

• e(x) is the error polynomial

• bits in the message that are flipped have a coefficient of 1 in e(x)

m(x) + e(x) = q′(x) × p(x) + r′(x)

The error e(x) will be detected if it results in a different signature (remainder).

m(x) and m(x) + e(x) will have the same remainder iff

e(x) mod p(x) = 0

That is, e(x) must be a multiple of p(x).

The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
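The aliasing condition can be demonstrated with a bit-serial model of the analyzer. This is a sketch under assumptions: the analyzer is an internal-XOR LFSR, the message is shifted in oldest bit first, and the integer `state` holds the flop contents with bit k as the coefficient of x^k. The message and polynomial are borrowed from the division example in section 7.5.8, the two error patterns are made up for illustration, and all function names are hypothetical.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2); p must be non-zero."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        m ^= p << (m.bit_length() - 1 - deg_p)
    return m

def bits_msb_first(m, n):
    """The n-bit message as a list of bits, oldest (x^(n-1)) first."""
    return [(m >> k) & 1 for k in range(n - 1, -1, -1)]

def lfsr_signature(bits, p):
    """Shift bits through an internal-XOR LFSR with polynomial p."""
    n_flops = p.bit_length() - 1
    state = 0
    for bit in bits:
        state = (state << 1) | bit   # shift, new bit enters the first flop
        if state >> n_flops:         # a 1 leaving the last flop feeds back
            state ^= p
    return state

m = 0b1011000            # x^6 + x^4 + x^3
p = 0b10010              # x^4 + x
good = lfsr_signature(bits_msb_first(m, 7), p)

e_multiple = p << 1      # e(x) = x * p(x): a multiple of p(x)
e_single = 0b0001000     # e(x) = x^3: one flipped bit

aliased = lfsr_signature(bits_msb_first(m ^ e_multiple, 7), p)
caught = lfsr_signature(bits_msb_first(m ^ e_single, 7), p)
print(bin(good), bin(aliased), bin(caught))
```

The error that is a multiple of p(x) leaves the signature unchanged, so it would go unnoticed, while the single-bit error changes the signature and is detected.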

7.5.10 Summary

Adding test circuitry

1. Pick number of flops for generator

2. Build generator (maximal-length linear feedback shift register)

3. Pick number of flops for signature analysis

4. Pick coefficients (feedback taps) for analyzer

5. Based on generator, circuit under test, and signature analyzer, determine expected output of analyzer

6. Based on expected output of analyzer, build result checker

Running test vectors

1. Put circuit in test mode

2. Set reset = 1

3. Run one clock cycle, set reset = 0

4. Run one clock cycle for each test vector

5. At end of test sequence, all ok signals should be 1

6. To run n test vectors requires n+1 clock cycles.


BIST for a Simple Circuit

Outline of steps to see if a fault will be detected by BIST:

1. Output sequence from test generator

2. Output sequence from correct circuit

3. Remainder for signature analyzer with correct output sequence

4. Output sequence from faulty circuit

5. Remainder for signature analyzer with faulty output sequence

6. Compare correct and faulty remainder, if different then fault detected

Components

[Figures: the circuit under test (inputs a, b, c; output z; fault locations L1 to L8); a 3-flop test generator with outputs t0, t1, t2; a 3-flop signature analyzer with input z and flops r0, r1, r2; and blank worksheet tables with columns t0 t1 t2 (driving a b c), correct z, faulty z, and z r0 r1 r2.]

Question: Determine if L2@1 will be detected


Test Generation Sequence

[Table: values of t0 t1 t2 on each clock cycle, starting from initial values of all 1s. The final values are a repeat of the initial values.]

Technique is to shift, then compute the result of the XORs.

Equation for correct circuit: ab + bc

Equation for faulty circuit: a + c

Output sequences for correct and faulty circuits

[Table: one row per vector from the test generation sequence, with t0 t1 t2 driving a b c, and the output z of the correct and faulty circuits. Output sequence from the correct circuit: 1110000.]

Signature analyzer sequence for correct circuit

[Table: z, r0, r1, r2 on each clock cycle, with z taken from the output sequence of the correct circuit and r0 r1 r2 starting from initial values of 0. The final r0 r1 r2 is the remainder.]

Signature analyzer sequence for faulty circuit

[Table: z, r0, r1, r2 on each clock cycle, with z taken from the output sequence of the faulty circuit and r0 r1 r2 starting from initial values of 0. The final r0 r1 r2 is the remainder.]
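The first part of this example (generator sequence, then correct and faulty output sequences) can be sketched in Python. Two things here are assumptions and not taken from the notes: the generator is modelled as an internal-XOR LFSR with polynomial x^3 + x^2 + 1 started at all 1s, chosen because it reproduces the correct-circuit output sequence 1110000 given above, and t0, t1, t2 are taken to drive a, b, c in that order. The function names are hypothetical.

```python
def generator_states(poly=0b1101, n=3):
    """States (t0, t1, t2) of an internal-XOR LFSR started at all 1s.

    poly is the assumed characteristic polynomial (bit k = coefficient of
    x^k); the default x^3 + x^2 + 1 is a guess, not taken from the notes.
    """
    q, states = [1] * n, []
    for _ in range(2 ** n - 1):
        states.append(tuple(q))
        fb = q[-1]   # feedback from the last flop
        q = [(q[k - 1] if k else 0) ^ (fb & (poly >> k) & 1) for k in range(n)]
    return states

def correct(a, b, c):
    return (a & b) | (b & c)   # z = ab + bc

def faulty(a, b, c):
    return a | c               # z = a + c, the equation for L2@1

vectors = generator_states()
z_good = "".join(str(correct(*v)) for v in vectors)
z_bad = "".join(str(faulty(*v)) for v in vectors)

print("vectors:", ["".join(map(str, v)) for v in vectors])
print("correct z:", z_good)
print("faulty  z:", z_bad)
```

The two output streams differ in several positions; whether the fault is reported then depends on whether that difference, viewed as e(x), is a multiple of the analyzer's polynomial.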


7.6 Scan vs Self Test

Scan

⇑ less hardware

⇓ slower

⇑ well defined coverage

⇑ test vectors are easy to modify

Self Test

⇓ more hardware

⇑ faster

⇓ ill defined coverage

⇓ test vectors are hard to modify


7.7 Problems on Faults, Testing, and Testability

P7.1 Based on Smith q14.9: Testing Cost

A modern (circa 1995) production tester costs US$5–10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).

1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?

2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?

3. The dimensions of the die to be tested are 20mm × 10mm. The wafers are 200mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area.

What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?

P7.2 Testing Cost and Total Cost

Given information:


• Each board uses two ACHIPs (plus lots of other chips that we don’t care about)



• Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run)

• If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace the ACHIP(s) (This is an approximation, based on the fact that the cost of the chip is much less than the total cost of $200).


What fault escapee rate will result in the lowest total cost for ACHIPs?


P7.3 Minimum Number of Faults

In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo > 1) and an average fanin of fi, what is the minimum number of faults that must be considered when using a single-stuck-at fault model?

P7.4 Smith q14.10: Fault Collapsing

Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.

P7.5 Mathematical Models and Reality

Given a correct circuit, and a non-stuck-at fault (e.g. bridging AND), will a single-stuck-at fault model detect the fault? If so, identify a single-stuck-at fault that will detect it, or explain why it can’t be detected.

P7.6 Undetectable Faults

Identify one of the undetectable single stuck-at faults in the circuit below, or say “NONE” if all single stuck-at faults are detectable.

[Figure: circuit with inputs a, b, c, output z, and fault locations L1–L8.]

P7.7 Test Vector Generation

Your task is to generate test vectors to detect faults in the circuit shown below.

Your manager has said that manufacturing only has time to run three test vectors on the circuit.

[Figure: circuit with inputs a, b, c and fault locations L1–L8.]

P7.7.1 Choice of Test Vectors

Which test vectors should you run and in what order should you run them?

P7.7.2 Number of Test Vectors

Write a brief statement (justified by data) to support either staying with three test vectors or increasing the test suite to four vectors.

P7.8 Time to do a Scan Test

A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and two of 12,000 bits.



Calculate the total test time.

P7.9 BIST

In this problem, we will revisit the circuit from section 7.2.5, which is shown below. But, this time we’ll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.

[Figure: circuit with inputs a, b, c, output z, and fault locations L1–L8.]

P7.9.1 Characteristic Polynomials

Derive the characteristic polynomials for the linear feedback shift registers shown below:

[Figure: two linear feedback shift registers, each built from three flops with S and R inputs driven by a shared set signal, and data signals d0 q0, d1 q1, d2 q2.]


P7.9.2 Test Generation

Do either of the circuits generate a maximal-length non-repeating sequence?

P7.9.3 Signature Analyzer

Given a signature analyzer equation of x^2 + x + 1, find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.

P7.9.4 Probability of Catching a Fault

Find the approximate probability of a fault not being detected


If we increase the size of the signature analyzer by one flip flop, by how much do we change the approximate probability of a fault not being detected?

P7.9.6 Detecting a Specific Fault

Determine if a L7@0 is detectable

P7.9.7 Time to Run Test

Find the number of clock cycles to run the test

P7.10 Power and BIST

You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?


P7.11 Timing Hazards and Testability

This question deals with the following circuit:

[Figure: circuit with inputs a, b, c, output z, and fault locations L1–L15.]

1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.

2. Does the circuit have any static timing hazards?

3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify any untestable single-stuck-at faults in the resulting circuit.

P7.12 Testing Short Answer

P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?

If not, explain why. If so, describe such a fault.

P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?


P7.13 Fault Testing

In this question, you will design and analyze built-in self test circuitry for the circuit-under-test shown below.


P7.13.1 Design test generator

Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that it is maximal length.

P7.13.2 Design signature analyzer

Design a signature analyzer circuit for a characteristic polynomial of x + 1.

P7.13.3 Determine if a fault is detectable

Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you’ve designed?

P7.13.4 Testing time

How many clock cycles does your BIST circuitry require to test the circuit under test? Explain how each clock cycle is used.

Chapter 8

Review

This chapter lists the major topics of the term. The “Topics List” section for each major area is meant to be relatively complete.

8.1 Overview of the Term

• The purely digital world

– VHDL

– design and optimization methods

– functional verification

– performance analysis

• Analog effects in the digital world

– timing analysis

– power

– faults and testing


8.2 VHDL

8.2.1 VHDL Topics

• simple syntax and semantics: things that you should know simply by having done the labs and project

• behavioural semantics of VHDL

• synthesis semantics of VHDL

• synthesizable and unsynthesizable code

8.2.2 VHDL Example Problems

• identify whether a particular signal will be the output of combinational circuitry or a flop

• identify whether a particular process is combinational or clocked

• legal, synthesizable, and good code

• perform delta-cycle simulation of VHDL

• perform RTL simulation of VHDL

• identify whether two VHDL fragments have same behaviour

• match VHDL code with waveforms

• match VHDL code with hardware

• choose the VHDL fragment that generates smaller or faster hardware


8.3 RTL Design Techniques

8.3.1 Design Topics

• coding guidelines

• generic FPGA hardware

• area estimation

• finite state machines

– implicit

– explicit-current

– explicit-current+next

• from algorithm to hardware

– dependency graph

– dataflow diagram

– scheduling

– input/output allocation

– register allocation

– datapath allocation

– hardware block diagram

– state machine

• memory dependencies

• memory arrays and dataflow diagrams

8.3.2 Design Example Problems

• choose design guidelines to follow in different situations

• estimate area to implement a circuit in an FPGA

• calculate resource usage for a dataflow diagram

• calculate performance data for a dataflow diagram

• given an algorithm, design a dataflow diagram

• given a dataflow diagram, design the datapath and finite state machine

• optimize a dataflow diagram to improve performance or reduce resource usage

• given a dataflow diagram, calculate the clock period that will result in the maximum performance


8.4 Functional Verification

8.4.1 Verification Topics

• test cases

• measuring coverage

• time for verification

• test benches

• assertions

• coverage monitors

• relational specification

• functional specification

• boundary conditions / corner cases

8.4.2 Verification Example Problems

• choose first cases to test

• identify corner cases

• choose technique to detect bug (test case, assertion/test bench)

• determine whether a code change will cause a bug

• identify a test case and either assertion or test bench to catch a bug


8.5 Performance Analysis and Optimization

8.5.1 Performance Topics

• time to execute a program

• definition of performance

• speedup

• n% faster

• calculating performance of different tasks and of average task

• choosing which task to optimize to best improve overall performance

• cpi calculations

• performance increase over time

• design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)

• CPI calculations

• MIPS calculations

• Clock speed vs. performance

• Optimality — performance / area tradeoffs

8.5.2 Performance Example Problems

• calculate performance / area tradeoffs

• calculate performance / time tradeoffs

• compare performance data between products

• evaluate performance criteria


8.6 Timing Analysis

8.6.1 Timing Topics

• circuit parameters that affect delay

– clock period

– clock skew

– clock jitter

– propagation delay

– load delay

– setup time

– hold time

– clock-to-Q time

• timing analysis of latch

• timing analysis of master-slave flip-flop

• timing analysis of hierarchical storage device

• critical path and false path

– algorithm to find critical path

– algorithm to determine if path is false or critical

– signal assignment to exercise critical path

• Elmore timing model

• derating factors

8.6.2 Timing Example Problems

• timing parameters for minimum clock period

• timing parameters for hold constraint

• find the critical path and assignment to exercise it

• compute Elmore delay constant

• compare accuracy of different timing models

• determine if a storage device will work correctly

• compute timing parameters of storage device

• identify timing violation, suggest remedy

• suggest design change to increase clock speed


8.7 Power

8.7.1 Power Topics

• power vs energy

• equations for power

– dynamic power

– static power

– switching power

– short circuit power

– leakage power

– activity factor

– leakage current

– threshold voltage

– supply voltage

• analog power reduction techniques

• RTL power reduction techniques

– data encoding

– clock gating

8.7.2 Power Example Problems

• predict effect of new fabrication process on power

• predict effect of environment change (temp, supply voltage, etc) on power consumption

• predict effect of design change on power consumption (capacitance, activity factor)

• design data-encoding scheme for a circuit, predict effect on power consumption

• design clock gating scheme for a circuit, predict effect on power consumption

• assess validity of various power- or energy-consumption metrics


8.8 Testing

8.8.1 Testing Topics

• causes of faults

• locations of faults

• physical faults

• single stuck-at fault model

• testable / untestable fault

• economics of testing

• fault coverage

• test vector generation

• order test vectors to reduce test time

• behaviour of a scan chain

• time to run a scan test

• JTAG

• built-in self-test

• linear feedback shift register

• signature analyzer

• Galois fields

• process and time to run a BIST test

8.8.2 Testing Example Problems

• compute optimal amount of testing to maximize profits

• compute coverage for a given set of test vectors

• find test vectors to catch a set of faults, choose order to run test vectors

• determine if a fault is detectable

• choose an LFSR to use for BIST test generation

• choose an LFSR to use for BIST signature analysis

• determine if a given BIST will catch a given fault

• determine probability that a given BIST technique will report that a faulty circuit is correct

• determine if a given fault-testing scheme will detect a physical fault

• match LFSR to characteristic polynomial

• match BIST hardware to Galois mathematics

• perform Galois field mathematics, compare to waveforms


8.9 Formulas to be Given on Final Exam

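$$T = \frac{Ins \times C}{F}$$

$$Pf = \frac{W}{T}$$

$$S = \frac{T_1}{T_2}$$

$$M = \frac{F / 10^6}{\left( \sum_{i=0}^{n} PI_i \times C_i \right)}$$

$$P = \frac{1}{2}\left( A \times C_L \times V^2 \times F \right) + \left( \tau \times A \times V \times I_{Sh} \times F \right) + \left( V \times I_L \right)$$

$$q = 1.60218 \times 10^{-19}\,\mathrm{C}$$

$$k = 1.38066 \times 10^{-23}\,\mathrm{J/K}$$

$$F \propto \frac{(V - V_{Th})^2}{V}$$

$$I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}$$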
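As a quick check on how the first and third formulas are used, here is a worked instance with made-up numbers. It assumes the usual reading of the symbols (Ins is the number of instructions, C is the average CPI, F is the clock frequency, and S is the speedup of design 2 over design 1); the values themselves are hypothetical and are not taken from the notes.

```latex
% Hypothetical design 1: 2e9 instructions, CPI of 1.5, 1 GHz clock
T_1 = \frac{Ins \times C}{F}
    = \frac{(2 \times 10^{9}) \times 1.5}{1 \times 10^{9}\ \mathrm{Hz}}
    = 3\ \mathrm{s}

% Hypothetical design 2: same program and clock, CPI reduced to 1.0
T_2 = \frac{(2 \times 10^{9}) \times 1.0}{1 \times 10^{9}\ \mathrm{Hz}}
    = 2\ \mathrm{s}

% Speedup of design 2 relative to design 1
S = \frac{T_1}{T_2} = \frac{3\ \mathrm{s}}{2\ \mathrm{s}} = 1.5
```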


Part II

Solutions to Assignment Problems


Chapter 1

VHDL Problems

P1.1 IEEE 1164

For each of the values in the list below, answer whether or not it is defined in the ieee.std_logic_1164 library. If it is part of the library, write a 2–3 word description of the value.

Values: '-', '#', '0', '1', 'A', 'h', 'H', 'L', 'Q', 'X', 'Z'.

Answer:

  Value   In std_logic_1164?   Description
  '-'     Yes                  don't care
  '#'     No
  '0'     Yes                  strong 0
  '1'     Yes                  strong 1
  'A'     No
  'h'     No
  'H'     Yes                  weak 1
  'L'     Yes                  weak 0
  'Q'     No
  'X'     Yes                  strong unknown
  'Z'     Yes                  high impedance

NOTE: 'h' is not in the package, because characters are case sensitive. For example 'a' is different than 'A'.
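For reference, the nine values come from the enumeration that the standard package declares for std_ulogic. The sketch below reproduces that declaration inside a hypothetical package named std_logic_sketch, so that it can be compiled on its own without clashing with the real ieee.std_logic_1164.

```vhdl
-- Hypothetical stand-alone package; the real type lives in ieee.std_logic_1164.
package std_logic_sketch is
  type std_ulogic is (
    'U',  -- uninitialized
    'X',  -- strong unknown
    '0',  -- strong 0
    '1',  -- strong 1
    'Z',  -- high impedance
    'W',  -- weak unknown
    'L',  -- weak 0
    'H',  -- weak 1
    '-'   -- don't care
  );
end std_logic_sketch;
```

Any character literal that is not in this list, such as '#', 'A', 'h', or 'Q', is not a std_logic value.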


P1.2 VHDL Syntax

Answer whether each of the VHDL code fragments q2a through q2f is legal VHDL code.

NOTES:
1) “...” represents a fragment of legal VHDL code.
2) For full marks, if the code is illegal, you must explain why.
3) The code has been written so that, if it is illegal, then it is illegal for both simulation and synthesis.

q2a

  architecture main of anchiceratops is
    signal a, b, c : std_logic;
  begin
    process begin
      wait until rising_edge(c);
      a <= if (b = '1') then
             ...
           else
             ...
           end if;


ILLEGAL: if-then-else is a statement, not an expression. You can’t put if-then-else on the right-hand side of an assignment, since it doesn’t produce a value to assign to signal a.
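One legal way to express the same intent is to make the if-then-else the outer statement and put an assignment in each branch. This is a minimal sketch: the entity name anchiceratops_fix and the inputs x and y, which stand in for the elided “...” expressions, are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity anchiceratops_fix is
  port (
    b, c : in  std_logic;
    x, y : in  std_logic;  -- hypothetical stand-ins for the "..." expressions
    a    : out std_logic
  );
end anchiceratops_fix;

architecture main of anchiceratops_fix is
begin
  process begin
    wait until rising_edge(c);
    -- the if statement chooses which assignment executes
    if (b = '1') then
      a <= x;
    else
      a <= y;
    end if;
  end process;
end main;
```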

q2b

  architecture main of tulerpeton is
  begin
    lab: for i in 15 downto 0 loop
      ...
    end loop;
  end main;

ILLEGAL: loop statements are sequential, while architecture bodies contain concurrent statements.


q2c

  architecture main of metaxygnathus is
    signal a : std_logic;
  begin
    lab: if (a = '1') generate
      ...
    end generate;
  end main;

ILLEGAL: the condition for an if-generate statement must be statically determined; testing the value of a signal is dynamic.
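For contrast, an if-generate whose condition is static is legal. The sketch below tests a generic, which is fixed at elaboration time. The entity name metaxygnathus_fix, the generic use_inv, and the port z are hypothetical names chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity metaxygnathus_fix is
  generic ( use_inv : boolean := true );  -- hypothetical static parameter
  port (
    a : in  std_logic;
    z : out std_logic
  );
end metaxygnathus_fix;

architecture main of metaxygnathus_fix is
begin
  -- Exactly one of the two generate blocks is elaborated,
  -- because use_inv is known before simulation or synthesis begins.
  inv_gen : if use_inv generate
    z <= not a;
  end generate;

  buf_gen : if not use_inv generate
    z <= a;
  end generate;
end main;
```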

q2d

  architecture main of temnospondyl is
    component compa
      port (
        a : in std_logic;
        b : out std_logic
      );
    end component;
    signal p, q : std_logic;
  begin
    coma_1 : compa
      port map (a => p, b => q);
    ...
  end main;

LEGAL


q2e

  architecture main of pachyderm is
    function inv(a : std_logic)
      return std_logic is
    begin
      return(NOT a);
    end inv;
    signal p, b : std_logic;
  begin
    p <= inv(b => a);
    ...
  end main;

ILLEGAL: the argument to inv should be (a => b). In function calls and component instantiations, when using named parameter instantiation (as opposed to positional parameter instantiation), the syntax is formal => actual. In the problem, the function definition is inv( a : std_logic ) and the function call is inv( b => a ). Here a is the formal argument and b is the actual argument, so the correct function call would be inv( a => b ).
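A compilable version of the corrected call might look like the sketch below. The entity name pachyderm_fix and its ports are hypothetical; only the function and the formal => actual association come from the problem.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pachyderm_fix is
  port (
    b : in  std_logic;
    p : out std_logic
  );
end pachyderm_fix;

architecture main of pachyderm_fix is
  function inv(a : std_logic) return std_logic is
  begin
    return (not a);
  end inv;
begin
  -- named association: formal parameter a receives actual signal b
  p <= inv(a => b);
end main;
```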

q2f

  architecture main of apatosaurus is
    type state_ty is (S0, S1, S2);
    signal st : state_ty;
    signal p : std_logic;
  begin
    case st is
      when S0 | S1 => p <= '0';
      when others => p <= '1';
    end case;
  end main;

ILLEGAL: case statements are sequential, but the body of an architecture contains concurrent statements.
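Two legal alternatives are to wrap the case statement in a process, or to use the concurrent selected signal assignment. The sketch below shows both. The entity name apatosaurus_fix, the sel input, the decode of sel into st, and the second output q are all hypothetical additions made so that the example is self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity apatosaurus_fix is
  port (
    sel  : in  std_logic_vector(1 downto 0);  -- hypothetical input
    p, q : out std_logic
  );
end apatosaurus_fix;

architecture main of apatosaurus_fix is
  type state_ty is (S0, S1, S2);
  signal st : state_ty;
begin
  -- hypothetical decode so that st has a driver
  st <= S0 when sel = "00" else
        S1 when sel = "01" else
        S2;

  -- Option 1: sequential case statement inside a process
  process (st) begin
    case st is
      when S0 | S1 => p <= '0';
      when others  => p <= '1';
    end case;
  end process;

  -- Option 2: concurrent selected signal assignment
  with st select
    q <= '0' when S0 | S1,
         '1' when others;
end main;
```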


P1.3 Flops, Latches, and Combinational Circuitry

For each of the signals p...z in the architecture main of montevido, answer whether the signal is a latch, combinational gate, or flip-flop.

  entity montevido is
    port (
      a, b0, b1, c0, c1, d0, d1, e0, e1 : in std_logic;
      l : in std_logic_vector (1 downto 0);
      p, q, r, s, t, u, v, w, x, y, z : out std_logic
    );
  end montevido;

  architecture main of montevido is
    signal i, j : std_logic;
  begin
    i <= c0 XOR c1;
    j <= c0 XOR c1;
    process (a, i, j) begin
      if (a = '1') then
        p <= i AND j;
      else
        p <= NOT i;
      end if;
    end process;
    process (a, b0, b1) begin
      if rising_edge(a) then
        q <= b0 AND b1;
      end if;
    end process;
    process (a, c0, c1, d0, d1, e0, e1)
    begin
      if (a = '1') then
        r <= c0 OR c1;
        s <= d0 AND d1;
      else
        r <= e0 XOR e1;
      end if;
    end process;
    process begin
      wait until rising_edge(a);
      t <= b0 XOR b1;
      u <= NOT t;
      v <= NOT x;
    end process;
    process begin
      case l is
        when "00" =>
          wait until rising_edge(a);
          w <= b0 AND b1;
          x <= '0';
        when "01" =>
          wait until rising_edge(a);
          w <= '-';
          x <= '1';
        when "1-" =>
          wait until rising_edge(a);
          w <= c0 XOR c1;
          x <= '-';
      end case;
    end process;
    y <= c0 XOR c1;
    z <= x XOR w;
  end main;


Answer:

  Signal   Latch   Combinational   Flip-flop
  p                X
  q                                X
  r                X
  s        X
  t                                X
  u                                X
  v                                X
  w                                X
  x                                X
  y                X
  z                X

Explanation of why e, which is the output of a flip-flop, has a value at 5ns, which is before the first rising edge of the clock.

Before the first rising edge of the clock, the following assignments will all happen:

  a <= '0';
  b <= '0';
  ...
  ----------- end of delta cycle
  d <= '0'
  ...
  ----------- end of delta cycle
  e <= d;

If you were to implement VHDL code in hardware, e would be the output of aflop, and as such would remain as ’U’ until the first rising edge of the clock.This is a situation where simulating the VHDL code will have slightly differentresults than simulating the hardware. Most questions in ece327 that ask youto compare the behaviour of VHDL code with the behaviour of a circuit willsay to focus on the steady-state behaviour and ignore any differences in thefirst few clock cycles.


P1.4 Counting Clock Cycles

This question refers to the VHDL code shown below.

NOTES:

1. “...” represents a legal fragment of VHDL code

2. assume all signals are properly declared

3. the VHDL code is intended to be legal, synthesizable code

4. all signals are initially 'U'


  entity bigckt is
    port (

    );
  end bigckt;

  architecture main of bigckt is
  begin
    process (a, b)
    begin
      if (a = '0') then
        c <= '0';
      else
        if (b = '1') then
          c <= '1';
        else
          c <= '0';
        end if;
      end if;
    end process;
  end main;

  entity tinyckt is
    port (
      clk : in std_logic;
      i : in std_logic;
      o : out std_logic
    );
  end tinyckt;

  architecture main of tinyckt is
    component bigckt ( ... );
    signal ... : std_logic;
  begin
    p0 : process begin
      wait until rising_edge(clk);
      p0_a <= i;
      wait until rising_edge(clk);
    end process;

    p1 : process begin
      wait until rising_edge(clk);
      p1_b <= p1_d;
      p1_c <= p1_b;
      p1_d <= s2_k;
    end process;

    p2 : process (p1_c, p3_h, p4_i, clk) begin
      if rising_edge(clk) then
        p2_e <= p3_h;
        p2_f <= p1_c = p4_i;
      end if;
    end process;

    p3 : process (i, s4_m) begin
      p3_g <= i;
      p3_h <= s4_m;
    end process;

    p4 : process (clk, i) begin
      if (clk = '1') then
        p4_i <= i;
      else
        p4_i <= '0';
      end if;
    end process;

    huge : bigckt port map (a => p2_e, b => p1_d, c => h_y);

    s1_j <= s3_l;
    s2_k <= p1_b XOR i;
    s3_l <= p2_f;
    s4_m <= p2_f;
  end main;

For each of the pairs of signals below, what is the minimum length of time between when a change occurs on the source signal and when that change affects the destination signal?


Answer:

[Figure: circuit diagram for tinyckt showing the signals i, clk, s2_k, p1_d, p1_b, p1_c, p4_i, p2_f, s4_m, p3_h, and p2_e.]

NOTE: i doesn’t affect the value of p4_i just before a rising edge of clock, so i doesn’t affect p2_e at all along the path that goes through p4_i.

[Figure: circuit fragment and waveform for p1_c, clk, p4_i, p2_f, and i.]

  src    dst    Num clock cycles
  i      p0_a   1 clock cycle
  i      p1_b   2 clock cycles
  i      p1_c   3 clock cycles
  i      p2_e   5 clock cycles
  i      p3_g   same clock cycle
  i      p4_i   same clock cycle
  s4_m   h_y    1 clock cycle
  p1_b   p1_d   1 clock cycle
  p2_f   s1_j   same clock cycle
  p2_f   s2_k   no connection

P1.5 Arithmetic Overflow

Implement a circuit to detect overflow in 8-bit signed addition.

An overflow in addition happens when the carry into the most significant bit is different from the carry out of the most significant bit.

When performing addition, for overflow to happen, both operands must have the same sign. Positive overflow occurs when adding two positive operands results in a negative sum. Negative overflow occurs when adding two negative operands results in a positive sum.

Answer:

We use xor to check if two bits are not-equal.


  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity overflow is
    port (
      num1, num2 : in signed(7 downto 0);
      cin : in std_logic;
      ovrflw : out std_logic
    );
  end overflow;

  architecture main of overflow is
    signal result : signed(7 downto 0);
  begin
    result <= num1 + num2 + ("0000000" & cin);
    -- overflow: operands have the same sign, and the sum has a different sign
    ovrflw <= not (num1(7) xor num2(7))
              and ( num1(7) xor result(7) );
  end main;
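The answer above uses the sign-based rule. The carry-based rule stated at the start of the problem (carry into the most significant bit differs from carry out of it) can be sketched as follows. This is an illustrative alternative under stated assumptions, not part of the original solution: the entity name overflow_carry and the internal signals low, full, and c are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity overflow_carry is
  port (
    num1, num2 : in  signed(7 downto 0);
    cin        : in  std_logic;
    overflow   : out std_logic
  );
end overflow_carry;

architecture main of overflow_carry is
  signal c    : unsigned(0 downto 0);  -- carry-in as a 1-bit number
  signal low  : unsigned(7 downto 0);  -- sum of bits 6..0; low(7) is the carry into the MSB
  signal full : unsigned(8 downto 0);  -- sum of bits 7..0; full(8) is the carry out of the MSB
begin
  c(0) <= cin;
  low  <= ('0' & unsigned(num1(6 downto 0)))
        + ('0' & unsigned(num2(6 downto 0))) + c;
  full <= ('0' & unsigned(num1)) + ('0' & unsigned(num2)) + c;
  -- overflow when the carry into bit 7 differs from the carry out of bit 7
  overflow <= low(7) xor full(8);
end main;
```

Both formulations flag the same cases. The sign-based version needs only the operand and result sign bits, while the carry-based version maps directly onto the carry chain of a ripple adder.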


P1.6 Delta-Cycle Simulation: Pong


INSTRUCTIONS:

1. The simulation is to be done at the granularity of simulation-steps.


3. Each column of the timing diagram corresponds to a simulation step that changes a signalor process.


5. End your simulation just before 20 ns.

  architecture main of pong_machine is
    signal ping_i, ping_n, pong_i, pong_n : std_logic;
  begin

    reset_proc: process
    begin
      reset <= '1';
      wait for 10 ns;
      reset <= '0';
      wait for 100 ns;
    end process;

    clk_proc: process
    begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    next_proc: process (clk)
    begin
      if rising_edge(clk) then
        ping_n <= ping_i;
        pong_n <= pong_i;
      end if;
    end process;

    comb_proc: process (pong_n, ping_n, reset)
    begin
      if (reset = '1') then
        ping_i <= '1';
        pong_i <= '0';
      else
        ping_i <= pong_n;
        pong_i <= ping_n;
      end if;
    end process;

  end main;

Answer:


[Figure: delta-cycle timing diagram with rows for reset, clk, ping_i, ping_n, pong_i, pong_n, the processes reset_proc, clk_proc, next_proc, and comb_proc, and the delta cycle, sim cycle, and sim round markers. Columns are labelled t=0ns, t=0ns+1δ, t=10ns, t=10ns+1δ, and t=10ns+2δ.]

P1.7 Delta-Cycle Simulation: Baku


INSTRUCTIONS:

1. The simulation is to be done at the granularity of simulation-steps.


3. Each column of the timing diagram corresponds to a simulation step.


5. Write “t=5ns” and “t=10ns” at the top of columns where time advances to 5 ns and 10 ns.

6. Begin your simulation at 5 ns (i.e. after the initial simulation cycles that initialize the signals have completed).

7. End your simulation just before 15 ns;


  entity baku is
    port (
      clk, a, b : in std_logic;
      f : out std_logic
    );
  end baku;

  architecture main of baku is
    signal c, d, e : std_logic;
  begin

    proc_clk: process
    begin
      clk <= '0';
      wait for 10 ns;
      clk <= '1';
      wait for 10 ns;
    end process;

    proc_extern : process
    begin
      a <= '0';
      b <= '0';
      wait for 5 ns;
      a <= '1';
      b <= '1';
      wait for 15 ns;
    end process;

    proc_1 : process (a, b, c)
    begin
      c <= a and b;
      d <= a xor c;
    end process;

    proc_2 : process
    begin
      e <= d;
      wait until rising_edge(clk);
    end process;

    proc_3 : process (c, e) begin
      f <= c xor e;
    end process;

  end main;

Answer:

[Figure: delta-cycle timing diagram with rows for clk, a, b, c, d, e, f, the processes proc_clk, proc_extern, proc_1, proc_2, and proc_3, and the delta cycle, simulation cycle, and simulation round markers. Columns are labelled t=5 ns, t=10 ns, and t=15 ns.]

Note: the instruction to end just before 15 ns simply causes us to stop the simulation just before 15 ns. The values on the signals at the end of 10 ns will remain until the next event, which happens at 20 ns.


P1.8 Clock-Cycle Simulation

Given the VHDL code for anapurna and the waveform diagram below, answer what the values of the signals y, z, and p will be at the given times.

  entity anapurna is
    port (
      clk, reset, sel : in std_logic;
      a, b : in unsigned(15 downto 0);
      p : out unsigned(15 downto 0)
    );
  end anapurna;

  architecture main of anapurna is
    type state_ty is (mango, guava, durian, papaya);
    signal y, z : unsigned(15 downto 0);
    signal state : state_ty;
  begin

    proc_herzog: process
    begin
      top_loop: loop
        wait until (rising_edge(clk));
        next top_loop when (reset = '1');
        state <= durian;
        wait until (rising_edge(clk));
        state <= papaya;
        while y < z loop
          wait until (rising_edge(clk));
          if sel = '1' then
            wait until (rising_edge(clk));
            next top_loop when (reset = '1');
            state <= mango;
          end if;
          state <= papaya;
        end loop;
      end loop;
    end process;

    proc_hillary: process (clk)
    begin
      if rising_edge(clk) then
        if (state = durian) then
          z <= a;
        else
          z <= z + 2;
        end if;
      end if;
    end process;

    y <= b;
    p <= y + z;

  end main;

Answer:

[Figure: waveform diagram from 0 to 200 ns for reset, clk, sel, a, b, y, z, state, and p, with the sample times 55ns, 107ns, 147ns, and 195ns marked.]

       55ns   107ns   147ns   195ns
  y    7      5       F       7
  z    U      2       6       A
  p    U      7       15      11


P1.9 VHDL — VHDL Behavioural Comparison: Teradactyl

For each of the VHDL architectures q3a through q3c, does the signal v have the same behaviour as it does in the main architecture of teradactyl?




  entity teradactyl is
    port (
      a : in std_logic;
      v : out std_logic
    );
  end teradactyl;

  architecture main of teradactyl is
    signal m : std_logic;
  begin
    m <= a;
    v <= m;
  end main;

  architecture q3a of teradactyl is
    signal b, c, d : std_logic;
  begin
    b <= a;
    c <= b;
    d <= c;
    v <= d;
  end q3a;

SAME - Intermediate signals are optimized out.

  architecture q3b of teradactyl is
    signal m : std_logic;
  begin
    process (a, m) begin
      v <= m;
      m <= a;


SAME - Putting it in a process doesn’t matter.

  architecture q3c of teradactyl is
    signal m : std_logic;
  begin
    process (a) begin
      m <= a;
    end process;
    process (m) begin
      v <= m;
    end process;
  end q3c;

SAME - Putting it in a separate process doesn’t matter due to the parallel nature of VHDL.


P1.10 VHDL — VHDL Behavioural Comparison: Ichtyostega

For each of the VHDL architectures q4a through q4c, does the signal v have the same behaviour as it does in the main architecture of ichthyostega?




  entity ichthyostega is
    port (
      clk : in std_logic;
      b, c : in signed(3 downto 0);
      v : out signed(3 downto 0)
    );
  end ichthyostega;

  architecture main of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin



      wait until (rising_edge(clk));
      if (cx > 0) then
        v <= bx;
      else



  architecture q4a of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin



      if (cx > 0) then
        wait until (rising_edge(clk));
        v <= bx;
      else
        wait until (rising_edge(clk));
        v <= to_signed(-1, 4);
      end if;
    end process;
  end q4a;

DIFFERENT: evaluations of cx > 0 and v <= bx are separated by a clock cycle.


  architecture q4b of ichthyostega is
    signal bx, cx : signed(3 downto 0);
  begin
    process begin
      wait until (rising_edge(clk));
      bx <= b;
      cx <= c;
      wait until (rising_edge(clk));
      if (cx > 0) then
        v <= bx;
      else



DIFFERENT: each assignment statement (e.g. bx <= b) will execute every other clock cycle, rather than every clock cycle.

  architecture q4c of ichthyostega is
    signal bx, cx, dx : signed(3 downto 0);
  begin
    process begin



      wait until (rising_edge(clk));
      v <= dx;
    end process;
    dx <= bx when (cx > 0)
          else to_signed(-1, 4);
  end q4c;

SAME


P1.11 Waveform — VHDL Behavioural Comparison

Answer whether each of the VHDL code fragments q3a through q3f has the same behaviour as the timing diagram.

NOTES:
1) “Same behaviour” means that the signals a, b, and c have the same values at the end of each clock cycle in steady-state simulation (ignore any irregularities in the first few clock cycles).
2) For full marks, if the code does not match, you must explain why.
3) Assume that all signals, constants, variables, types, etc are properly defined and declared.
4) All of the code fragments are legal, synthesizable VHDL code.

[Figure: timing diagram for clk, a, b, and c.]

q3a

  architecture q3a of q3 is
  begin
    process begin
      a <= '1';
      loop
        wait until rising_edge(clk);
        a <= NOT a;
      end loop;
    end process;
    b <= NOT a;
    c <= NOT b;
  end q3a;

SAME

q3b

  architecture q3b of q3 is
  begin
    process begin
      b <= '0';
      a <= '1';
      wait until rising_edge(clk);
      a <= b;
      b <= a;
      wait until rising_edge(clk);
    end process;
    c <= a;
  end q3b;

SAME

q3c

  architecture q3c of q3 is
  begin
    process begin
      a <= '0';
      b <= '1';
      wait until rising_edge(clk);
      b <= a;
      a <= b;
      wait until rising_edge(clk);
    end process;
    c <= NOT b;
  end q3c;

SAME

q3d

  architecture q3d of q3 is
  begin
    process (b, clk) begin
      a <= NOT b;
    end process;
    process (a, clk) begin
      b <= NOT a;
    end process;
    c <= NOT b;
  end q3d;

DIFFERENT: this code has combinational loops

q3e

  architecture q3e of q3 is
  begin
    process
    begin
      b <= '0';
      a <= '1';
      wait until rising_edge(clk);
      a <= c;
      b <= a;
      wait until rising_edge(clk);
    end process;
    c <= not b;
  end q3e;

DIFFERENT: a is a constant '1'

q3f

  architecture q3f of q3 is
  begin
    process begin
      a <= '1';
      b <= '0';
      c <= '1';
      wait until rising_edge(clk);
      a <= c;
      b <= a;
      c <= NOT b;
      wait until rising_edge(clk);
    end process;
  end q3f;

DIFFERENT: a and c are constant '1'


P1.12 Hardware — VHDL Comparison

For each of the circuits q2a–q2d, answer whether the signal d has the same behaviour as it does in the main architecture of q2.

  entity q2 is
    port (
      a, clk, reset : in std_logic;
      d : out std_logic
    );
  end q2;

  architecture main of q2 is
    signal b, c : std_logic;
  begin
    b <= '0' when (reset = '1')
         else a;


        c <= b;
        d <= c;
      end if;
    end process;
  end main;

[Figure: four candidate circuits, q2a, q2b, q2c, and q2d, each built from the inputs clk, a, reset, and a constant 0, with output d.]

Answer:

q2a: a shouldn’t be flopped.
q2b: One too many FFs.
q2c: Correct operation.
q2d: This will work (i.e. it has the same input-output characteristics) but the internal description is different.


P1.13 8-Bit Register

Implement an 8-bit register that has:

• clock signal clk

• input data vector d

• output data vector q

• synchronous active-high input reset

• synchronous active-high input enable

Answer:

  library ieee;
  use ieee.std_logic_1164.all;

  entity reg_8 is
    port (
      clk, reset, enable : in std_logic;
      d : in std_logic_vector (7 downto 0);
      q : out std_logic_vector (7 downto 0)
    );
  end reg_8;

  architecture main of reg_8 is
  begin
    reg: process
    begin
      wait until (rising_edge(clk));
      if reset = '1' then
        q <= (others => '0');
      elsif enable = '1' then
        q <= d;
      end if;
    end process reg;
  end main;

P1.13.1 Asynchronous Reset

Modify your design so that the reset signal is asynchronous, rather than synchronous.


Answer:

  reg : process (clk, reset)
  begin
    if reset = '1' then
      q <= (others => '0');
    elsif rising_edge(clk) then
      if enable = '1' then
        q <= d;
      end if;
    end if;
  end process reg;

P1.13.2 Discussion

Describe the tradeoffs in using synchronous versus asynchronous reset in a circuit implemented on an FPGA.

Answer:

Synchronous resets lead to more robust designs. With an asynchronous reset, a flop is reset whenever the reset signal arrives. Due to wire delays, signals will arrive at different flops at different times. If an asynchronous reset occurs at about the same time as a clock edge, some flops might be reset in one clock cycle and some in the next. This can lead to glitches and/or illegal values on internal state signals.

The tradeoff is that asynchronous reset is often easier to code in VHDL and requires less hardware to implement.

P1.13.3 Testbench for Register

Write a test bench to validate the functionality of the 8-bit register with synchronous reset.

Answer:



library ieee;
use ieee.std_logic_1164.all;

entity reg_8_tb is
end reg_8_tb;

architecture main of reg_8_tb is
  component reg_8 is
    port (
      clk    : in  std_logic;
      reset  : in  std_logic;
      enable : in  std_logic;
      d      : in  std_logic_vector (7 downto 0);
      q      : out std_logic_vector (7 downto 0)
    );
  end component;
  signal clk, reset, enable : std_logic;
  signal d, q : std_logic_vector(7 downto 0);
begin

  uut : reg_8 port map (
    clk    => clk,
    reset  => reset,
    enable => enable,
    d      => d,
    q      => q
  );

  -- stimulus: the clock is toggled by hand every 20 ns
  process begin
    clk <= '1'; reset <= '0';
    wait for 20 ns; -- time=20 ns
    clk <= '0'; reset <= '1'; enable <= '1'; d <= "10101011";
    wait for 20 ns; -- time=40 ns
    clk <= '1'; reset <= '0';
    wait for 20 ns; -- time=60 ns
    clk <= '0'; enable <= '0'; d <= "00001011";
    wait for 20 ns; -- time=80 ns
    clk <= '1';
    wait for 20 ns; -- time=100 ns
    clk <= '0'; enable <= '1';
    wait for 20 ns; -- time=120 ns
    clk <= '1';
    wait for 20 ns; -- time=140 ns
  end process;
end main;
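The testbench above only drives stimulus; the waveform still has to be inspected by hand. As a rough sketch of how the same checks could be automated, the variant below adds assert statements. The entity name reg_8_check_tb, the constant period, the signal done, and the specific checks are illustrative assumptions rather than part of the original answer, and the sketch assumes reg_8 has been compiled into library work.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical self-checking testbench for reg_8 (synchronous reset version).
entity reg_8_check_tb is
end reg_8_check_tb;

architecture sketch of reg_8_check_tb is
  constant period : time := 40 ns;  -- assumed clock period
  signal clk    : std_logic := '0';
  signal reset  : std_logic := '0';
  signal enable : std_logic := '0';
  signal d, q   : std_logic_vector(7 downto 0) := (others => '0');
  signal done   : boolean := false;
begin

  -- direct entity instantiation; assumes reg_8 is compiled into library work
  uut : entity work.reg_8
    port map (clk => clk, reset => reset, enable => enable, d => d, q => q);

  -- free-running clock that stops once the stimulus process has finished
  clk <= not clk after period / 2 when not done else '0';

  stim : process
  begin
    -- synchronous reset: q must be cleared by the next rising edge
    reset <= '1';
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "00000000" report "reset did not clear q" severity error;

    -- with enable high, q must take the value of d
    reset <= '0'; enable <= '1'; d <= "10101011";
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "10101011" report "enabled load failed" severity error;

    -- with enable low, q must hold its value
    enable <= '0'; d <= "00001011";
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "10101011" report "q changed while enable was low" severity error;

    -- reset has priority over enable
    reset <= '1'; enable <= '1';
    wait until rising_edge(clk);
    wait for 1 ns;
    assert q = "00000000" report "reset did not override enable" severity error;

    done <= true;
    wait;
  end process stim;
end sketch;
```

Inputs are changed 1 ns after each rising edge so that the register never sees an input change in the same simulation cycle as the clock edge that samples it.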



P1.14 Synthesizable VHDL and Hardware

For each of the fragments of VHDL q4a...q4f, answer whether the code is synthesizable. If the code is synthesizable, draw the circuit most likely to be generated by synthesizing the datapath of the code. If the code is not synthesizable, explain why.

q4a

process begin
  wait until rising_edge(a);
  e <= d;
  wait until rising_edge(b);
  e <= NOT d;
end process;

Answer:

Unsynthesizable: different conditions in wait statements in the same process. This would lead to a single flip-flop requiring multiple clock signals.

q4b

process begin
  while (c /= '1') loop
    if (b = '1') then
      wait until rising_edge(a);
      e <= d;
    else
      e <= NOT d;
    end if;
  end loop;
  e <= b;
end process;

Answer:

Unsynthesizable: while loop around code where some paths have wait statements and some do not. Even having a while loop with a dynamic condition around code without a wait statement would be unsynthesizable, because it would lead to combinational loops in the hardware.

q4c

process (a, d) begin
  e <= d;
end process;
process (a, e) begin
  if rising_edge(a) then
    f <= NOT e;
  end if;
end process;

Answer:

Flop with inverter on input.

[Figure: circuit diagram with signals d, e, a, and f]

q4d

process (a) begin
  if rising_edge(a) then
    if b = '1' then
      e <= '0';
    else
      e <= d;
    end if;
  end if;
end process;


Answer:

Synchronous reset (AND with bubble). The Reset pin on a flip-flop is generally asynchronous, so a flop with a reset pin would be incorrect.

[Figure: circuit diagram with signals b, d, a, and e]

q4e

process (a,b,c,d) begin
  if rising_edge(a) then
    e <= c;
  else
    if (b = '1') then
      e <= d;
    end if;
  end if;
end process;

Answer:

Unsynthesizable: an if rising_edge with an else clause is unsynthesizable because it requires a signal (the select signal for the multiplexer) to detect a rising edge.

q4f

process (a,b,c) begin
  if (b = '1') then
    e <= '0';
  else
    if rising_edge(a) then
      e <= c;
    end if;
  end if;
end process;

Answer:

Flop with asynchronous reset.

[Figure: flip-flop with data input c, clock a, reset pin R driven by b, and output e]


P1.15 Datapath Design

Each of the three VHDL fragments q4a–q4c is intended to be the datapath for the same circuit. The circuit is intended to perform the following sequence of operations (not all operations are required to use a clock cycle):

• read in source and destination addresses from i_src1, i_src2, i_dst

• read operands op1 and op2 from memory

• compute sum of operands sum

• write sum to memory at destination address dst

• write sum to output o_result

[Figure: block diagram with inputs i_src1, i_src2, i_dst, and clk, and output o_result]

P1.15.1 Correct Implementation?

For each of the three fragments of VHDL q4a–q4c, answer whether it is a correct implementation of the datapath. If the datapath is not correct, explain why. If the datapath is correct, answer in which cycle you need load='1'.

NOTES:

1. You may choose the number of clock cycles required to execute the sequence of operations.

2. The cycle in which the addresses are on i_src1, i_src2, and i_dst is cycle #0.

3. The control circuitry that controls the datapath will output a signal load, which will be '1' when the sum is to be written into memory.

4. The code fragment with the signal declarations, connections for inputs and outputs, and the instantiation of memory is to be used for all three code fragments q4a–q4c.

5. The memory has registered inputs and combinational (unregistered) outputs.

6. All of the VHDL is legal, synthesizable code.


-- This code is to be used for
-- all three code fragments q4a--q4c.
signal state : std_logic_vector(3 downto 0);
signal src1, src2, dst, op1, op2, sum,
       mem_in_a, mem_out_a, mem_out_b,
       mem_addr_a, mem_addr_b : unsigned(7 downto 0);
...
process (clk)
begin
  if rising_edge(clk) then
    src1 <= i_src1;
    src2 <= i_src2;
    dst <= i_dst;
    o_result <= sum;
  end if;
end process;
mem : ram256x16d
  port map (clk      => clk,
            i_addr_a => mem_addr_a,
            i_addr_b => mem_addr_b,
            i_we_a   => mem_we,
            i_data_a => mem_in_a,
            o_data_a => mem_out_a,
            o_data_b => mem_out_b);


q4a

op1 <= mem_out_a when state = "0010"
       else (others => '0');
op2 <= mem_out_b when state = "0010"
       else (others => '0');
sum <= op1 + op2 when state = "0100"
       else (others => '0');
mem_in_a <= sum when state = "1000"
            else (others => '0');
mem_addr_a <= dst when state = "1000"
              else src1;
mem_we <= '1' when state = "1000"
          else '0';
mem_addr_b <= src2;
process (clk)
begin
  if rising_edge(clk) then
    if (load = '1') then
      state <= "1000";
    else
      -- rotate state vector one bit to left
      state <= state(2 downto 0) & state(3);
    end if;
  end if;
end process;

Answer:

The circuit is not correct: all of the signals are combinational. Also, there could be initialization problems with state.

q4b


process (clk)
begin
  if rising_edge(clk) then
    op1 <= mem_out_a;
    op2 <= mem_out_b;
  end if;
end process;
sum <= op1 + op2;
mem_in_a <= sum;
mem_we <= load;
mem_addr_a <= dst when load = '1'
              else src1;
mem_addr_b <= src2;


Answer:

• The circuit is correct.

• load = '1' in clock cycle 3.
  0. inputs available
  1. src1; mem_addr_{a,b}
  2. mem_out_{a,b}
  3. op{1,2}; sum; mem_in_a; load


q4c

process
begin
  wait until rising_edge(clk);
  op1 <= mem_out_a;
  op2 <= mem_out_b;
  sum <= op1 + op2;
  mem_in_a <= sum;
end process;
process (load, dst, src1) begin
  if load = '1' then
    mem_addr_a <= dst;
  else
    mem_addr_a <= src1;
  end if;
end process;
mem_addr_b <= src2;

Answer:

If the code is taken exactly as is:

• the circuit is incorrect, because mem_we is missing.

If we assume that mem_we is added:

• The circuit is correct.

• Need load = '1' in cycle 5.
  0. inputs available
  1. src1; mem_addr_{a,b}
  2. mem_out_{a,b}
  3. op{1,2}
  4. sum
  5. mem_in_a; load

P1.15.2 Smallest Area

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the smallest area.

If you don’t have sufficient information to predict the relative areas, explain what additional information you would need to predict the area prior to synthesizing the designs.


Answer:

Assuming that q4c includes mem_we:

All of the circuits have an adder, memory, input flops, output flops, and a mux for mem_addr_a. The differences are in the flops and misc circuitry:

For q4a, each of the signals op1, op2, sum, mem_in_a, and mem_we is assigned either zero or the value of another signal, depending on the state. Because one of the inputs is a constant (0), we can implement this with an AND gate rather than a mux. Each bit of each signal requires one AND gate. We have five signals of eight bits each, therefore we need 5∗8 = 40 AND gates.

        q4a   q4b   q4c
flops   1*4   2*8   4*8
ands    5*8   0     0

From this analysis, q4a has the smallest area. There is the implicit assumption that an AND gate is much smaller than a FF.

P1.15.3 Shortest Clock Period

Of all of the circuits (q4a–q4c), including both correct and incorrect circuits, predict which will have the shortest clock period.

If you don’t have sufficient information to predict the relative periods, explain what additional information you would need to predict the period prior to performing any synthesis or timing analysis of the designs.

Answer:

• Assuming that the memory is not on the critical path, q4c has the shortest clock period, because it does the least amount of computation between flip flops: all of the signals are flopped.

Chapter 2

Design Problems

P2.1 Synthesis

This question is about using VHDL to implement memory structures on FPGAs.

P2.1.1 Data Structures

If you have to write your own code (i.e. you do not have a library of memory components or a special component generation tool such as LogiBlox or CoreGen), what data structures in VHDL would you use when creating a register file?

P2.1.2 Own Code vs Libraries

When using VHDL for an FPGA, under what circumstances is it better to write your own VHDL code for memory, rather than instantiate memory components from a library?

P2.2 Design Guidelines

While you are grocery shopping you encounter your co-op supervisor from last year. She’s now forming a startup company in Waterloo that will build digital circuits. She’s writing up the design guidelines that all of their projects will follow. She asks for your advice on some potential guidelines.

What is your response to each question? What is your justification for your answer? What are the tradeoffs between the two options?


0. Sample: Should all projects use silicon chips, or should all use biological chips, or should each project choose its own technique?

Answer: All projects should use silicon based chips, because biological chips don’t exist yet. The tradeoff is that if biological chips existed, they would probably consume less power than silicon chips.

1. Should all projects use an asynchronous reset signal, or should all use a synchronous reset signal, or should each project choose its own technique?

Answer:

Synchronous reset: Synchronous reset leads to more robust designs. With asynchronous reset, a flop is reset whenever the reset signal arrives. Due to wire delays, signals will arrive at different flops at different times. If an asynchronous reset occurs at about the same time as a clock edge, some flops might be reset in one clock cycle and others in the next. This can lead to glitches and/or illegal values on internal state signals.

The tradeoff is that asynchronous reset is often easier to code in VHDL and requires less hardware to implement.

2. Should all projects use latches, or should all projects use flip-flops, or should each project choose its own technique?

Answer:

Flops: Flip flops lead to more robust designs than latches. Latches are level sensitive and act as wires when enabled. For a latch based design to work correctly, there cannot be any overlap in the time when a consecutive pair of latches are enabled. If this happens, the value on a signal will “leak” through the latch and arrive at the next set of latches one clock phase too early. Thus, latch based designs are more sensitive to the timing of clock signals. Another disadvantage of latches is that some FPGAs and cell libraries do not support them. In comparison, D-type flip flops are almost always supported.

The tradeoff is that latches are smaller and faster than flip flops. A common implementation of a flip-flop is a pair of latches in a master/slave combination.

3. Should all chips have registers on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By “register” we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.


Answer:

Flops on outputs and inputs: Putting flops on inputs and outputs will make the clock speed of the chip less dependent on the propagation delay between chips. Flops can also be used to isolate the internals of the chip from glitches and other anomalous behaviour that can occur on the boards.

The tradeoff is that flops consume area and will increase the latency through the chip.

4. Should all circuit modules on all chips have flip-flops on the inputs and outputs or should chips have the inputs and outputs directly connected to combinational circuitry, or should each project choose its own technique? By “register” we mean either flip-flops or latches, based upon your answer to the previous question. If your answer is different for inputs and outputs, explain why.

Answer:

Each project should adopt a convention of either using flops on inputs of modules or outputs of modules. It is rarely necessary to put flops on both inputs and outputs of modules on the same chip. This is because the wire delay between modules is usually less than a clock period. Putting flops on either the inputs or outputs is advantageous because it provides a standard design convention that makes it easier to glue modules together without violating timing constraints. If modules were allowed to have combinational circuitry on both inputs and outputs, the maximum clock speed of the design could not be determined until all of the modules were glued together.

The tradeoff is that flops add area and latency. Sometimes there will be two modules where the combinational circuitry on the outputs of one can be combined with the combinational circuitry on the inputs of the second without violating timing constraints. This discipline prevents that optimization.

Aside: Sometimes, to meet performance targets, in situations such as this, a project will remove or move the flops between modules and do “clock borrowing” to fit the maximum amount of circuitry into a clock period. This is a rather low-level optimization that happens late in the design cycle. It can cause big headaches for functional validation and equivalence verification, because the specifications for modules are no longer clean and the boundaries between modules on the low-level design might be different from the boundaries in the high-level design.

5. Should all projects use tri-state buffers, or should all projects use multiplexors, or should each project choose its own technique?

Answer:

Multiplexors: Multiplexors lead to more robust designs. Tri-state buffers rely on analog characteristics of devices to work correctly, and can work incorrectly in the presence of voltage fluctuations or fabrication process variations. Multiplexors work on a purely Boolean level and as such are less sensitive to changes in voltages or fabrication processes.

The tradeoff is that tri-state buffers are smaller and faster than multiplexors. It should be noted that some designs require tri-state buffers, especially circuits that use a shared bus among many devices. Bi-directional pins, shared busses, and semiconductor memories all need tri-state buffers to work correctly.


Use the dataflow diagram below to answer problems P2.3.1 and P2.3.2.

[Figure: dataflow diagram with inputs a, b, c, f and g components, and signals d and e]

P2.3.1 Resource Usage

List the number of items for each resource used in the dataflow diagram.


Answer:

input ports     3
output ports    1
registers       4
f components    2
g components    1

P2.3.2 Optimization

Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:


• you may not increase the resource usage (input ports, registers, output ports, f components, g components)


Answer:

[Figure: optimized dataflow diagram with inputs a, b, c, f and g components, and signals d and e]


P2.4 Dataflow Diagram Design

Your manager has given you the task of implementing the following pseudocode in an FPGA:

if is_odd(a + d)
    p = (a + d) * 2 + ((b + c) - 1)/4;
else
    p = (b + c) * 2 + d;

NOTES:

1) You must use registers on all input and output ports.
2) p, a, b, c, and d are to be implemented as 8-bit signed signals.
3) A 2-input 8-bit ALU that supports both addition and subtraction takes 1 clock cycle.
4) A 2-input 8-bit multiplier or divider takes 4 clock cycles.
5) A small amount of additional circuitry (e.g. a NOT gate, an AND gate, or a MUX) can be squeezed into the same clock cycle(s) as an ALU operation, multiply, or divide.
6) You can require that the environment provides the inputs in any order and that it holds the input signals at the same value for multiple clock cycles.

P2.4.1 Maximum Performance

What is the minimum number of clock cycles needed to implement the pseudocode with a circuit that has two input ports?

Answer:

Optimizations:

• Multiplication by a constant power of 2 can be done without hardware: just connect the wires between the signals. For example, if we have a <= b * 2;, we can do this with a(0) <= '0'; a(1) <= b(0); a(2) <= b(1); etc.

• Testing if a signal is odd or even can be done simply by extracting the least significant bit of the signal.
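The two optimizations can be sketched in VHDL as follows. This is only an illustration under assumptions: the entity name shift_parity_sketch and the port is_odd are hypothetical, and the overflow bit of the doubled value is simply discarded.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration of a wired multiply-by-2 and an odd/even test.
entity shift_parity_sketch is
  port (
    b      : in  signed(7 downto 0);
    a      : out signed(7 downto 0);  -- b * 2, overflow bit discarded
    is_odd : out std_logic            -- '1' when b is odd
  );
end shift_parity_sketch;

architecture main of shift_parity_sketch is
begin
  -- multiply by 2: each bit moves up one position and a zero enters at bit 0;
  -- this is pure wiring, no gates are needed
  a <= b(6 downto 0) & '0';

  -- odd/even test: the least significant bit of a two's-complement number
  is_odd <= b(0);
end main;
```

Neither assignment infers any logic, which is why the shift does not need a multiplier or an extra clock cycle.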


[Figure: Data flow for odd case (inputs a, b, c, d; constant 1)]

[Figure: Data flow for even case (inputs b, c, d)]

The even flow requires 4 clock cycles (3 cycles in the datapath plus one more because we have to have flops on both inputs and outputs). Therefore the total design will require 4 clock cycles.

What is the minimum number of ALUs, multipliers, and dividers needed to achieve the minimum number of clock cycles that you just calculated?

Answer:


[Figure: Dataflow for entire circuit (inputs a, b, c, d; constant -1; xor and and gates; wired shift left by 1)]

4 clock cycles
2 ALUs
0 dividers
0 multipliers

P2.4.2 Minimum area

What is the minimum number of datapath storage registers (8, 6, 4, and 1 bit) and clock cycles needed to implement the pseudocode if the circuit can have at most one ALU, one multiplier, and one divider?

Answer:


[Figure: dataflow diagram (inputs a, b, c, d; constant -1; and gate; wired shift left by 1)]

5 clock cycles
3 8b regs
0 6b regs
0 4b regs
0 1b regs

P2.5 Michener: Design and Optimization

Design a circuit named michener that performs the following operation:

z = (a+d) + ((b - c) - 1)

NOTES:

1. Optimize your design for area.

2. You may schedule the inputs to arrive at any time.

3. You may do algebraic transformations of the specification.

Answer:


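[Figure: Data-dependency graph (inputs a, d, b, c; constant 1; two + operations; output z)]

[Figure: Dataflow diagram (inputs a, b, c, d; constant 1; two + operations; output z)]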

P2.6 Dataflow Diagrams with Memory Arrays


NOTES:

1. The inputs of the algorithms are a and b.

2. The outputs of the algorithms are p and q.


4. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).


6. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one read port.

7. M supports synchronous write and asynchronous read.

8. Assume all memory address and other arithmetic calculations are within the range ofrepresentable numbers (i.e. no overflows occur).


10. You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

P2.6.1 Algorithm 1

Algorithm

q = M[b];
M[a] = b;
p = M[b+1] * a;

Assuming a ≤ b, draw a dataflow diagram that is optimized for the fastest overall execution time.

Answer:

1. a ≤ b means that the addresses (a and b+1) are not equal to each other, which allows writing to M[a] to be done in parallel with reading from M[b+1]. We must read from M[b] before we write to M[a], because it could be that b and a are the same address.

2. Initial dataflow diagram:

[Figure: initial dataflow diagram (inputs a, b, M; constant 1; two M(rd) operations and one M(wr) operation; outputs q, p, M)]

3. Find the critical path


[Figure: the dataflow diagram annotated with component delays of 25 ns, 60 ns, and 65 ns; the longest path totals 150 ns]

Critical path is from b to p: 150ns.

4. Explore performance with different clock periods

[Figure: the dataflow diagram scheduled with component delays of 25 ns, 60 ns, and 65 ns plus 5 ns per register: period 70 ns, latency 4 cycles, time 280 ns]

[Figure: a second schedule of the same dataflow diagram with the same component and register delays]


5. Minimum latency is 3 clock cycles because we can’t do all memory operations in parallel and we need registers on both inputs and outputs.

6. Best performance is with a clock period of 90 ns.

7. Resource usage:

Component        Quantity
Input            1
Output           1
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     90 ns
Latency          3 cycles
Execution Time   270 ns
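The comparison between the two clock periods follows directly from execution time being the clock period multiplied by the latency in cycles. As a short worked check, using only the figures quoted in this answer (the symbol names are introduced here for illustration):

```latex
\begin{align*}
T_{\text{exec}} &= T_{\text{clk}} \times N_{\text{cycles}} \\
T_{\text{exec}}(70\,\text{ns}) &= 70\,\text{ns} \times 4 = 280\,\text{ns} \\
T_{\text{exec}}(90\,\text{ns}) &= 90\,\text{ns} \times 3 = 270\,\text{ns}
\end{align*}
```

The longer clock period wins because it saves a whole cycle of latency, which outweighs the slower clock.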


P2.6.2 Algorithm 2

q = M[b];
M[a] = q;
p = (M[b-1] * b) + M[b];

Assuming a > b, draw a dataflow diagram that is optimized for the fastest overall execution time.

Answer:

1. a > b means that a ≠ b and a ≠ b-1, so there are no memory address conflicts to create dependencies. There is a data dependency through q from M[b] to M[a]. The resource constraint of the dual-port memory array also prevents us from doing all three memory operations in parallel.

2. Explore performance with different clock periods

[Figure: two candidate schedules of the dataflow diagram (inputs a, b, M; constant 1; two M(rd) operations and one M(wr) operation; outputs q, p, M), annotated with delays of 5 ns, 25 ns, 30 ns, 60 ns, and 65 ns]


3. Area optimization: change b - 1 to b + (-1).


[Figure: the dataflow diagram after the area optimization, with the constant 1 replaced by -1, annotated with delays of 5 ns, 25 ns, 60 ns, and 65 ns]

4. Resource usage:

Component        Quantity
Input            1
Output           1
Register         5 (including mem array)
Adder            1
Memory read      2
Memory write     1
Multiplication   1
Clock Period     95 ns
Latency          3 cycles
Execution Time   285 ns

P2.7 2-bit adder

This question compares FPGA and generic-gate implementations of a 2-bit full adder.

P2.7.1 Generic Gates

Show the implementation of a 2 bit adder using NAND, NOR, and NOT gates.

P2.7.2 FPGA

Show the implementation of a 2 bit adder using generic FPGA cells; show the equations for the lookup tables.


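[Figure: two generic FPGA cells, each with a combinational block (comb) and a flip-flop (CE, S, R, D, Q); inputs a[0], b[0], a[1], b[1], c_in; outputs sum[0], sum[1], c_out; internal signal carry_1]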

P2.8 Sketches of Problems

1. calculate resource usage for a dataflow diagram (input ports, output ports, registers, datapath components)

2. calculate performance data for a dataflow diagram (clock period and number of cycles to execute (CPI))

3. given a dataflow diagram, calculate the clock period that will result in the optimum performance

4. given an algorithm, design a dataflow diagram

5. given a dataflow diagram, design the datapath and finite state machine

6. optimize a dataflow diagram to improve performance or reduce resource usage

7. given fsm diagram, pick VHDL code that “best” implements diagram (correct behaviour, simple, fast hardware) or critique hardware

Chapter 3

Functional Verification Problems

P3.1 Carry Save Adder

1. Functionality: Briefly describe the functionality of a carry-save adder.

2. Testbench: Write a testbench for a 16-bit combinational carry save adder.

3. Testbench Maintenance: Modify your testbench so that it is easy to change the width of the adder and the latency of the computation.

NOTES:

(a) You do not need to support pipelined adders.

(b) VHDL “generics” might be useful.

P3.2 Traffic Light Controller

P3.2.1 Functionality

Briefly describe the functionality of a traffic-light controller that has sensors to detect the presence of cars.

Answer:

Given a normal traffic light, which spends a constant amount of time as green in each direction, add the following two transitions to the system:

1. If the less-busy road does not have any cars present for t1 minutes, then transition the traffic light to make the busier of the two roads green.

2. If the busy road has a car waiting for t2 minutes, then transition the traffic light to make the busier of the two roads green.


P3.2.2 Boundary Conditions

Make a list of boundary conditions to check for your traffic light controller.

Answer:

1. A car arrives at the intersection and triggers the sensor, but makes a right turn before the light turns green in its direction. Should the light turn to green in the direction of the now vacant road, or stay green in the current direction?

2. Same as 1, but the car makes a right turn after the other road already has a yellow light. Should the light turn to green in the direction of the now vacant road, or transition from yellow back to green, or very briefly stay green in the vacant direction?

3. If the less-busy road is yellow, there’s no car at the busy road, and a car arrives at the less busy road. Same questions as the first two situations.

P3.2.3 Assertions

Make a list of assertions to check for your traffic light controller.

Answer:

1. if a light is green, the next colour will be yellow
2. if a light is yellow, the next colour will be red
3. if a light is red, the next colour will be green
4. if no car has been at the less-busy road for at least t1 minutes then the less-busy road is red.
5. if the car sensor has been continuously “on” for the busy road for at least t2 minutes then the busy road is green.
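Assertions 1 to 3 can be checked mechanically during simulation. The sketch below is one possible checker, not part of the original answer: the entity name light_order_check, the two-bit colour encoding, and the signals prev and seen are all assumptions made for illustration, and the timing assertions 4 and 5 are not covered.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical checker for the colour-ordering assertions of one light.
entity light_order_check is
  port (
    clk    : in std_logic;
    colour : in std_logic_vector(1 downto 0)  -- assumed encoding, see constants
  );
end light_order_check;

architecture sketch of light_order_check is
  constant green  : std_logic_vector(1 downto 0) := "00";
  constant yellow : std_logic_vector(1 downto 0) := "01";
  constant red    : std_logic_vector(1 downto 0) := "10";
  signal prev : std_logic_vector(1 downto 0) := red;
  signal seen : boolean := false;  -- true once prev holds a sampled colour
begin
  check : process (clk)
  begin
    if rising_edge(clk) then
      if seen and colour /= prev then
        -- a change of colour must follow green -> yellow -> red -> green
        assert (prev = green  and colour = yellow)
            or (prev = yellow and colour = red)
            or (prev = red    and colour = green)
          report "illegal colour transition" severity error;
      end if;
      prev <= colour;
      seen <= true;
    end if;
  end process check;
end sketch;
```

One instance of the checker would be attached to each light in the testbench; it stays silent while a colour is held and reports only when a colour change breaks the required order.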


P3.3 State Machines and Verification

P3.3.1 Three Different State Machines

[Figure 3.1: A very simple machine (states s0–s3; transitions labelled 1/0, 0/0, */0, */0, */1)]

[Figure 3.2: A very big machine (states s0–s9; nine transitions labelled */0 and one labelled */1)]

[Figure 3.3: A concurrent machine (one machine with states s0–s2 and one with states q0–q4)]

[Figure 3.4: Legend. Transitions are labelled input/output; * = don’t care]

Answer each of the following questions for the three state machines in figures 3.1–3.3.

P3.3.1.1 Number of Test Scenarios

How many “test scenarios” (sequences of test vectors) would you need to fully validate the behaviour of the state machine?


P3.3.1.2 Length of Test Scenario

What is the maximum length (number of test vectors) in a test scenario for the state machine?

P3.3.1.3 Number of Flip Flops

Assuming that neither the inputs nor the outputs are registered, what is the minimum number of flip-flops needed to implement the state machine?

Answer:

Figure 3.1: 12 scenarios, max len 4, min flops 2

      sequence      expected behaviour
   1) 000           s0, s2, s3, s0
   2) 001           s0, s2, s3, s0
   3) 010           s0, s2, s3, s0
   4) 011           s0, s2, s3, s0
   5) 1000          s0, s1, s2, s3, s0
   6) 1001          s0, s1, s2, s3, s0
   ...
  12) 1111          s0, s1, s2, s3, s0

Figure 3.2: 1024 scenarios, max len 10, min flops 4

      sequence      expected behaviour
   1) 0000000000    s0, s1, s2 ..., s9, s0
   2) 0000000001    s0, s1, s2 ..., s9, s0
   ...
1024) 1111111111    s0, s1, s2 ..., s9, s0

Figure 3.3: 2^15 scenarios, max len 15, min flops 5 or 4

      sequence      expected behaviour
   1) 0...00        (s0,q0), (s1,q1), (s2,q2), (s0,q3), (s1,q4), (s2,q0),
                    (s0,q1), (s1,q2), (s2,q3), (s0,q4), (s1,q0), (s2,q1),
                    (s0,q2), (s1,q3), (s2,q4), (s0,q0)
   2) 0...01        same behaviour
   ...
2^15) 1..11         same behaviour

For figure 3.3, if we implement each machine separately we need 5 flops, 2 for the S machine and 3 for the Q machine. If we merge the state machines, we need ⌈log2(3×5)⌉ = 4 flops.
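The flip-flop counts follow from needing at least the base-2 logarithm of the number of states, rounded up. As a worked check, using the state counts read off the three figures:

```latex
\begin{align*}
\text{Figure 3.1:}\quad & \lceil \log_2 4 \rceil = 2 \\
\text{Figure 3.2:}\quad & \lceil \log_2 10 \rceil = \lceil 3.32 \rceil = 4 \\
\text{Figure 3.3 (separate machines):}\quad & \lceil \log_2 3 \rceil + \lceil \log_2 5 \rceil = 2 + 3 = 5 \\
\text{Figure 3.3 (merged machine):}\quad & \lceil \log_2 (3 \times 5) \rceil = \lceil 3.91 \rceil = 4
\end{align*}
```

Merging saves a flop here because the two separate encodings each waste part of their code space (4 codes for 3 states, 8 codes for 5 states), while the merged machine uses 15 of its 16 codes.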

One of the purposes of this exercise is to illustrate how many test vectors it requires to exhaustively test the behaviour of even simple circuits. Also, this demonstrates how the structure of a circuit affects the number of test vectors needed. Size alone is not the determining factor.

P3.3.2 State Machines in General

If a circuit has i signals of 1-bit each that are inputs, f 1-bit signals that are outputs of flip-flops and c 1-bit signals that are the outputs of combinational circuitry, what is the maximum number of states that the circuit can have?

Answer:

The maximum number of states for a circuit with i inputs and f flops is $2^{i+f}$. The values of combinational signals are determined by the flops and the inputs, and so they don’t contribute to the total number of states. Each output is either a combinational signal or the output of a flip flop, so the outputs are subsumed by the combinational and flopped signals.
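As a small worked instance of this bound, take hypothetical values of one input bit and two flip-flops (the symbol $N_{\max}$ is introduced here only for illustration):

```latex
\[
N_{\max} = 2^{\,i+f}, \qquad i = 1,\ f = 2 \;\Rightarrow\; N_{\max} = 2^{3} = 8 .
\]
```

Adding combinational signals to such a circuit leaves this number unchanged, since each of them is a function of those three bits.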

P3.4 Test Plan Creation

You’re on the functional verification team for a chip that will control a simple portable CD-player. Your task is to create a plan for the functional verification for the signals in the entity cd_digital.

You’ve been told that the player behaves “just like all of the other CD players out there”. If your test plan requires knowledge about any potential non-standard features or behaviour, you’ll need to document your assumptions.

[Figure: CD player front panel with buttons pwr, prev, next, stop, play and display fields track, min, sec]


entity cd_digital is
  port (
    ----------------------------------------------------
    -- buttons
    prev, stop, play, next, pwr : in std_logic;
    ----------------------------------------------------
    -- detect if player door is open
    open : in std_logic;
    ----------------------------------------------------
    -- output display information
    track : out std_logic_vector(3 downto 0);
    min   : out unsigned(6 downto 0);
    sec   : out unsigned(5 downto 0)
  );
end cd_digital;

P3.4.1 Early Tests

Describe five tests that you would run as soon as the VHDL code is simulatable. For each test, describe your specification, stimulus, and check. Summarize why your collection of tests should be the first tests that are run.

Answer:

test1

  specification: when power is turned on, the display will show the number of tracks on the CD, and the minutes and seconds will show the total length of the CD.

  stimulus: power='0'; wait; power='1', all other signals are '0'.

  check: display outputs of circuit match specification

test2

  specification: when power is on, play starts CD playing, display for track=1, min and sec show remaining time for song and start decrementing.

  stimulus: power='1'; play='0'; wait; play='1', all other signals are '0'.

  check: display outputs of circuit match specification

test3

  specification: when power is on and CD is playing, next starts next song. Display for track increments, min and sec show remaining time for next song and start decrementing.

  stimulus: power='1'; play='0'; next='0'; wait; play='1'; wait; next='1', all other signals are '0'.

  check: display outputs of circuit match specification

test4

  specification: when power is on and CD is playing, prev starts previous song. Display for track decrements, min and sec show remaining time for previous song and start decrementing.

  stimulus: power='1'; play='0'; prev='0'; wait; play='1'; wait; prev='1', all other signals are '0'.

  check: display outputs of circuit match specification

test5

  specification: when power is on and CD is playing, stop causes CD to stop.

  stimulus: power='1'; play='0'; stop='0'; wait; play='1'; wait; stop='1', all other signals are '0'.

  check: display outputs of circuit match specification

justification for choices

  These cases test the basic operations of the CD player. Each test focusses on a different aspect of the player’s behaviour.

P3.4.2 Corner Cases

Describe five corner-cases or boundary conditions, and explain the role of corner cases and boundary conditions in functional verification.

NOTES:

1. You may reference your answer for problem P3.4.1 in this question.

2. If you do not know what a “corner case” or “boundary condition” is, you may earn partial credit by: checking this box and explaining five things that you would do in functional verification.

Answer:

case 1 : press both prev and next while a CD is playing
case 2 : open the case while a CD is playing
case 3 : press play and stop at the same time
case 4 : press any button other than power when the player is off
case 5 : press next repeatedly until track counter wraps around

role of corner cases : The purpose of corner cases is to test unusual situations that designers might not have thought of, and so are more likely to contain bugs than normal behaviour.


P3.5 Sketches of Problems

1. Given a circuit, VHDL code, or circuit size info, calculate simulation run time to achieve n% coverage.

2. Given a fragment of VHDL code, list things to do to make it more robust, e.g. illegal data and states go to initial state.

3. Smith Problem 13.29

Chapter 4

Performance Analysis and Optimization Problems

P4.1 Farmer

A farmer is trying to decide which of his two trucks to use to transport his apples from his orchardto the market.

Facts:

              capacity of truck   speed when loaded with apples   speed when unloaded (no apples)
big truck     12 tonnes           15 kph                          38 kph
small truck   6 tonnes            30 kph                          70 kph

distance to market   120 km
amount of apples     85 tonnes

NOTES:

1. All of the loads of apples must be carried using the same truck

2. Elapsed time is counted from beginning to deliver first load to returning to the orchard afterthe last load

3. Ignore time spent loading and unloading apples, coffee breaks, refueling, etc.

4. For each trip, a truck travels either its fully loaded or empty speed.

Question: Which truck will take the least amount of time and what percentage faster will the truck be?


Answer:

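$$\mathit{TimeTot} = \mathit{NumTrips} \times (\mathit{TimeLoaded} + \mathit{TimeUnloaded})$$

$$\mathit{NumTrips} = \lceil \mathit{Harvest} / \mathit{Capacity} \rceil$$

All trips are for the same distance, so distance cancels out of the equations:

$$\mathit{Time} \propto 1/\mathit{Speed}$$

$$\mathit{TimeTotBig} \propto \lceil 85/12 \rceil \times (1/15 + 1/38) = 8 \times 0.0930 = 0.7439$$

$$\mathit{TimeTotSmall} \propto \lceil 85/6 \rceil \times (1/30 + 1/70) = 15 \times 0.0476 = 0.7143$$

Small truck will take less time.

$$\mathit{PctFaster} = \frac{\mathit{TimeSlow} - \mathit{TimeFast}}{\mathit{TimeFast}} = \frac{\mathit{TimeTotBig} - \mathit{TimeTotSmall}}{\mathit{TimeTotSmall}} = \frac{0.7439 - 0.7143}{0.7143} = 4.14\%$$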
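The calculation above can be checked with a short script. This is only a sketch: the function name `total_time_hours` is made up for illustration, and it keeps the 120 km distance instead of cancelling it, so the results are in hours.

```python
import math

def total_time_hours(harvest_t, capacity_t, loaded_kph, unloaded_kph, distance_km):
    """Total time for all round trips, ignoring loading, breaks, and refueling."""
    trips = math.ceil(harvest_t / capacity_t)  # partial loads still need a full trip
    return trips * (distance_km / loaded_kph + distance_km / unloaded_kph)

big = total_time_hours(85, 12, 15, 38, 120)
small = total_time_hours(85, 6, 30, 70, 120)
pct_faster = (big - small) / small * 100

print(f"big truck:   {big:.2f} h")
print(f"small truck: {small:.2f} h")
print(f"small truck is {pct_faster:.2f}% faster")
```

Because both totals scale with the same distance, the percentage is the same as the one obtained from the proportional values.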

Question: In planning ahead for next year, is there anything the farmer could do to decrease his delivery time with little or no additional expense? If so, what is it? If not, explain.

Answer: Use two drivers.

Use a combination of the small truck and large truck to improve his utilization.

P4.2 Network and Router

In this question there is a network that runs a protocol called BigLan. You are designing a router called the DataChopper that routes packets over the network running BigLan (i.e. they’re BigLan packets).


The BigLan network protocol runs at a data rate of 160 Mbps (Megabits per second). Each BigLan packet contains 100 Bytes of routing information and 1000 Bytes of data.

You are working on the DataChopper router, which has the following performance numbers:

75 MHz   clock speed
4        cycles for a byte of either data or header
500      number of additional clock cycles to process the routing information for a packet

P4.2.1 Maximum Throughput

Which has a higher maximum throughput (as measured in data bits per second; that is, only the payload bits count as useful work), the network or your router, and how much faster is it?

Answer:

Data throughput can be thought of as useful data / time. So, often in these types of questions you will have to do the following:

total data/time * useful data/total data.

The maximum data throughput of the two technologies in terms of bits can be calculated as follows:

1. BigLan Network Protocol

Maximum data throughput = 160 Mbps * (8000 useful data bits per packet / 8800 total bits per packet)
                        = 145.45 Mbps

2. DataChopper Router

Time required for a packet = 500 clock cycles + 0.5 clock cycles per bit * 8800 packet bits
                           = 500 clock cycles + 4400 clock cycles
                           = 4900 clock cycles
                           = 4900 clock cycles * 13.33 ns per cycle
                           = 65333 ns per packet

Time required for a data bit = 65333 ns per packet / 8000 data bits
                             = 8.167 ns per data bit

Maximum data throughput = 1 / 8.167 ns per data bit
                        = 122.45 Mbps

You could also use the previous method: cycles/sec * total bytes/cycle * useful bytes/total bytes.


The network has a higher maximum throughput.

What percentage higher?

n% higher performance = (perf high - perf low) / perf low
                      = (145 - 122)/122
                      = 19%

The network has 19% higher maximum performance. Therefore, the router can’t keep up with the network.
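As a cross-check of the numbers above, the following sketch recomputes both data throughputs from the problem's parameters. The constant and function names are illustrative and do not come from the course notes.

```python
HEADER_BYTES = 100      # routing information per packet
DATA_BYTES = 1000       # payload per packet
LINK_MBPS = 160.0       # BigLan raw data rate
CLOCK_MHZ = 75.0        # DataChopper clock
CYCLES_PER_BYTE = 4     # for either data or header bytes
ROUTING_CYCLES = 500    # extra cycles per packet for routing

def network_data_mbps():
    """Payload throughput of the network: raw rate times useful fraction."""
    return LINK_MBPS * DATA_BYTES / (HEADER_BYTES + DATA_BYTES)

def router_data_mbps():
    """Payload throughput of the router: payload bits per packet time."""
    cycles = ROUTING_CYCLES + CYCLES_PER_BYTE * (HEADER_BYTES + DATA_BYTES)
    packet_time_us = cycles / CLOCK_MHZ          # MHz = cycles per microsecond
    return DATA_BYTES * 8 / packet_time_us       # bits per microsecond = Mbps

net = network_data_mbps()
rtr = router_data_mbps()
pct_higher = (net - rtr) / rtr * 100

print(f"network: {net:.2f} Mbps, router: {rtr:.2f} Mbps, network is {pct_higher:.1f}% higher")
```

Without intermediate rounding the difference comes out slightly under 19%, which agrees with the rounded figure in the answer.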

P4.2.2 Packet Size and Performance

Explain the effect of an increase in packet length on the performance of the DataChopper (as measured in the maximum number of bits per second that it can process) assuming the header remains constant at 100 bytes.

Answer:

As packet size increases, the overhead associated with the constant routing delay will become less significant.

The data rate of the router will slowly approach that of the network but it will never surpass the network throughput. If there was not any overhead for routing, the peak data rate for the router would be 150 Mbps compared to 160 Mbps of the network.

It should be noted that even though a giant packet size would seem like an ideal solution in this question, in reality lost packets, latency, and small data sizes would make this impractical. For example, if each packet was 1 GB and the network was transmitting a cell-phone conversation, you would have to wait a very long time for the first packet to arrive before you could hear the other person. Also, if a packet was lost, you’d have to wait a long time to see if the other person is still on the phone.
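The trend described above can be illustrated numerically. The sketch below (the function name and the choice of packet sizes are assumptions for illustration) computes the router's total bit rate for growing payload sizes with the header fixed at 100 bytes.

```python
def router_total_mbps(data_bytes, header_bytes=100, clock_mhz=75.0,
                      cycles_per_byte=4, routing_cycles=500):
    """Total bits (header + data) the router processes per microsecond, i.e. Mbps."""
    total_bytes = header_bytes + data_bytes
    cycles = routing_cycles + cycles_per_byte * total_bytes
    return total_bytes * 8 * clock_mhz / cycles

# Limit with no routing overhead: 8 bits every 4 cycles at 75 MHz
peak_mbps = 8 * 75.0 / 4

sizes = (1_000, 10_000, 100_000, 1_000_000)
rates = [router_total_mbps(n) for n in sizes]
for n, r in zip(sizes, rates):
    print(f"{n:>9} data bytes: {r:7.2f} Mbps (limit {peak_mbps:.0f} Mbps)")
```

The rate rises with packet size but stays below the 150 Mbps limit, and therefore below the 160 Mbps of the network.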

P4.3 Performance Short Answer

If performance doubles every two years, by what percentage does performance go up every month? This question is similar to compound growth from your economics class.

Answer:

$$P = 2^{t/24}$$ (where t is measured in months)

$$P = 2^{1/24} = 1.029$$

Therefore, performance goes up by 2.9% each month.
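A quick numerical check of the compound-growth result, as a sketch (variable names are illustrative):

```python
# Performance doubles every 24 months, so the monthly factor is the 24th root of 2.
monthly_factor = 2 ** (1 / 24)
monthly_pct = (monthly_factor - 1) * 100

print(f"monthly factor: {monthly_factor:.4f} ({monthly_pct:.2f}% per month)")

# Compounding the monthly factor over 24 months recovers the doubling.
two_year_factor = monthly_factor ** 24
```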

P4.4 Microprocessors

The Yme microprocessor is very small and inexpensive. One performance sacrifice the designers have made is to not include a multiply instruction. Multiplies must be written in software using loops of shifts and adds.

The Yme currently ships at a clock frequency of 200MHz and has an average CPI of 4.

A competitor sells the Y!v1 microprocessor, which supports exactly the same instructions as the Yme. The Y!v1 runs at 150MHz, and the average program is 10% faster on the Yme than it is on the Y!v1.

P4.4.1 Average CPI

Question: What is the average CPI for the Y!v1? If you don’t have enough information to answer this question, explain what additional information you need and how you would use it.

Answer:

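Use the following subscripts:

Yme    1
Y!v1   2
Y!u2   3

The Yme is 10% faster than the Y!v1.

$$\mathit{NumInst}_2 = \mathit{NumInst}_1$$

$$\mathit{ClockSpeed}_1 = 200\,\mathrm{MHz}$$

$$\mathit{ClockSpeed}_2 = 150\,\mathrm{MHz}$$

$$\mathit{CPI}_1 = 4$$

Solve for $\mathit{CPI}_2$.

$$\mathit{Time} = \frac{\mathit{NumInst} \times \mathit{CPI}}{\mathit{ClockSpeed}}$$

$$\frac{\mathit{Time}_2 - \mathit{Time}_1}{\mathit{Time}_1} = 0.10$$

$$\frac{\mathit{Time}_2}{\mathit{Time}_1} = 1.10$$

$$\mathit{Time}_2 = 1.10 \times \mathit{Time}_1$$

$$\frac{\mathit{NumInst}_2 \times \mathit{CPI}_2}{\mathit{ClockSpeed}_2} = 1.10 \times \frac{\mathit{NumInst}_1 \times \mathit{CPI}_1}{\mathit{ClockSpeed}_1}$$

$$\mathit{CPI}_2 = \frac{1.10 \times \mathit{ClockSpeed}_2 \times \mathit{NumInst}_1 \times \mathit{CPI}_1}{\mathit{NumInst}_2 \times \mathit{ClockSpeed}_1} = \frac{1.10 \times \mathit{ClockSpeed}_2 \times \mathit{CPI}_1}{\mathit{ClockSpeed}_1} = \frac{1.10 \times 150\,\mathrm{MHz} \times 4}{200\,\mathrm{MHz}} = 3.3$$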

Common mistakes:

• Swapping performance of Yme and Y!v1.

A new version of the Y!, the Y!u2, has just been announced. The Y!u2 includes a multiply instruction and runs at 180MHz. The Y!u2 publicity brochures claim that using their multiply instruction, rather than shift/add loops, can eliminate 10% of the instructions in the average program. The brochures also claim that the average performance of Y!u2 is 30% better than that of the Y!v1.

P4.4.2 Why not you too?

Question: Assuming the advertising claims are true, what is the average CPI for the Y!u2? If you don’t have enough information to answer this question, explain what additional information you need and how you would use it.


Answer:

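$$1.3 \times \mathit{Time}_3 = \mathit{Time}_2$$

$$1.3 \times \frac{\mathit{NumInst}_3 \times \mathit{CPI}_3}{\mathit{ClockSpeed}_3} = \frac{\mathit{NumInst}_2 \times \mathit{CPI}_2}{\mathit{ClockSpeed}_2}$$

Solve for $\mathit{CPI}_3$:

$$\mathit{CPI}_3 = \frac{\mathit{ClockSpeed}_3 \times \mathit{NumInst}_2 \times \mathit{CPI}_2}{1.3 \times \mathit{NumInst}_3 \times \mathit{ClockSpeed}_2} = \frac{180\,\mathrm{MHz} \times 3.3}{1.3 \times 0.9 \times 150\,\mathrm{MHz}} = 3.38$$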

Common mistakes:

• Comparing performance of Y!u2 to Yme, rather than Y!v1 .

• Saying that time for Y!u2 is 70% of Y!v1 .

• Forgetting to take into account reduced number of instructions.
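Both CPI derivations (for the Y!v1 and the Y!u2) follow the same pattern, which the sketch below captures in one helper. The helper name and its parameters are illustrative and not part of the course notes; `time_ratio` is the new processor's run time divided by the reference processor's run time.

```python
def cpi_from_time_ratio(time_ratio, clock_mhz, ref_clock_mhz, ref_cpi, inst_ratio=1.0):
    """Solve Time = NumInst * CPI / ClockSpeed for the unknown CPI.

    time_ratio: Time_new / Time_ref
    inst_ratio: NumInst_new / NumInst_ref
    """
    return time_ratio * clock_mhz * ref_cpi / (inst_ratio * ref_clock_mhz)

# Y!v1 versus Yme: Yme is 10% faster, so Time_v1 = 1.10 * Time_yme.
cpi_v1 = cpi_from_time_ratio(1.10, clock_mhz=150, ref_clock_mhz=200, ref_cpi=4.0)

# Y!u2 versus Y!v1: 30% better performance means 1.3 * Time_u2 = Time_v1,
# and the multiply instruction removes 10% of the instructions.
cpi_u2 = cpi_from_time_ratio(1 / 1.3, clock_mhz=180, ref_clock_mhz=150,
                             ref_cpi=cpi_v1, inst_ratio=0.9)

print(f"CPI of Y!v1: {cpi_v1:.2f}")
print(f"CPI of Y!u2: {cpi_u2:.2f}")
```

Note that "30% better performance" is modelled as a time ratio of 1/1.3, not 0.7, which is the second common mistake listed above.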

P4.4.3 Analysis

Question: Which of the following do you think is most likely and why.

1. the Y!u2 is basically the same as the Y!v1 except for the multiply

2. the Y!u2 designers made performance sacrifices in their design in order to include a multiply instruction

3. the Y!u2 designers performed other significant optimizations in addition to creating a multiply instruction

Answer:

The most likely analysis is that the Y!u2 is basically the same as the Y!v1 except for the multiply. This is because the Y!u2 has a slightly larger CPI than the Y!v1, which is in keeping with the addition of a multiply instruction. A multiply instruction probably has a larger-than-average CPI.

The increase in clock speed likely comes from a new fabrication process, and would not have required significant changes to the design of the chip.



Draw an optimized dataflow diagram that improves the performance and produces the same output values. Or, if the performance cannot be improved, describe the limiting factor on the performance.

NOTES:


• you may not increase the resource usage (input ports, registers, output ports, f components, g components)


[Figure: two dataflow diagrams, “Before Optimization” and “After Optimization”, built from f and g components with signals a, b, c, d, and e.]

P4.6 Performance Optimization with Memory Arrays

This question deals with the implementation and optimization for the algorithm and library of circuit components shown below.

Algorithm

    q = M[b];
    if (a > b) then
        M[a] = b;
        p = (M[b-1] * b) + M[b];
    else
        M[a] = b;
        p = M[b+1] * a;
    end;


NOTES:


1. 25% of the time, a > b

2. The inputs of the algorithm are a and b.

3. The outputs of the algorithm are p and q.


5. You may choose to read your input data values at any time and produce your outputs at any time. For your inputs, you may read each value only once (i.e. the environment will not send multiple copies of the same value).


7. M is an internal memory array, which must be implemented as dual-ported memory with one read/write port and one write port.

8. Assume all memory address and other arithmetic calculations are within the range of representable numbers (i.e. no overflows occur).


10. Your dataflow diagram must include circuitry for computing a > b and using the result to choose the value for p

Draw a dataflow diagram for each operation that is optimized for the fastest overall execution time.

NOTE: You may sacrifice area efficiency to achieve high performance, but marks will be deducted for extra hardware that does not contribute to performance.

Answer:

a > b (25%)

    q = M[b];
    M[a] = b;
    p = (M[b-1] * b) + M[b];

a ≤ b (75%)

    q = M[b];
    M[a] = b;
    p = M[b+1] * a;

1. a ≤ b happens 75% of the time, so initially focus on common case.

(a) a ≤ b means that a ≠ b+1, therefore can do M[b+1] read in parallel with M[a] write or with M[b] read.

(b) But, could have a = b, so can’t do M[a] write in parallel with M[b] read.


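[Figure: dataflow diagram for the a ≤ b case, from inputs a and b through memory reads M(rd), memory write M(wr), an increment, and a multiply to outputs p and q; annotated delays: 25 ns, 60 ns, 65 ns, and 150 ns.]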

(c) Critical path is from b to p: 150ns + 5ns for mux on p = 155ns.

(d) Longest operation in diagram is multiplication: 65ns.

(e) Minimum clock period is 65ns + 5ns for register = 70ns.

[Figure: three candidate dataflow diagrams for the a ≤ b case, annotated with operation delays (25 ns, 30 ns, 60 ns, 65 ns) and 5 ns register and multiplexer delays.]

period    70 ns      75 ns      90 ns
latency   5 cycles   4 cycles   3 cycles
time      350 ns     300 ns     270 ns

(f) Minimum latency is 3 clock cycles, because can’t do all memory operations in parallel and need registers on both inputs and outputs.

(g) Best overall performance for a ≤ b case is with clock period of 90ns.

2. Now try a > b with 90 ns clock period.


(a) a > b means that a ≠ b and a ≠ b-1, so no memory address conflicts to create dependencies and complications.

[Figure: two candidate dataflow diagrams for the a > b case, annotated with operation delays (25 ns, 30 ns, 60 ns, 65 ns) and 5 ns register and multiplexer delays.]

period    90 ns      95 ns
latency   4 cycles   3 cycles
time      360 ns     285 ns

(b) Without going to a triple-ported memory, can’t reduce latency below 3.

(c) Best performance for a > b case is with clock period of 95 ns.

3. Choose 95 ns clock period, which gives a latency of 3 clock cycles for both options.

4. Optimize dataflow diagrams to reduce area without sacrificing performance.

[Figure: two area-optimized dataflow diagrams, one for each case, annotated with operation delays (25 ns, 30 ns, 60 ns, 65 ns) and 5 ns register and multiplexer delays.]

5. Merge dataflow diagrams.


[Figure: merged dataflow diagram covering both cases, with multiplexers selecting between the a > b and a ≤ b datapaths.]

Optimal performance (Period = 95 ns)

Component                Quantity
Input                    2
Output                   2
Register                 5
Adder                    1
Subtracter               1
ALU                      0
Memory read              2
Memory write             1
Multiplication           1
2:1 Multiplexor          2
Clock Period             95 ns
Average Latency          3 cycles
Average Execution Time   285 ns
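The choice of the 95 ns clock can be checked by weighting each case by how often it occurs. The sketch below uses the latencies from the period/latency tables in the solution; the dictionary layout and function name are assumptions made for illustration.

```python
# Latency in clock cycles for each case at two candidate clock periods (ns),
# taken from the tables in the solution. At 95 ns both cases take 3 cycles.
LATENCY_CYCLES = {
    90: {"a_le_b": 3, "a_gt_b": 4},
    95: {"a_le_b": 3, "a_gt_b": 3},
}
P_A_GT_B = 0.25  # a > b happens 25% of the time

def average_time_ns(period_ns):
    """Expected execution time, weighting each case by its probability."""
    lat = LATENCY_CYCLES[period_ns]
    avg_cycles = (1 - P_A_GT_B) * lat["a_le_b"] + P_A_GT_B * lat["a_gt_b"]
    return period_ns * avg_cycles

avg_90 = average_time_ns(90)
avg_95 = average_time_ns(95)
best_period = min(LATENCY_CYCLES, key=average_time_ns)

print(f"90 ns clock: {avg_90:.1f} ns average")
print(f"95 ns clock: {avg_95:.1f} ns average")
print(f"best period: {best_period} ns")
```

Although 90 ns is best for the common a ≤ b case alone, the extra cycle it costs in the a > b case makes the 95 ns clock faster on average.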


[Figure: Suboptimal area (two multipliers)]

[Figure: Suboptimal performance (Period = 100 ns)]

P4.7 Multiply Instruction

You are part of the design team for a microprocessor implemented on an FPGA. You currently implement your multiply instruction completely on the FPGA. You are considering using a specialized multiply chip to do the multiplication. Your task is to evaluate the performance and optimality tradeoffs between keeping the multiply circuitry on the FPGA or using the external multiplier chip.

If you use the multiplier chip, it will reduce the CPI of the multiply instruction, but will not change the CPI of any other instruction. Using the multiplier chip will also force the FPGA to run at a slower clock speed.

                                  FPGA option   FPGA + MULT option
average CPI                       5             ???
% of instrs that are multiplies   10%           10%
CPI of multiply                   20            6
Clock speed                       200 MHz       160 MHz

P4.7.1 Highest Performance

Which option, FPGA or FPGA+MULT, gives the higher performance (as measured in MIPs), and what percentage faster is the higher-performance option?

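Answer:

MIPs for FPGA option:

$$\mathit{MIPs}_{FPGA} = \frac{\mathit{MHz}_{FPGA}}{C_{FPGA}} = \frac{200}{5} = 40$$

Find MIPs for FPGA+MULT option:

$$\mathit{MIPs}_{FM} = \frac{\mathit{MHz}_{FM}}{C_{FM}}$$

Find CPI for FPGA+MULT option:

$$C_{FM} = \mathit{PI}_{mult} \times C_{mult} + \mathit{PI}_{other} \times C_{other}$$

Find CPI for non-multiply (other) instructions. Key insight is that the CPI for non-multiply instructions is the same for both the FPGA and FPGA+MULT.

$$C_{FPGA} = \mathit{PI}_{mult} \times C_{mult} + \mathit{PI}_{other} \times C_{other}$$

$$C_{other} = \frac{C_{FPGA} - \mathit{PI}_{mult} \times C_{mult}}{\mathit{PI}_{other}} = \frac{5 - 0.1 \times 20}{0.9} = 3.333$$

$$C_{FM} = \mathit{PI}_{mult} \times C_{mult} + \mathit{PI}_{other} \times C_{other} = 0.1 \times 6 + 0.9 \times 3.333 = 3.6$$

$$\mathit{MIPs}_{FM} = \frac{\mathit{MHz}_{FM}}{C_{FM}} = \frac{160}{3.6} = 44.4$$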

MIPsFM > MIPsFPGA, therefore the FPGA+MULT is the higher performance option.

$$n = \frac{\mathit{Pf}_{FM} - \mathit{Pf}_{FPGA}}{\mathit{Pf}_{FPGA}} = \frac{44.4 - 40}{40} = 11.1\%$$

The FPGA+MULT option is 11% faster than the FPGA option.
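The same steps can be written as a short script. This is a sketch with illustrative names; it relies, as the answer does, on the non-multiply CPI being identical in both options.

```python
FRAC_MULT = 0.10  # fraction of instructions that are multiplies

def cpi_other(avg_cpi, cpi_mult):
    """Back out the CPI of non-multiply instructions from the average CPI."""
    return (avg_cpi - FRAC_MULT * cpi_mult) / (1 - FRAC_MULT)

def average_cpi(cpi_mult, cpi_rest):
    """Weighted average CPI over multiply and non-multiply instructions."""
    return FRAC_MULT * cpi_mult + (1 - FRAC_MULT) * cpi_rest

c_other = cpi_other(avg_cpi=5, cpi_mult=20)       # same for both options
c_fm = average_cpi(cpi_mult=6, cpi_rest=c_other)  # FPGA+MULT average CPI

mips_fpga = 200 / 5
mips_fm = 160 / c_fm
pct_faster = (mips_fm - mips_fpga) / mips_fpga * 100

print(f"FPGA: {mips_fpga:.1f} MIPs, FPGA+MULT: {mips_fm:.1f} MIPs ({pct_faster:.1f}% faster)")
```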


P4.7.2 Performance Metrics

Explain whether MIPs is a good choice for the performance metric when making this decision.

Answer:

• MIPs is a good metric for this example, because we are comparing two microprocessors that use the same instruction set and will be used in the same environment.

In general, the disadvantage of MIPs is that it doesn’t take into account that different instructions accomplish different amounts of work. This causes problems when comparing microprocessors that use different instruction sets (e.g. one with a cosine instruction and one without).

On an exam, you need to explain whether or not MIPS is a good choice to use based on what is stated in the question. For example, if a question states that two processors have the same instruction set, are running the same program, and gives some information relating CPI and clock speed, then you can state that MIPS would be an okay comparison for the stated reasons.

Chapter 5

Timing Analysis Problems

P5.1 Terminology

For each of the terms: clock skew, clock period, setup time, hold time, and clock-to-q, answer which time periods (one or more of t1 – t9 or NONE) are examples of the term.

NOTES:

1. The timing diagram shows the limits of the allowed times (either minimum or maximum).

2. All timing parameters are non-negative.

3. The signal “a” is the input to a rising-edge flop and “b” is the output. The clock is “clk1”.

[Figure: timing diagram of signals clk1, clk2, a, and b with time periods t1 to t11 marked; legend: “signal may change”, “signal is stable”.]


Answer:

clock skew     t3
clock period   t7
setup time     t1
hold time      t2

P5.2 Hold Time Violations

P5.2.1 Cause

What is the cause of a hold time violation?

Answer:

The cause of a hold time violation is that new data reaches the gate that enables the input to affect the output before the gate is turned off.

P5.2.2 Behaviour

What is the bad behaviour that results if a hold time violation occurs?

Answer:

The bad behaviour that results from a hold time violation is that the new data will corrupt the contents of the storage loop, which is trying to store the previous data. If the new data arrives early enough to satisfy the setup constraint, then the new data will overwrite the previous data and will slip through the latch or flop.

P5.2.3 Rectification

If a circuit has a hold time violation, how would you correct the problem with minimal effort?

Answer:

A hold time violation can be corrected by adding a delay (buffer) to the data path before the input gate.

P5.3 Latch Analysis

Does the circuit below behave like a latch? If not, explain why not. If so, calculate the clock-to-Q, setup, and hold times; and answer whether it is active-high or active-low.

Gate   Delay
AND    4
OR     2
NOT    1

[Figure: circuit with inputs en and d and output q.]

Answer:

[Figure: the circuit annotated with signal values in load mode and in store mode.]

From the mode diagrams, if the circuit is a latch, it is active high, because the latch is in load mode when en=’1’.

Now check if timing of circuit is correct. The critical transition is from load mode to store mode.

[Figure: the circuit with node labels en, d, s1, cn, l1, and q, and a timing diagram for the transition from load to store mode.]

Clock-to-Q: 6 (1 NOT, 1 AND, and 1 OR gate from en to q.)

Setup: 6 (1 AND and 1 OR from d to controlling gate for storage loop, 0 gates from enable to controlling gate for storage loop.)

Hold:


• Hold time constraint must prevent new value arriving at d before en sets l1 to ’1’.

• Delay along data path is 0.

• Delay along clock path is 1.

• Hold time is 1.


P5.4 Critical Path and False Path

Find the critical path through the following circuit:

[Figure: circuit with signals a through m.]

P5.5 Critical Path

[Figure: circuit with signals a through m.]



Assume all delay and timing factors other than combinational logic delay are negligible.

P5.5.1 Longest Path

List the signals in the longest path through this circuit.

Answer:

[Figure: the circuit with signals a through m, annotated with delay values along each path.]

Longest path is: b, e, g, j

P5.5.2 Delay

What is the combinational delay along the longest path?

Delay: 18

P5.5.3 Missing Factors

What factors that affect the maximum clock speed does your analysis for parts 1 and 2 not take into account?

Answer:

• false paths

• wire delay

• clock skew

• clock jitter


P5.5.4 Critical Path or False Path?

Is the longest path that you found a real critical path, or a false path? If it is a false path, find the real critical path. If it is a critical path, find a set of assignments to the primary inputs that exercises the critical path.


P5.6 YACP: Yet Another Critical Path

Find the critical path in the circuit below.

[Figure: circuit with signals a through h.]


P5.7 Timing Models

In your next job, you have been told to use a “fanout” timing model, which states that the delay through a gate increases linearly with the number of gates in the immediate fanout. You dimly recall that a long time ago you learned about a timing model named Elmo, Elmwood, Elmore, El-Morre, or something like that.

For the circuit shown below as a schematic and as a layout, answer whether the fanout timing model closely matches the delay values predicted by the Elmore delay model.

[Figure: schematic and layout of gate G1 driving gates G2, G3, G4, and G5. Legend: Cg, Cx, and Cy are capacitances; R is the antifuse resistance.]

Assumptions:

• The capacitance of a node on a wire is independent of where the node is located on the wire.

Answer:

Equivalent Circuit:


[Figure: equivalent RC circuit from G1 to G2, G3, G4, and G5, built from resistances R and capacitances Cx, Cy, and Cg.]

$$t_{DG2} = R C_y + 2R C_x + 3R C_y + 4R C_g + 2R(C_y + C_g) + 2R(C_y + C_g) + 2R(C_y + C_g)$$

$$t_{DG2} = 2R C_x + 4R C_y + 4R C_g + 6R(C_y + C_g)$$

In general, for a similar circuit with fanout n:

$$t_{DGn} = 2R C_x + 4R C_y + 4R C_g + 2(n-1)R(C_y + C_g) = 2R C_x + 2(n+1)R(C_y + C_g)$$

There are two components in the delay equation:

1. A fixed component that is not a function of the fanout (2R×Cx).

2. A component that varies linearly with the fanout (2(n+1)R×(Cy+Cg)).

Yes, the fanout model closely matches the timing predicted by the Elmore model.
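The linear dependence on fanout can be seen by evaluating the general formula for several values of n. In the sketch below the component values are made-up unit values chosen only for illustration; the problem does not give numbers for R, Cx, Cy, or Cg.

```python
def elmore_delay(n, R, Cx, Cy, Cg):
    """General delay formula from the answer: fixed part plus a part linear in fanout n."""
    return 2 * R * Cx + 2 * (n + 1) * R * (Cy + Cg)

# Hypothetical unit values (not from the problem statement).
R, Cx, Cy, Cg = 1.0, 1.0, 1.0, 1.0

delays = [elmore_delay(n, R, Cx, Cy, Cg) for n in range(1, 7)]
steps = [b - a for a, b in zip(delays, delays[1:])]

# For fanout 4 the general formula should match the expanded expression.
expanded_n4 = 2 * R * Cx + 4 * R * Cy + 4 * R * Cg + 6 * R * (Cy + Cg)

print("delays:", delays)
print("increment per extra fanout gate:", steps)
```

Each additional gate in the fanout adds the same increment, 2R×(Cy+Cg), which is what a linear fanout model predicts.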


P5.8 Short Answer

P5.8.1 Wires in FPGAs

In an FPGA today, what percentage of the clock period is typically consumed by wire delay?

Answer: 40–60%

P5.8.2 Age and Time

If you were to compare a typical digital circuit from 5 years ago with a typical digital circuit today, would you find that the percentage of the total clock period consumed by capacitive load has increased, stayed the same, or decreased?

Answer: Decreased.

Justification:

• Transistors have gotten smaller, die size has remained roughly the same size or even increased, clock speeds are increasing.

• Signals are travelling roughly the same distance as before, but driving smaller capacitive loads. Thus, wire delay is not decreasing much, but capacitive load is decreasing.

• The clock period is decreasing, so the wire delay is taking up a larger percentage of the clock period and capacitive load delay is taking up a smaller percentage.

P5.8.3 Temperature and Delay

As temperature increases, does the delay through a typical combinational circuit increase, stay the same, or decrease?

Answer: Increase.

Justification:

• As temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.

• This increases resistivity, which increases delay.


P5.9 Worst Case Conditions and Derating Factor

Assume that we have a ’Std’ speed grade Actel A1415 (an ACT 3 part) Logic Module that drives 4 other Logic Modules:

P5.9.1 Worst-Case Commercial

Estimate the delay under worst-case commercial conditions (assume that the junction temperature is the same as the ambient temperature).

Answer:

For worst-case commercial conditions, assuming that TA = TJ, Logic Module delay, tPD, for ACT 3 ’Std’ with 4 fanout is 5.7 ns (see Smith Table 5.2).

Assume this is the slowest path, then estimated critical path delay between registers, tCRIT (worst-case commercial) is:

tCRIT = tPD + tSUD + tCO
      = 5.7ns + 0.8ns + 3.0ns
      = 9.5ns

P5.9.2 Worst-Case Industrial

Find the derating factor for worst-case industrial conditions and calculate the delay (assume that the junction temperature is the same as the ambient temperature).

Answer:

For worst-case industrial conditions, assuming that TA = TJ, the derating factor is 1.07 (see Table 5.3). Hence the delay tCRIT (worst-case industrial) is 7% greater than the worst-case commercial delay: 1.07×9.5 = 10.2ns

P5.9.3 Worst-Case Industrial, Non-Ambient Junction Temperature

Estimate the delay under the worst-case industrial conditions (assuming that the junction temperature is 105°C).

Answer:

For worst-case industrial conditions, the derating factor at 105°C is found by linear interpolation between the values for 85°C (1.07) and 125°C (1.17).

The interpolated derating factor is 1.12. Hence the delay tCRIT (worst-case industrial, TJ = 105°C) is: 1.12×9.5 = 10.6ns.
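The interpolation step can be written out as a small sketch. The helper name is illustrative; the two table points (85°C, 1.07) and (125°C, 1.17) and the 9.5 ns commercial delay are the values used in the answers above.

```python
def interpolate(x, x0, y0, x1, y1):
    """Linear interpolation between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

T_CRIT_COMMERCIAL_NS = 9.5  # worst-case commercial delay from P5.9.1

factor_105 = interpolate(105, 85, 1.07, 125, 1.17)
delay_105_ns = factor_105 * T_CRIT_COMMERCIAL_NS

print(f"derating factor at 105 C: {factor_105:.2f}")
print(f"worst-case industrial delay at 105 C: {delay_105_ns:.1f} ns")
```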

Chapter 6

Power Problems

P6.1 Short Answers

P6.1.1 Power and Temperature

As temperature increases, does the power consumed by a typical combinational circuit increase, stay the same, or decrease?

Answer:

Power will increase.

Justification:

• Leakage power will increase, because the leakage current follows:

$$I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}$$

where T is temperature.

• Short circuiting power will increase because:

– As temperature increases, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.

– This increases resistivity, which increases delay.

– Signals will rise and fall more slowly, which will increase the short circuiting time, and hence increase short circuiting power.


P6.1.2 Leakage Power

The new vice president of your company has set up a contest for ideas to reduce leakage power in the next generation of chips that the company fabricates. The prize for the person who submits the suggestion that makes the best tradeoff between leakage power and other design goals is to have a door installed on their cube. What is your door-winning idea, and what tradeoffs will your idea require in order to achieve the reduction in leakage power?

Answer:

Increase transistor size so as to increase threshold voltage. This will require an increase in supply voltage, which will likely increase total power.

Alternative: when increasing transistor size, keep the supply voltage the same, but decrease performance.

Alternative: change fabrication process and materials to reduce leakage current. This will likely be expensive.

Alternative: Use dual-Vt fabrication process.

P6.1.3 Clock Gating

In what situations could adding clock-gating to a circuit increase power consumption?

Answer:

• If the circuitry has a high utilization rate, then the power consumed by the clock gating circuit could be more than that saved in the main circuit.

Alternative: Even if the utilization rate is low, the utilization pattern could prevent the clock gating circuitry from turning off the clock to the main circuit. For example, if the circuit receives new data every other clock cycle, it would have a utilization rate of 50%, but might need to be powered up 100% of the time.

P6.1.4 Gray Coding

What are the tradeoffs in implementing a program counter for a microprocessor using Gray coding?

Answer:

• Gray coding is designed to reduce power, because only one bit changes when incrementing or decrementing.

• Program counters usually increment, rather than jump to completely different values. So, using gray coding should reduce power consumption.

• The downside is that the memory system probably doesn’t use gray-coded addresses, so additional circuitry would be needed to convert between gray and binary codes. This will increase area and likely decrease performance.

• Additionally, the extra circuitry to do the translation might require more power than is saved by using gray coding.

P6.2 VLSI Gurus

The VLSI gurus at your company have come up with a way to decrease the average rise and fall time (0-to-1 and 1-to-0 transitions) for signals. The current value is 1ns. With their fabrication tweaks, they can decrease this to 0.85ns.

P6.2.1 Effect on Power

If you implement their suggestions, and make no other changes, what effect will this have on power? (NOTE: Based on the information given, be as specific as possible.)

Answer:

Reducing the rise/fall time from 1 ns to 0.85 ns reduces the short circuit time by the same ratio. Hence, the new short circuit power is 85% of the original.

P6.2.2 Critique

A group of wannabe performance gurus claim that the above optimization can be used to improve performance by at least 15%. Briefly outline what their plan probably is, critique the merits of their plan, and describe any effect their performance optimization will have on power.

Answer:

The plan was probably to increase clock speed by 15%. However, reducing Tshort by 0.15 ns can at most decrease the clock period by 2×0.15 = 0.30 ns, while clock period ≫ 1 ns. Therefore, it does not work.


P6.3 Advertising Ratios

One day you are strolling the hallways in search of inspiration, when you bump into a person from the marketing department. The marketing department has been out surfing the web and has noticed that companies are advertising the MIPs/mm², MIPs/Watt, and Watts/cm³ of their products. This wide variety of different metrics has confused them.

Explain whether each metric is a reasonable metric for customers to use when choosing a system.

If the metric is reasonable, say whether “bigger is better” (e.g. 500 MIPs/mm² is better than 20 MIPs/mm²) or “smaller is better” (e.g. 20 MIPs/mm² is better than 500 MIPs/mm²), and which one type of product (cell phone, desktop computer, or compute server) is the metric most relevant to.

• MIPs/mm²

Answer: Unreasonable: with performance we care about the volume of a system, not its area.

• MIPs/Watt

Answer: Reasonable: bigger is better; e.g. cell-phone

• Watts/cm³

Answer: Reasonable: smaller is better; server farm

P6.4 Vary Supply Voltage

As the supply voltage is scaled down (reduced in value), the maximum clock speed that the circuit can run at decreases.

The scaling down of supply voltage is a popular technique for minimizing power. The maximum clock speed is related to the supply voltage by the following equation:

$$\mathit{MaxClockSpeed} \propto \frac{(V - V_{Th})^2}{V}$$

where $V$ is supply voltage and $V_{Th}$ is threshold voltage.

With a supply voltage of 3V and a threshold voltage of 0.8V, the maximum clock speed is measured to be 200MHz. What will the maximum clock speed be with a supply voltage of 1.5V?


Answer:

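$$\mathit{MaxClockSpeed} \propto \frac{(V - V_{Th})^2}{V}$$

$$\frac{\mathit{MaxClockSpeed}_1}{\mathit{MaxClockSpeed}_2} = \left(\frac{(V_1 - V_{Th})^2}{V_1}\right)\left(\frac{V_2}{(V_2 - V_{Th})^2}\right)$$

$$\mathit{MaxClockSpeed}_1 = \mathit{MaxClockSpeed}_2 \left(\frac{(V_1 - V_{Th})^2}{V_1}\right)\left(\frac{V_2}{(V_2 - V_{Th})^2}\right)$$

$$\mathit{MaxClockSpeed}_1 = 200\,\mathrm{MHz} \left(\frac{(1.5\,\mathrm{V} - 0.8\,\mathrm{V})^2}{1.5\,\mathrm{V}}\right)\left(\frac{3\,\mathrm{V}}{(3\,\mathrm{V} - 0.8\,\mathrm{V})^2}\right)$$

$$\mathit{MaxClockSpeed}_1 \approx 40\,\mathrm{MHz}$$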
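The ratio above is easy to evaluate numerically. In this sketch the function name `speed_factor` is illustrative; it computes the quantity that the maximum clock speed is proportional to.

```python
def speed_factor(v_supply, v_threshold):
    """Quantity that MaxClockSpeed is proportional to: (V - VTh)^2 / V."""
    return (v_supply - v_threshold) ** 2 / v_supply

V_TH = 0.8          # threshold voltage, volts
F_AT_3V_MHZ = 200   # measured maximum clock speed at 3 V

f_at_1v5_mhz = F_AT_3V_MHZ * speed_factor(1.5, V_TH) / speed_factor(3.0, V_TH)

print(f"maximum clock speed at 1.5 V: {f_at_1v5_mhz:.1f} MHz")
```

Halving the supply voltage cuts the maximum clock speed to roughly one fifth, because the supply is now much closer to the threshold voltage.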

P6.5 Clock Speed Increase Without Power Increase

The following are given:

• You need to increase the clock speed of a chip by 10%

• You must not increase its dynamic power consumption

• The only design parameter you can change is supply voltage

• Assume that short-circuiting current is negligible


How much do you need to decrease the supply voltage by to achieve this goal?

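Answer:

Total power:

$$\mathit{Power} = (A \times F \times \tfrac{1}{2} C_L \times V^2) + (A \times F \times \tau \times I_{Sh} \times V) + (I_L \times V)$$

Only need to reduce dynamic power, therefore neglect static (leakage) power.

Neglect short circuiting current.

$$\mathit{Power} = A \times F \times \tfrac{1}{2} C_L \times V^2$$

$$F' = 1.1F$$

$$\mathit{Power}' = \mathit{Power}$$

$$A \times F' \times \tfrac{1}{2} C_L \times V'^2 = A \times F \times \tfrac{1}{2} C_L \times V^2$$

$$F' \times V'^2 = F \times V^2$$

$$1.1F \times V'^2 = F \times V^2$$

$$V' = \sqrt{\frac{F \times V^2}{1.1F}}$$

$$V' = 0.953V$$

We need to decrease the supply voltage to be 95.3% of its original value.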
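A brief numerical check of the result, as a sketch with illustrative variable names. It only uses the proportionality of switching power to F×V², since A and CL are unchanged.

```python
import math

FREQ_SCALE = 1.1  # clock speed must increase by 10%

# Switching power is proportional to F * V^2, so holding power constant
# requires V'^2 = V^2 / FREQ_SCALE.
voltage_scale = 1 / math.sqrt(FREQ_SCALE)
power_scale = FREQ_SCALE * voltage_scale ** 2

print(f"new supply voltage: {voltage_scale * 100:.1f}% of original")
print(f"resulting switching power: {power_scale * 100:.1f}% of original")
```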


What problems will you encounter if you continue to decrease the supply voltage?

Answer:

Decreasing the supply voltage will bring it closer to the threshold voltage. As the difference between the supply and threshold voltage decreases, it will limit the maximum frequency that the circuit can run at.

This then leads to decreasing the threshold voltage, which will then increase the leakage current, and raise the static power dissipation:

P6.6 Power Reduction Strategies

In each low power approach described below identify which component(s) of the power equation is (are) being minimized and/or maximized:

P6.6.1 Supply Voltage


Designers scaled down the supply voltage of their ASIC

Answer:

Scaling the supply voltage (V) reduces all components of the power equation. Switching power is proportional to the square of supply voltage, so switching power reduces the most.

P6.6.2 Transistor Sizing

The transistors were made larger.

Answer:

Resizing the transistor to increase the width to length ratio decreases the resistance of the transistor, which makes it faster. This means that the supply voltage can be reduced to save power while maintaining performance.

However, increasing the width to length ratio increases the capacitance. After a certain point, the capacitance increase becomes more significant than the reduction in supply voltage, causing power to increase.

Therefore, resizing is adjusting supply voltage and load capacitance to minimize their product in the switching power component.

P6.6.3 Adding Registers to Inputs

All inputs to functional units are registered

Answer:

When inputs are registered, the activity factor is decreased, which decreases the dynamic power. The activity factor decreases because combinational gates might have glitches, which cause the activity factor to be greater than 1. In contrast, a flip-flop will have at most one transition per clock cycle, and so its activity factor will be at most 1.

A counter argument is that the additional circuitry for the flops will consume more power. For reasonable size functional units, adding flops would be only a minor increase to the area, and hence the power increase from adding the flops would likely be less than the savings from decreasing the activity factor.


P6.6.4 Gray Coding

Gray coding of signals is used for address signals.

Answer:

Gray coding reduces the activity factor on signals that typically change by 1 or a small amount. Address signals have this behaviour, in contrast to data signals, where consecutive values are often completely different.

Reducing the activity factor will reduce the dynamic power.

However, as noted in problem P6.1.4, there are complications in using gray coding. For example, the compiler would need to use gray coding when calculating addresses, and the computer’s datapath would need a gray-code adder to compute address values.

Overall, the added design complexity of gray coding is probably too great a cost to pay except in situations where the encoding is entirely internal to the system being designed.
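The reduction in activity factor can be illustrated by counting bit transitions for a 4-bit address that increments from 0 to 15. This is a sketch: it uses the standard binary-reflected Gray code, and treating the number of bit flips as a stand-in for activity factor is a simplifying assumption.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_flips(sequence):
    """Total number of bit positions that change between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))

binary_addresses = list(range(16))
gray_addresses = [to_gray(n) for n in binary_addresses]

binary_flips = bit_flips(binary_addresses)
gray_flips = bit_flips(gray_addresses)

print(f"binary counting: {binary_flips} bit flips over 15 increments")
print(f"gray counting:   {gray_flips} bit flips over 15 increments")
```

With Gray coding every increment flips exactly one bit, whereas binary counting flips several bits whenever a carry ripples through.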

P6.7 Power Consumption on New Chip

While you are eating lunch at your regular table in the company cafeteria, a vice president sits down and starts to talk about the difficulties with a new chip.

The chip is a slight modification of an existing design that has been ported to a new fabrication process. Earlier that day, the first sample chips came back from fabrication. The good news is that the chips appear to function correctly. The bad news is that they consume about 10% more power than had been predicted.

The vice president explains that the extra power consumption is a very serious problem, because power is the most important design metric for this chip.

The vice president asks you if you have any idea of what might cause the chips to consume more power than predicted.

P6.7.1 Hypothesis

Hypothesize a likely cause for the surprisingly large power consumption, and justify why your hypothesis is likely to be correct.

Answer: Leakage current, because the same design on a different fabrication process resulted in a power change.

P6.7.2 Experiment

Briefly describe how to determine if your hypothesized cause is the real cause of the surprisingly large power consumption.

Answer: Measure the power consumption of the circuit with the clock turned off. If it is higher than expected, then leakage current is the problem.

P6.7.3 Reality

The vice president wants to get the chips out to market quickly and asks you if you have any ideas for reducing their power without changing the design or fabrication process. Describe your ideas, or explain why her suggestion is infeasible.

Answer: Reduce the clock frequency.

Chapter 7

Problems on Faults, Testing, and Testability

P7.1 Based on Smith q14.9: Testing Cost

A modern (circa 1995) production tester costs US$5–10 million. This cost is depreciated over the life of the tester (usually five years in the States due to tax guidelines).

1. Neglecting all operating expenses other than depreciation, if the tester is in use 24 hours a day, 365 days per year, how much does one second of test time cost?

Answer:

\[
\text{CostPerSecond} = \frac{\text{PurchaseCost}}{\text{Lifespan}} = \frac{5 \times 10^{6}}{5 \times 365 \times 24 \times 60 \times 60}
\]

= $0.031 for a US$5 million tester
= $0.062 for a US$10 million tester

2. A new tester sits idle for 6 months, because the design of the chips that it is to test is behind schedule. After the chips begin shipping, the tester is used 100% of the time. What is the cost of testing the chips relative to the cost if the chips had been completed on time?

Answer: 6 months is 10% of a 5 year lifespan.

Therefore the tester will test 90% of the total number of chips that it would normally test.

The cost per chip for testing will be:


\[
\text{NewTestCost} = \frac{1}{0.90} \times \text{OrigTestCost} = 111\% \times \text{OrigTestCost}
\]

3. The dimensions of the die to be tested are 20mm×10mm. The wafers are 200mm in diameter. Fabricating a wafer with die costs $3000. The yield is 70%. Assume that the number of die per wafer is equal to wafer area divided by chip area.

What percentage of the fabrication + test cost is for test if the chip is on schedule and requires 1 minute to test?

Answer:

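\[
\text{DiePerWafer} = \left\lfloor \frac{\text{WaferArea}}{\text{DieArea}} \right\rfloor = \left\lfloor \frac{\pi \times (200/2)^{2}}{10 \times 20} \right\rfloor = 157
\]

\[
\text{DieFabCost} = \frac{\text{WaferFabCost}}{\text{DiePerWafer}} = \frac{\$3000}{157} = \$19.10
\]

\[
\text{DieTestCost} = \text{TestCostPerSec} \times \text{TestTime} = \$0.062 \times 60 = \$3.72
\]

\[
\text{TestCostPct} = \frac{\text{DieTestCost}}{\text{DieTestCost} + \text{DieFabCost}} = \frac{\$3.72}{\$3.72 + \$19.10} = 16.3\%
\]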

Note: the 70% information is extraneous.

P7.2 Testing Cost and Total Cost

Given information:


• Each board uses two ACHIPs (plus lots of other chips that we don’t care about)



• Each 50% reduction in fault escapees doubles the cost of testing (intuition: doubles the number of tests that are run)

• If board-level testing detects faults in either one or both ACHIPs, it costs $200 to replace the ACHIP(s) (This is an approximation, based on the fact that the cost of the chip is much less than the total cost of $200).


What fault escapee rate will result in the lowest total cost for ACHIPs?

Answer: From section 7.1.2:

\[
\text{TotCost} = \text{NoTestCost} + \text{TestCost} + \text{EscapeeProb} \times \text{ReplaceCost}
\]

However, here we have two ACHIPs per board, so we need to use the escapee probability to compute the probability of a board needing to be replaced. The revised equation for total cost is:

\[
\text{TotCost} = \text{NoTestCost} + \text{TestCost} + \text{ReplaceProb} \times \text{ReplaceCost}
\]

The testing cost doubles, because we have two ACHIPs per board to test.

The probability of a board having at least one bad ACHIP (and therefore needing to be replaced) is 1 minus the probability that both ACHIPs are good.

\[
\text{ReplaceProb} = 1 - (1 - \text{EscapeeProb})^{2}
\]


NoTestCost   TestCost         EscapeeProb   ReplaceProb   AvgReplaceCost   TotCost
$10          $0               32%           54%           $108             $118
$10          2 × $1 = $2      16%           29%           $58              $70
$10          2 × $2 = $4      8%            15%           $30              $44
$10          2 × $4 = $8      4%            8%            $16              $34
$10          2 × $8 = $16     2%            4%            $8               $34
$10          2 × $16 = $32    1%            2%            $4               $46
$10          2 × $32 = $64    0.5%          1%            $2               $76

The chips will have the lowest cost if either $8 or $16 is spent on testing, giving a fault escapee rate of 4% or 2%. We choose to spend $16 on testing, because that has a lower escapee rate for the same total cost. The lower escapee rate will improve our reputation for quality.
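The table can be regenerated with a short Python sketch. The starting values ($10 untested cost, 32% untested escapee rate, $1 for the first level of per-chip testing) are read off the table itself, the function name is hypothetical, and the replacement probability is rounded to a whole percent before costing, which is an assumption made to match the figures shown.

```python
def total_cost_rows(no_test_cost=10, first_test_cost=1, start_escapee=0.32,
                    replace_cost=200, chips_per_board=2, rows=7):
    """Return (test_cost, escapee, replace_pct, avg_replace, total) per row."""
    table = []
    escapee = start_escapee
    per_chip_test = 0
    for _ in range(rows):
        # Board is replaced unless every ACHIP on it is good.
        replace_prob = 1 - (1 - escapee) ** chips_per_board
        replace_pct = round(replace_prob * 100)  # whole percent, as in the table
        avg_replace = replace_cost * replace_pct // 100
        test_cost = chips_per_board * per_chip_test
        total = no_test_cost + test_cost + avg_replace
        table.append((test_cost, escapee, replace_pct, avg_replace, total))
        # Halving the escapee rate doubles the per-chip testing cost.
        per_chip_test = first_test_cost if per_chip_test == 0 else 2 * per_chip_test
        escapee /= 2
    return table


for row in total_cost_rows():
    print(row)
```

The minimum total of $34 appears twice, at test costs of $8 and $16, which is the tie discussed above.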

P7.3 Minimum Number of Faults

In a circuit with i inputs, o outputs, and g gates with an average fanout of fo (fo > 1) and an average fanin of fi, what is the minimum number of faults that must be considered when using a single-stuck-at fault model?

Answer:

The minimum number of wire segments to connect a gate or input to fo other gates or outputs is fo + 1 (assuming fo > 1). If fo = 1, then the minimum number of wire segments is 1.

With i inputs and g gates, this results in (i+g)×(fo+1) wire segments.

Each wire segment has two possible faults (stuck-at-1 and stuck-at-0), therefore there are 2×(i+g)×(fo+1) potential single-stuck-at faults that must be considered.

NOTE: the fanin degree fi does not directly factor into this equation. However, there is a relationship between the number of gates g, the number of inputs i, the depth of the circuitry, the fanout degree fo, and the fanin degree fi. For example, the maximum number of gates whose inputs are all primary inputs is i × fo/fi.
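As a quick numeric check of the formula, here is a minimal Python sketch. The function name and the example circuit size (3 inputs, 4 gates) are hypothetical.

```python
def min_single_stuck_at_faults(inputs, gates, fanout):
    """2 faults per wire segment; each driver has fanout+1 segments if fanout > 1."""
    segments_per_driver = fanout + 1 if fanout > 1 else 1
    return 2 * (inputs + gates) * segments_per_driver


# Assumed example: 3 primary inputs, 4 gates, average fanout of 2.
print(min_single_stuck_at_faults(3, 4, 2))  # 42
```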

P7.4 Smith q14.10: Fault Collapsing

Draw the set of faults that collapse for AND, OR, NAND, and NOR gates, and a two-input mux.

Answer:

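[Figure: gate symbols with their collapsed fault sets. Each set groups the stuck-at faults on the two inputs with the equivalent fault on the output.]

AND:   input @0, input @0, output @0
OR:    input @1, input @1, output @1
NAND:  input @0, input @0, output @1
NOR:   input @1, input @1, output @0

A two-input mux does not have any controlling inputs, so it does not have any collapsible faults.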

P7.5 Mathematical Models and Reality

Given a correct circuit, and a non-stuck-at fault (e.g. bridging AND), will a single-stuck-at fault model detect the fault? If so, identify a single-stuck-at fault that will detect it, or explain why it can’t be detected.

P7.6 Undetectable Faults

Identify one of the undetectable single stuck-at faults in the circuit below, or say “NONE” if all single stuck-at faults are detectable.

[Figure: circuit with inputs a, b, c, output z, and lines L1 to L8]


P7.7 Test Vector Generation

Your task is to generate test vectors to detect faults in the circuit shown below.

Your manager has said that manufacturing only has time to run three test vectors on the circuit.

[Figure: circuit with inputs a, b, c and lines L1 to L8]

P7.7.1 Choice of Test Vectors

Which test vectors should you run and in what order should you run them?

P7.7.2 Number of Test Vectors

Write a brief statement (justified by data) to support either staying with three test vectors or increasing the test suite to four vectors.

P7.8 Time to do a Scan Test

A 1.2GHz chip has scan chains of length 30,000 bits, 20,000 bits, 24,000 bits, 25,000 bits, and two of 12,000 bits.



Calculate the total test time.

Answer:

We can load and unload all of the scan chains at the same time, so time will be limited by the longest (30,000 bits).

For the first test vector, we have to load it in, run the circuit for one clock cycle, then unload the result.

Loading the second test vector is done while unloading the first.

Clock Cycles   Vector 1   Vector 2   Vector 3   ...
30,000         Load
1              Run
30,000         Dump       Load
1                         Run
30,000                    Dump       Load
...            ...        ...        ...

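\[
\text{TimeTot} = \text{ClockPeriod} \times \left(\text{MaxLengthVec} + \text{NumVecs} \times (\text{MaxLengthVec} + 1)\right)
\]

\[
= \left(\frac{1}{0.50 \times 1.2 \times 10^{9}}\right) \times \left(30{,}000 + 500{,}000 \times (30{,}000 + 1)\right)
\]

\[
\approx 25.0\ \text{secs}
\]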
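The expression can be evaluated with a small Python sketch. The function name is hypothetical, and the 500,000 vectors and the 0.50 clock factor are taken from the worked expression above, since those values do not appear in the problem statement as it survives here. With those inputs the sketch evaluates to roughly 25.0 seconds.

```python
def scan_test_time(scan_clock_hz, max_chain_bits, num_vectors):
    """Initial load, then (shift + 1 run cycle) per vector, overlapping dump and load."""
    cycles = max_chain_bits + num_vectors * (max_chain_bits + 1)
    return cycles / scan_clock_hz


# Values as they appear in the worked expression (assumed inputs).
total_seconds = scan_test_time(0.50 * 1.2e9, 30_000, 500_000)
print(round(total_seconds, 1))
```

All chains shift in parallel, so only the longest chain (30,000 bits) enters the calculation.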

P7.9 BIST

In this problem, we will revisit the circuit from section 7.2.5, which is shown below. But, this time we’ll use BIST to test the circuit, rather than analyzing the faults and then choosing test vectors to catch the potential faults.

[Figure: circuit with inputs a, b, c, output z, and lines L1 to L8]

P7.9.1 Characteristic Polynomials

Derive the characteristic polynomials for the linear feedback shift registers shown below:

[Figure: two linear feedback shift registers, each built from three set/reset flops with a set input and signals d0, q0, d1, q1, d2, q2]

Answer: Both circuits have three flops, so their maximum exponent is \(x^3\).

A feedback tap on each signal \(d_i\) corresponds to a coefficient of 1 on \(x^i\) in the characteristic polynomial.


The first circuit has feedback taps for d0, d1, and d2. This gives a characteristic polynomial of:

\[ x^3 + x^2 + x + 1 \]

The second circuit has taps on d0 and d1, but not one on d2:

\[ x^3 + x + 1 \]

P7.9.2 Test Generation

Do either of the circuits generate a maximal-length non-repeating sequence?

Answer:

For an LFSR with n flops, the length of a maximal-length non-repeating sequence is \(2^n - 1\). Both of the LFSRs under consideration have 3 flops, so we are looking for a sequence of 7 non-repeating values.

We will first simulate the circuits to see their values, and then demonstrate how characteristic polynomials and division over Galois fields can be used to accomplish the same thing.

Simulation of \(x^3 + x^2 + x + 1\):

      d0  q0  d1  q1  d2  q2
1)     1   1   0   1   0   1
2)     0   1   1   0   0   0
3)     0   0   0   1   1   0
4)     1   0   1   0   1   1
       1   1   0   1   0   1   <- same as 1)

Simulation of \(x^3 + x + 1\):

      d0  q0  d1  q1  q2
1)     1   1   0   1   1
2)     1   1   0   0   1
3)     0   1   1   0   0
4)     0   0   0   1   0
5)     1   0   1   0   1
6)     0   1   1   1   0
7)     1   0   1   1   1

For \(x^3 + x^2 + x + 1\), we see that it repeats after 4 values.

For \(x^3 + x + 1\), we see that it generates a sequence of 7 different values before repeating. The circuit has three flops, so the maximum length sequence of non-repeating values it can generate is \(2^3 - 1\), which is 7. Thus, \(x^3 + x + 1\) is a maximal length linear feedback shift register.
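The two simulations can be reproduced with a short Python sketch. It assumes an internal-XOR structure in which q0 is loaded from the feedback (q2) and each tapped stage XORs the feedback into its shifted input; that structure is an assumption, chosen because it reproduces the q0, q1, q2 columns tabulated above. The function name and the `taps` parameter are hypothetical.

```python
def lfsr_sequence(taps, n_flops):
    """Return the list of (q0, ..., q[n-1]) states visited before repeating.

    taps: set of exponents i < n_flops whose coefficient in the
    characteristic polynomial is 1. All flops start at 1 (the set input).
    """
    state = (1,) * n_flops
    seen = []
    while state not in seen:
        seen.append(state)
        feedback = state[-1]
        state = tuple(
            (state[i - 1] if i > 0 else 0) ^ (feedback if i in taps else 0)
            for i in range(n_flops)
        )
    return seen


print(len(lfsr_sequence({0, 1, 2}, 3)))  # x^3 + x^2 + x + 1 -> 4
print(len(lfsr_sequence({0, 1}, 3)))     # x^3 + x + 1       -> 7
```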

Format for division:

          quotient
        -----------
lfsr )  message
          ...
        remainder

For an LFSR with no external input and n flops, the first n coefficients of the message are the reset values of the LFSR, and all of the remaining coefficients are 0.

For a test vector generator LFSR, the reset values are all 1s.

We hope to have a sequence of 7 unique remainders. With the three initial values in the LFSR flops, we require a message polynomial of 3 + 7 = 10 values.

The message polynomial is then:

\[ 1x^9 + 1x^8 + 1x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 0x^1 + 0x^0 \]

Carry out the division:

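                                                 1x^6 + 1x^5 + 0x^4 + 0x^3 + 1x^2 + 0x^1 + 1x^0
                            -------------------------------------------------------------------
1x^3 + 0x^2 + 1x^1 + 1x^0 ) 1x^9 + 1x^8 + 1x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 0x^1 + 0x^0
                            1x^9 + 0x^8 + 1x^7 + 1x^6
                            -------------------------
                                   1x^8 + 0x^7 + 1x^6 + 0x^5
                                   1x^8 + 0x^7 + 1x^6 + 1x^5
                                   -------------------------
                                          0x^7 + 0x^6 + 1x^5 + 0x^4
                                          0x^7 + 0x^6 + 0x^5 + 0x^4
                                          -------------------------
                                                 0x^6 + 1x^5 + 0x^4 + 0x^3
                                                 0x^6 + 0x^5 + 0x^4 + 0x^3
                                                 -------------------------
                                                        1x^5 + 0x^4 + 0x^3 + 0x^2
                                                        1x^5 + 0x^4 + 1x^3 + 1x^2
                                                        -------------------------
                                                               0x^4 + 1x^3 + 1x^2 + 0x^1
                                                               0x^4 + 0x^3 + 0x^2 + 0x^1
                                                               -------------------------
                                                                      1x^3 + 1x^2 + 0x^1 + 0x^0
                                                                      1x^3 + 0x^2 + 1x^1 + 1x^0
                                                                      -------------------------
                                                                             1x^2 + 1x^1 + 1x^0

Quotient    1x^6 + 1x^5 + 1x^2 + 1x^0

Remainder   1x^2 + 1x^1 + 1x^0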
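The long division can be cross-checked with a minimal Python sketch of polynomial division over GF(2), where subtraction is XOR. Polynomials are encoded as integers whose bit i holds the coefficient of x^i; both that encoding and the function name are assumptions made for this illustration.

```python
def gf2_divmod(message, divisor):
    """Divide two GF(2) polynomials; return (quotient, remainder) as integers."""
    quotient = 0
    top = divisor.bit_length() - 1          # degree of the divisor
    shift = message.bit_length() - divisor.bit_length()
    while shift >= 0:
        # Subtract (XOR) the shifted divisor when the leading coefficient is 1.
        if (message >> (shift + top)) & 1:
            message ^= divisor << shift
            quotient |= 1 << shift
        shift -= 1
    return quotient, message


# Message x^9 + x^8 + x^7 divided by x^3 + x + 1.
q, r = gf2_divmod(0b1110000000, 0b1011)
print(bin(q), bin(r))  # 0b1100101 0b111
```

The result encodes the quotient x^6 + x^5 + x^2 + 1 and the remainder x^2 + x + 1, matching the division above.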


The values on the flip flops inside an LFSR with n flops show up as the n-most-significant coefficients on the polynomials immediately below the subtraction lines in the long division. For example, after the second subtraction, the polynomial is \(0x^7 + 0x^6 + 1x^5 + 0x^4\). The three most significant coefficients are 001, and the value on (q2,q1,q0) after two steps of execution is also 001.

P7.9.3 Signature Analyzer

Given a signature analyzer equation of \(x^2 + x + 1\), find the expected value of the flops in the signature analyzer at the end of the test sequence. Also, design the hardware for the signature analyzer and result checker.

Answer:

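[Figure: test generator LFSR (three set/reset flops with outputs q0, q1, q2, a set input, a mode input, and inputs i_d(0), i_d(1), i_d(2)) driving the circuit under test, whose output is z]

Connect test generator to circuit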

Expected sequence of values from circuit:

      q0  q1  q2  z
1)     1   1   1  1   x^6
2)     1   0   1  0   x^5
3)     1   0   0  0   x^4
4)     0   1   0  0   x^3
5)     0   0   1  0   x^2
6)     1   1   0  1   x^1
7)     0   1   1  1   x^0

Polynomial for output sequence of circuit under test: \(x^6 + x + 1\)

The remainder of the result sequence divided by the signature analyzer polynomial is the value in the flops of the signature analyzer at the end of the test sequence.

m(x)   message (output of circuit under test)   x^6 + x + 1
p(x)   polynomial of signature analyzer         x^2 + x + 1
q(x)   quotient
r(x)   remainder

Format for division:

                        quotient
                      --------------------
signature analyzer )  circuit under test
                        ...
                      remainder

Carry out the division:

                                   1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
                     ----------------------------------------------
1x^2 + 1x^1 + 1x^0 ) 1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0
                     1x^6 + 1x^5 + 1x^4
                     ------------------
                            1x^5 + 1x^4 + 0x^3
                            1x^5 + 1x^4 + 1x^3
                            ------------------
                                   0x^4 + 1x^3 + 0x^2
                                   0x^4 + 0x^3 + 0x^2
                                   ------------------
                                          1x^3 + 0x^2 + 1x^1
                                          1x^3 + 1x^2 + 1x^1
                                          ------------------
                                                 1x^2 + 0x^1 + 1x^0
                                                 1x^2 + 1x^1 + 1x^0
                                                 ------------------
                                                        1x^1 + 0x^0

Quotient    1x^4 + 1x^3 + 1x^1 + 1x^0

Remainder   1x^1 + 0x^0

Check division:

\[ m(x) = q(x) \times p(x) + r(x) \]

\[ x^6 + x + 1 = (x^4 + x^3 + x + 1)(x^2 + x + 1) + x = (x^6 + 1) + x = x^6 + x + 1 \]

Division was done correctly.

The final value on the two flops in the signature analyzer will be the remainder: \(1x^1 + 0x^0\) = 10.

NOTE: When looking at the remainder (signature), we look at the outputs of the flops, representing the flop nearest the input as \(x^0\).

Using hardware:


[Figure: two-flop signature analyzer (input i, reset, signals d0, q0, d1, q1) and its timing diagram showing clk, i, d0, q0, d1, and q1, with the quotient and remainder bits marked]

Signature analyzer and timing diagram

The quotient and the remainder calculated using long division match the ones that were calculated using the circuit. The values on the flops in the signature analyzer match, cycle by cycle, the two most significant coefficients on the intermediate remainders calculated during long division. The intermediate remainders are the polynomials below the subtraction lines.

(When looking at the circuit, remember that for an LFSR with n flops, it takes n clock cycles for the circuit to become “primed” with the input sequence and match the long-division arithmetic.)

The “ok” circuit for this signature analyzer is just a 2-input AND gate with the q0 input inverted, because the remainder is 10 (q1 = 1, q0 = 0).

[Figure: two-flop signature analyzer (input i, reset, signals d0, q0, d1, q1) with a gate on q0 and q1 producing ok]

Signature analyzer with “ok” circuit

The result checker should check the ok signal one cycle after the last test vector.

The last test vector in the sequence is 110. We can either look for 110 and delay by one clock cycle, or we can look for the first test vector (111) in the second iteration of the sequence. To make sure that we are looking at the second iteration of the sequence, and not the first, we look at reset.

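[Figure: max-length LFSR outputs q0, q1, q2 drive the circuit under test; its output z feeds the signature analyzer, whose ok output is combined with a decode of q0, q1, q2 to produce all_ok]

Result checker circuit option 1

[Figure: the same arrangement, with reset as an additional input to the logic that produces all_ok]

Result checker circuit option 2

P7.9.4 Probability of Catching a Fault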


Find the approximate probability of a fault not being detected

Answer:

We have a sequence of 7 bits coming from the circuit under test.

This gives us \(2^7 = 128\) possible sequences. Of these, 1 is the good sequence and 127 are faulty sequences.

The signature analyzer stores 2 bits of data, which gives us 4 possible values.

Thus, on average 128/4 = 32 different result sequences will map to the same 2-bit signature.

Of these 32 vectors, 1 is the good sequence and 31 are faulty sequences.

Assume that each result sequence is equally likely to occur.

(NOTE: this is a poor assumption, a full analysis would make each stuck-at fault equally likely, then compute the result vector for each fault.)

With this assumption, there is a 31/127 = 24% chance that a faulty sequence will result in the same signature as the good sequence.

There is approximately a 24% chance that a faulty circuit will not be detected.



If we increase the size of the signature analyzer by one flip flop, by how much do we change the approximate probability of a fault not being detected?

Answer:

A signature analyzer with 3 bits of data gives us 8 possible values.

Thus, on average 128/8 = 16 different result sequences will map to the same 3-bit signature.

Assuming that each result sequence is equally likely to occur, there is a 15/127 = 11.8% chance that a faulty sequence will result in the same signature as the good sequence.

There is approximately a 12% chance that a faulty circuit will not be detected.

Thus, we have decreased the probability of a faulty circuit not being detected from 24% to 12%.
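Both estimates follow the same pattern, which can be written as a small Python sketch. It keeps the assumption stated above that every result sequence is equally likely; the function name is hypothetical.

```python
def undetected_fault_probability(sequence_bits, signature_bits):
    """Chance that a faulty result sequence aliases to the good signature."""
    total_sequences = 2 ** sequence_bits
    per_signature = total_sequences // 2 ** signature_bits
    # One sequence per signature bucket is the good one; the rest are faulty.
    return (per_signature - 1) / (total_sequences - 1)


print(round(undetected_fault_probability(7, 2), 3))  # 0.244
print(round(undetected_fault_probability(7, 3), 3))  # 0.118
```

Each extra signature flop roughly halves the aliasing probability.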

P7.9.6 Detecting a Specific Fault

Determine whether an L7@0 fault is detectable.

Answer:

Equation for faulty circuit: a AND b.

Faulty sequence of values from circuit:

 a   b   c   z
 1   1   1   1   x^6
 1   0   1   0   x^5
 1   0   0   0   x^4
 0   1   0   0   x^3
 0   0   1   0   x^2
 1   1   0   1   x^1
 0   1   1   0   x^0

Polynomial for result sequence: \(x^6 + x\)

Compute remainder

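                                   1x^4 + 1x^3 + 0x^2 + 1x^1 + 1x^0
                     ----------------------------------------------
1x^2 + 1x^1 + 1x^0 ) 1x^6 + 0x^5 + 0x^4 + 0x^3 + 0x^2 + 1x^1 + 0x^0
                     1x^6 + 1x^5 + 1x^4
                     ------------------
                            1x^5 + 1x^4 + 0x^3
                            1x^5 + 1x^4 + 1x^3
                            ------------------
                                   0x^4 + 1x^3 + 0x^2
                                   0x^4 + 0x^3 + 0x^2
                                   ------------------
                                          1x^3 + 0x^2 + 1x^1
                                          1x^3 + 1x^2 + 1x^1
                                          ------------------
                                                 1x^2 + 0x^1 + 0x^0
                                                 1x^2 + 1x^1 + 1x^0
                                                 ------------------
                                                        1x^1 + 1x^0

Quotient    1x^4 + 1x^3 + 1x^1 + 1x^0

Remainder   1x^1 + 1x^0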

This remainder is different from the remainder for the correct circuit, thus the fault will be detected.
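The comparison can also be made by simulating a serial signature analyzer bit by bit. The sketch below is a minimal Python model, assuming a shift-register update that is arithmetically equivalent to the long division (shift in one bit, then XOR in the low-order polynomial coefficients when the bit shifted out is 1). The function name and parameter names are hypothetical, and the good sequence is the z column from P7.9.3.

```python
def signature(bits, poly_low_bits, n_flops):
    """Remainder of the bit stream (MSB first) modulo the analyzer polynomial.

    poly_low_bits holds the polynomial coefficients below x^n_flops,
    e.g. 0b11 for x^2 + x + 1. Bit i of the result is the coefficient of x^i.
    """
    mask = (1 << n_flops) - 1
    reg = 0
    for bit in bits:
        feedback = (reg >> (n_flops - 1)) & 1
        reg = ((reg << 1) | bit) & mask
        if feedback:
            reg ^= poly_low_bits
    return reg


good = signature([1, 0, 0, 0, 0, 1, 1], 0b11, 2)    # x^6 + x + 1
faulty = signature([1, 0, 0, 0, 0, 1, 0], 0b11, 2)  # x^6 + x (L7@0)
print(bin(good), bin(faulty))  # 0b10 0b11
```

The two signatures differ, which agrees with the conclusion that the fault is detected.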

In hardware:

[Figure: timing diagram of the signature analyzer for the faulty sequence, showing clk, i, d0, q0, d1, and q1, with the quotient and remainder bits marked]

P7.9.7 Time to Run Test

Find the number of clock cycles to run the test

Answer: For a maximal-length LFSR of n bits, it takes 1 clock cycle to reset the circuit, \(2^n - 1\) clock cycles to generate the \(2^n - 1\) test vectors, and finally one cycle at the end to flop the results. This gives a total of \(2^n + 1\) clock cycles, which in our case is 9.


P7.10 Power and BIST

You add a BIST circuit to a chip. This causes the chip to exceed the power envelope that marketing has dictated is needed. What can you do to reduce the power consumption of the chip without negatively affecting performance or incurring significant design effort?

Answer: When in test mode, run the clock at a lower frequency so that the chip will consume less power.

Add clock gating to the signature analyzer so that it is turned off when the chip is in normal mode.

P7.11 Timing Hazards and Testability

This question deals with the following circuit:

[Figure: circuit with inputs a, b, c, output z, and lines L1 to L15]

1. Does the circuit have any untestable single-stuck-at faults? If so, identify them.

Answer:

[Figure: Karnaugh map over a and bc]

None of the minterms are completely covered by other minterms, so the circuit is irredundant and does not have undetectable faults.

The two minterms ac and ab overlap, but neither is completely covered by other minterms. So, if one of them was stuck at 0, there would be at least one set of input values that would cause the faulty circuit to differ from the correct circuit.

2. Does the circuit have any static timing hazards?

Answer: Moving from abc to abc moves between minterms. Thus, there is a potential timing hazard.

[Figure: timing diagram of a, b, c, and z]

Potential glitch (static hazard)

3. Add any circuitry needed to prevent static timing hazards in the circuit below, then identify any untestable single-stuck-at faults in the resulting circuit.

Answer:

[Figure: Karnaugh map over a and bc, and the revised circuit with inputs a, b, c, output z, and lines L1 to L19; the faults L9@0, L13@0, L16@0, L17@0, L18@0, and L19@0 are marked]

The minterms ab and bc are both completely covered by other minterms. Thus, these minterms are redundant and are sources of undetectable faults.

This gives us L13@0 and L19@0 as undetectable single stuck-at faults.

Using gate collapsing, we see that the following faults are equivalent to L13@0: L9@0, L16@0.

And the following are equivalent to L19@0: L17@0, L18@0.

NOTE: although both L16@0 and L17@0 are undetectable, this does not mean that L2@0 is undetectable. L2@0 is equivalent to having both L16@0 and L17@0 at the same time. Check the Boolean equations if you are in doubt about this.

P7.12 Testing Short Answer

P7.12.1 Are there any physical faults that are detectable by scan testing but not by built-in self testing?

Answer: Yes.

• A fault that is only detectable with 000 will be detectable by scan testing but not by built-in self test.

• A fault that results in the same signature as the correct circuit will be detectable by scan testing but not by built-in self test.

P7.12.2 Are there any physical faults that are detectable by built-in self testing but not by scan testing?

Answer: No.

Any fault that is detectable by built-in self testing can be detected by scan testing where the test vector that we scan in is the BIST test vector that triggers the fault.

If “scan testing” is interpreted as “boundary scan testing” and built-in self test is allowed inside a chip, then there are faults that are detectable by built-in self test but not by boundary scan testing. These faults would be inside redundant sequential circuitry. But, this scenario was not intended to be part of this question.

P7.13 Fault Testing

In this question, you will design and analyze built-in self test circuitry for the circuit-under-test shown below.

P7.13.1 Design test generator

Draw the schematic for a 2-bit maximal-length linear feedback shift register and demonstrate that it is maximal length.

Answer:

[Figure: 2-bit LFSR built from two set/reset flops (signals d0, q0, d1, q1) and its timing diagram against clk, showing the value sequence 3, 1, 2, 3]

P7.13.2 Design signature analyzer

Design a signature analyzer circuit for a characteristic polynomial of \(x + 1\).

Answer:

[Figure: signature analyzer built from a single set/reset flop]

P7.13.3 Determine if a fault is detectable

Is a stuck-at-1 fault on the output of the inverter detectable with the circuitry that you’ve designed?

Answer:

1. Equation for correct circuit-under-test is \(a \oplus b\).

    a   b   output
    1   1   0
    1   0   1
    0   1   1

2. Simulating correct output sequence 011 through signature analyzer:

    i        0   1   1
    d0       0   1   0
    q0   0   0   1   0

3. Equation for faulty circuit-under-test is \(ab + \overline{a}\,\overline{b}\).

    a   b   output
    1   1   1
    1   0   0
    0   1   0

4. Simulating faulty output sequence 100 through signature analyzer:

    i        1   0   0
    d0       1   1   1
    q0   0   1   1   1

5. Output of signature analyzer is different from correct circuit, so the fault will be detected.
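The five steps above can be sketched end to end in Python. For the polynomial x + 1 the analyzer's single flop simply accumulates the XOR of the bits it has seen (d0 = i XOR q0), which is what the sketch models. The vector list is copied from the tables above, and the function name is hypothetical.

```python
# (a, b) test vectors in the order produced by the 2-bit LFSR (values 3, 2, 1
# when read as the two-bit number ab).
VECTORS = [(1, 1), (1, 0), (0, 1)]


def parity_signature(outputs):
    """Single-flop signature analyzer for x + 1: q0 <= i XOR q0."""
    q0 = 0
    for bit in outputs:
        q0 ^= bit
    return q0


good_outputs = [a ^ b for a, b in VECTORS]           # correct circuit: a XOR b
faulty_outputs = [1 - (a ^ b) for a, b in VECTORS]   # faulty circuit: XNOR

print(good_outputs, parity_signature(good_outputs))      # [0, 1, 1] 0
print(faulty_outputs, parity_signature(faulty_outputs))  # [1, 0, 0] 1
```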

P7.13.4 Testing time

How many clock cycles does your BIST circuitry require to test the circuit under test? Explainhow each clock cycle is used.

Answer:

1. reset circuit

2. run first of three test vectors

3. run second of three test vectors

4. run third of three test vectors

5. flop result from circuit under test into signature analyzer

5 clock cycles.
