
    Latch-based FPGA emulation method for design verification: case study with microprocessor

    M. Kim, J. Kong, T. Suh and S.W. Chung

    Using latches in a digital design is generally considered bad practice owing to

    timing issues. Field-programmable gate array (FPGA) vendors also recommend

    flip-flops instead of latches in emulation. In this reported work, however,

    the usefulness and benefit of utilising latches in FPGA emulation for

    processor design verification is demonstrated. The study shows that a

    latch-based register file provides the seamless capability of

    functionality validation, whereas the flip-flop based one requires modi-

    fication to the original design, potentially harming the completeness

    of functional verification. Experiment results with Xilinx and Altera

    devices show marginal differences in terms of emulation performance

    and area requirement in both approaches. This study reveals that

    replacing SRAM with latches rather than flip-flops is appealing and

    preferable in emulation with FPGAs.

    Introduction: In digital design, one of the most time-consuming

    processes is verification. Software-based hardware description language

    (HDL) simulation is beneficial in the sense that internal signals of interest

    can be observed. However, it is impractical to validate logic with high

    complexity using HDL simulation because of intolerable simulation

    time. To remedy this shortcoming, field-programmable gate array

    (FPGA) based emulation has been most widely used. It provides the

    capability of validating the design more than 1000 times faster than the

    traditional software-based simulation [1]. However, the FPGA-based

    emulation often requires modification to the original design owing to

    the restricted internal structures and limited resources in FPGAs. For

    example, large caches in modern microprocessors do not typically fit

    into a single FPGA, and they should be split into several FPGAs.

    A microprocessor is one of the most complex digital designs including

    various logics and memories. Its validation requires the exhaustive

    coverage of different combinations of all the instructions, interrupts and

    exceptions. Therefore, FPGA-based emulation is typically an inevitable

    step in the design process. However, some of the logics are not seamlessly

    translated to the FPGA fabric. One such logic is the register file, since it

    is often custom-designed with SRAM [2] and the required number of ports

    varies depending on the instruction set architecture (ISA). A simple

    dual-issue microprocessor usually requires two write ports and four read ports

    in the register file [2]. In FPGAs, the memory elements (see Note) support

    a limited number of ports. For example, the Altera Cyclone II [3] provides

    only two read ports and one write port in the memory element. Thus, the

    register file should be converted by using the logic elements and there are

    two options for implementation: latches or flip-flops. FPGA vendors

    recommend flip-flops rather than latches, insisting that using latches

    incurs complicated timing problems [4].

    The operational difference between latches and flip-flops has a direct

    effect on the digital design. A flip-flop is an edge-triggered device

    enabling a write operation at a rising (or falling) edge of a clock,

    whereas a latch is a level-triggered one at the high (or low) level of a

    clock. Therefore, the operation of a latch-based register file is similar to

    that of the original SRAM-based design. The adoption of the flip-flop-based

    register file in emulation requires the modification of the original

    design, potentially affecting the validation correctness. Specifically, it

    causes the Read-After-Write (RAW) hazard, which does not exist in

    the original SRAM-based register file. The hazard occurs when the

    destination register of a write operation is the same as the source register

    of a subsequent read operation. Owing to the edge-triggered nature of a

    flip-flop-based register file, the data to be read is not available in the

    current clock cycle because the write operation occurs at the end of the

    clock period. Therefore, the hazard should be resolved by adding

    additional forwarding paths or by stalling the microprocessor. This

    design change impedes the main purpose of emulation and could harm

    the completeness of functional verification.
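
    The difference between the two register-file styles can be illustrated with a small behavioural sketch. This is not the authors' HDL; it is a minimal Python model, and the class and function names are hypothetical. It assumes the write-back happens in the first half of a cycle and the decode-stage read in the second half, as described in the text.

```python
class FlipFlopRegisterFile:
    """Edge-triggered model: a write issued during a cycle only becomes
    visible after the clock edge that closes the cycle."""

    def __init__(self, size=16):
        self.regs = [0] * size
        self.pending = None  # (index, value) waiting for the clock edge

    def write(self, index, value):
        self.pending = (index, value)

    def read(self, index):
        return self.regs[index]

    def clock_edge(self):
        if self.pending is not None:
            index, value = self.pending
            self.regs[index] = value
            self.pending = None


class LatchRegisterFile:
    """Level-sensitive model: the latch is transparent while the write is
    enabled, so the value is visible later in the same cycle."""

    def __init__(self, size=16):
        self.regs = [0] * size

    def write(self, index, value):
        self.regs[index] = value

    def read(self, index):
        return self.regs[index]

    def clock_edge(self):
        pass  # nothing is deferred to the edge in this model


def same_cycle_write_then_read(register_file, index, value):
    """Write in the first half of a cycle, read in the second half,
    then close the cycle. Returns the value the read observed."""
    register_file.write(index, value)
    observed = register_file.read(index)
    register_file.clock_edge()
    return observed


latch_seen = same_cycle_write_then_read(LatchRegisterFile(), 0, 1)
ff_file = FlipFlopRegisterFile()
ff_seen = same_cycle_write_then_read(ff_file, 0, 1)
ff_next_cycle = ff_file.read(0)
```

    In this model the latch-based file returns the new value within the same cycle, whereas the flip-flop-based file returns the stale value and exposes the new one only in the following cycle, which is the RAW hazard discussed above.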

    In this Letter, we implement a microprocessor with the latch-based

    register file for validation using FPGA emulation and compare it with

    the flip-flop-based one in terms of performance and area. Throughout

    the Letter, we show the usefulness and benefit of using latches in

    validation with FPGAs.

    Implemented microprocessors: We compare two versions of a micro-

    processor in emulation: one with a latch-based register file (Pl) and

    the other with a flip-flop-based register file (Pff). Note that Pff requires

    special forwarding paths to overcome the RAW hazard explained

    earlier. The processor is based on ARM9, which has five pipeline

    stages: Instruction Fetch (IF), Instruction Decode (ID), Execution

    (EX), Memory Access (MEM), and Write-Back (WB). It is based on

    ARMv5 instructions, excluding supplementary instructions such as coprocessor,

    Thumb, and load/store multiple instructions.

    The register file in Pl consists of 15 latch-based registers and one

    flip-flop-based register; the 15 registers are general purpose registers and the

    only register with flip-flops is the program counter (PC). Since latches

    are level-triggered, the data written in the first half of the clock can be

    read in the second half of the clock. Thus, the RAW dependency is

    naturally resolved without any additional forwarding path. Fig. 1

    shows an example of the dependency. In the case of the latch-based

    register file, the result of the first instruction (mov r0, #1) is written

    back in the register r0 in the first half of clock cycle 4. In the second

    half of the same clock cycle, the register r0 is read by the fourth instruction

    (add r4, r0, r5). Therefore, the register file in Pl does not need a

    forwarding path from the WB stage to in front of the ID/EX pipeline

    register (dotted arrows in Fig. 1). Note that Pff requires this forwarding

    path to resolve the hazard.

    [Fig. 1 pipeline diagram: instructions 0 (mov r0, #1), 1 (mov r1, #1), 2 (subs r3, r2, #1) and 3 (add r4, r0, r5), each passing through IF, ID, EX, MEM and WB, plotted against clock cycles 0 to 7; register R0 is marked]

    Fig. 1 Example of forwarding from WB to in front of ID/EX pipeline register
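
    The cycle numbers quoted for Fig. 1 follow from simple pipeline arithmetic. The sketch below assumes one instruction enters the pipeline per cycle starting at cycle 0, with no stalls; the helper name `stage_cycle` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "mov r0, #1",       # instruction 0
    "mov r1, #1",       # instruction 1
    "subs r3, r2, #1",  # instruction 2
    "add r4, r0, r5",   # instruction 3
]


def stage_cycle(instruction_no, stage):
    """Clock cycle in which an instruction occupies a stage, assuming one
    instruction is fetched per cycle from cycle 0 and nothing stalls."""
    return instruction_no + STAGES.index(stage)


# Instruction 0 writes r0 back in the same cycle in which
# instruction 3 reads r0 in its decode stage.
write_back_cycle = stage_cycle(0, "WB")
decode_cycle = stage_cycle(3, "ID")
last_cycle = stage_cycle(len(PROGRAM) - 1, "WB")
```

    Under these assumptions both the write-back of instruction 0 and the decode of instruction 3 fall in clock cycle 4, and the last instruction completes in cycle 7, matching the cycle axis of Fig. 1.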

    During the actual implementation of Pl, however, the register file

    suffered from timing errors caused by glitches. To remove the glitches, we

    utilised an AND gate. Inputs to the AND gate are a phase-shifted

    clock signal (90° in our study) and the original write enable. Then,

    the output of the AND gate is connected to the write enable of each

    register. As a result, the write enable signal is kept low for one fourth

    of a clock cycle, ignoring wrong data generated by timing errors, as

    shown in Fig. 2. Note that the AND gate is located inside the register

    file and does not affect the original processor design outside the register

    file. The 90° phase-shifted clock is not specially contrived for the

    latch-based register file. It was constructed to maintain the same memory

    (or cache) access latency of one cycle as the original design in the

    MEM pipeline stage. The read latency of the memory elements in

    FPGAs is more than one cycle because of their input registers (flip-flops).

    [Fig. 2 waveforms: original clock (0° phase shifted); phase-shifted clock (90° phase shifted); write enable of register (before conjugation); write enable of register (after conjugation); data, with the wrong-data interval marked]

    Fig. 2 Resolving glitch by utilising phase shifted clock
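
    The gating described above can be sketched as a sampled waveform. This is an illustration only: it assumes a 50% duty-cycle clock with a normalised period of 1.0 and, as a labelled assumption not stated in the Letter, that the original write enable is asserted during the first half of each cycle. All function names are hypothetical.

```python
def square_wave(t, phase_deg=0.0):
    """50% duty-cycle clock with period 1.0, delayed by phase_deg degrees.
    Returns True while the (shifted) clock is high."""
    shifted = (t - phase_deg / 360.0) % 1.0
    return shifted < 0.5


def original_write_enable(t):
    # Assumption for this sketch: the write enable is asserted during
    # the first half of every clock cycle.
    return (t % 1.0) < 0.5


def gated_write_enable(t):
    """Output of the AND gate: 90-degree shifted clock AND write enable."""
    return square_wave(t, 90.0) and original_write_enable(t)


# Sample one clock cycle at eight evenly spaced points.
sample_times = [i / 8 for i in range(8)]
before = [original_write_enable(t) for t in sample_times]
after = [gated_write_enable(t) for t in sample_times]

# Points where the gate suppresses a write enable that was asserted.
suppressed = [t for t, b, a in zip(sample_times, before, after) if b and not a]
```

    With these assumptions the gated enable stays low for the first quarter of the cycle, so data that is still settling during that interval is not latched.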

    The register file in Pff purely consists of flip-flops and enables a write

    operation only at the rising (or falling) edge of a clock cycle. As a result,

    the read and write operation to a register cannot take place in the same

    clock cycle, resulting in the RAW hazard. There are two options to

    resolve the RAW hazard: forwarding from the WB stage to in front of

    the ID/EX pipeline register, or stalling to prevent the execution of the fourth

    instruction with wrong data. Stalling the processor for one cycle leads to a

    ELECTRONICS LETTERS 28th April 2011 Vol. 47 No. 9


    different execution time of a program compared to the original design

    with the SRAM-based register file. Furthermore, the stall logic should

    be added as well. The forwarding option resolves the RAW hazard

    without affecting the execution cycle time. Nevertheless, the forwarding

    path is located outside the register file and may cause unexpected side-

    effects such as functional errors hidden in the extra forwarding path.

    Thus, Pff requires an extra verification process after replacing the register

    file with the original SRAM-based one and removing forwarding paths.
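
    The two workarounds for the flip-flop-based register file can be contrasted with a short sketch. It assumes a five-stage pipeline with one instruction issued per cycle and a one-cycle stall per conflict, as described above; the function and parameter names are hypothetical and do not come from the Letter.

```python
def read_operand(regs, source, wb_dest, wb_value, forwarding):
    """Value seen by the decode stage of a flip-flop-based register file
    while a write-back to wb_dest is still pending in the same cycle."""
    if forwarding and wb_dest == source:
        return wb_value  # bypass the register file
    return regs[source]  # stale if wb_dest == source


def execution_cycles(n_instructions, n_conflicts, stall_on_hazard, depth=5):
    """Total cycles for a straight-line program: pipeline fill plus one
    cycle per instruction, plus one stall cycle per WB/ID conflict."""
    base = depth + n_instructions - 1
    return base + (n_conflicts if stall_on_hazard else 0)


regs = [0] * 16  # r0 still holds its old value, 0

# Fourth instruction of Fig. 1 reads r0 while "mov r0, #1" is in WB.
with_forwarding = read_operand(regs, 0, wb_dest=0, wb_value=1, forwarding=True)
without_forwarding = read_operand(regs, 0, wb_dest=0, wb_value=1, forwarding=False)

cycles_forwarding = execution_cycles(4, 1, stall_on_hazard=False)
cycles_stalling = execution_cycles(4, 1, stall_on_hazard=True)
```

    Forwarding preserves the cycle count of the original design but adds logic outside the register file, while stalling keeps the datapath unchanged at the cost of an extra cycle per conflict.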

    Analysis and discussion: In this Section, we present experiment results

    with FPGAs (Altera Cyclone II and Xilinx XC3S500E FPGAs):

    maximum frequency and area for Pl and Pff. The maximum frequency

    is obtained by analysing the critical path of Pl and Pff from the synthesis

    report of the design tools for each FPGA (Altera Quartus II 9.1 Web

    Edition and Xilinx ISE 12.2). The area is also obtained from the same

    report.

    The maximum clock rates of Pl and Pff are similar on both FPGAs, as

    shown in Table 1. Cyclone II reports a 5 MHz lower frequency for Pl than

    that of Pff. XC3S500E reports exactly the same frequency for Pl and Pff.

    The difference in clock rates is caused by the characteristic of the storage

    elements (flip-flops or latches) in each FPGA. Cyclone II has configur-

    able storage elements called dedicated logic registers, which are located

    inside each logic element. However, the dedicated logic registers can

    only be used as flip-flops. In other words, the latches are implemented

    by configuring and routing logic elements, consuming more logic

    elements. On the other hand, XC3S500E can configure the storage

    elements (called slice flip-flops) as latches. Hence, the implementation

    of a latch does not require an additional logic element to be configured

    or routed, compared to the flip-flop implementation. This feature of

    Cyclone II has a more significant impact on the area. Pl occupies a larger

    area than Pff by 14.3% on Cyclone II, while Pl utilises only a 0.2%

    larger area than Pff on XC3S500E.

    Table 1: Area and performance of Pl and Pff

    FPGA type                      Altera Cyclone II                        Xilinx XC3S500E
    Register file type             Flip-flop based (Pff)  Latch based (Pl)  Flip-flop based (Pff)  Latch based (Pl)
    Area                           4058 LEs               4639 LEs          2474 slices            2478 slices
    Performance (clock frequency)  55 MHz                 50 MHz            35 MHz                 35 MHz
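
    The area overheads quoted in the text can be reproduced from the Table 1 figures. The snippet below is a plain arithmetic check; the helper name `overhead_percent` is hypothetical.

```python
def overhead_percent(latch_area, flipflop_area):
    """Relative area increase of the latch-based design (Pl) over the
    flip-flop-based design (Pff), in percent."""
    return 100.0 * (latch_area - flipflop_area) / flipflop_area


# Area figures from Table 1.
cyclone_ii = overhead_percent(4639, 4058)  # logic elements
xc3s500e = overhead_percent(2478, 2474)    # slices

# Clock frequency gap on Cyclone II, in MHz (Pff minus Pl).
cyclone_ii_freq_gap = 55 - 50

print(f"Cyclone II area overhead: {cyclone_ii:.1f}%")
print(f"XC3S500E area overhead:   {xc3s500e:.1f}%")
```

    Rounded to one decimal place, the results agree with the 14.3% and 0.2% stated in the discussion.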

    Conclusion: We have demonstrated the usefulness and benefit of utilis-

    ing latches in emulation with FPGAs. In the processor emulation, the

    latch-based register file provides the seamless capability of functional

    validation, whereas the flip-flop-based one requires extra logic in a

    processor, which potentially harms the functional verification. Neither

    approach shows a notable difference in terms of emulation

    speed and area requirement. Our study shows that the latch-based approach

    for the register file is appealing and preferable in functional validation

    with emulation using FPGAs.

    Note: An FPGA usually includes two kinds of elements: memory

    elements and logic elements. A memory element can only be configured

    as memory, whereas a logic element can be configured into many

    different kinds of combinational or sequential logic.

    Acknowledgments: This work was supported in part by the Ministry of

    Knowledge Economy, Korea, under the Information Technology

    Research Centre support programme supervised by the National IT

    Industry Promotion Agency (NIPA-2011-C1090-1121-0010).

    © The Institution of Engineering and Technology 2011

    19 February 2011

    doi: 10.1049/el.2011.0462

    M. Kim, J. Kong and S.W. Chung (Division of Computer and

    Communication Engineering, Korea University, Seoul 136-713,

    Republic of Korea)

    E-mail: [email protected]

    T. Suh (Department of Computer Science Education, College of

    Education, Korea University, Seoul 136-713, Republic of Korea)

    References

    1 Nakamura, Y., Hosokawa, K., Kuroda, I., Yoshikawa, K., and Yoshimura, T.: 'A fast hardware/software co-verification method for system-on-chip by using a C/C++ simulator and FPGA emulator with shared register communication'. Proc. of 41st Annual Design Automation Conf. (DAC'04), San Diego, CA, USA, 2004, pp. 299-304

    2 Homayoun, H., Gupta, A., Veidenbaum, A., Sasan, A., Kurdahi, F., and Dutt, N.: 'RELOCATE: register file local access pattern redistribution mechanism for power and thermal management in out-of-order embedded processor', Lect. Notes Comput. Sci., 2010, 5952/2010, pp. 216-231

    3 Altera Corporation: 'Cyclone II memory blocks', Cyclone II Device Handbook, Vol. 1, Chapter 8, February 2008

    4 Xilinx: 'Xilinx design reuse methodology for ASIC and FPGA designers', Reuse Methodology Manual For System-on-Chip Designs
