
Bar Ilan University

School of Engineering

VLSI Lab

Data Driven

Clock Gating

Academic Advisor: Prof. Shmuel Wimer

Instructor: Mr. Moshe Doron

Industry correspondent: Mr. Roey Mioni

Dov Gropper

Dvir Shasha

Final Fourth Year Project

Computer Engineering


Table of Contents

Main Project Goals
Motivation
Theory
Design Flow
    Design:
    Simulation environment:
    Iterative Perfect Matching Algorithm (IPM):
    Clock gating Implementation:
Hardware and Design Components
Problems and Solutions
Direct Memory Access Controller
    Behavior
    System level
    The block diagram of the DMA controller's state machine:
    The Design:
    Top design, with Verification diagram:
Results
    The SpyGlass Results:
    Result review:
Conclusions
References and Sources
Appendices
    DMAC Spec:


Main Project Goals

Data Driven Clock Gating is a research study by Professor Shmuel Wimer. Its main purpose is to reduce the power consumption of electronic circuits.

Our project implements the technique described in Professor Wimer’s research on a design at the register transfer level (RTL).

Our project consisted of the following stages:

● Implementation of Data-Driven Clock Gating on a given design.

● Creation of a design flow for this implementation.

● Estimation of the resulting reduction in power consumption.

Motivation

The increasing demand for low-power mobile computing and consumer electronics products has refocused VLSI design in the last two decades on lowering power and increasing energy efficiency. Power reduction is treated at all design levels of VLSI chips, from the architecture through the block and logic levels, down to gate-level circuit and physical implementation. One of the major dynamic power consumers is the system clock signal, which is typically responsible for up to 50% of the total dynamic power consumption. Clock network design is a delicate procedure and is therefore done in a very conservative manner under worst-case assumptions. It incorporates many diverse aspects, such as the selection of sequential elements, control of the clock skew, and the choice of topology and physical implementation of the clock distribution network.


Theory

Clock gating

Several techniques to reduce the dynamic power have been developed, of which clock gating is predominant. Ordinarily, when a logic unit is clocked, its underlying sequential elements receive the clock signal regardless of whether or not they will toggle in the next cycle.

Clock enabling signals are usually introduced by designers during the system and clock design phases, where the inter-dependencies of the various functions are well understood. In contrast, it is very difficult to define such signals at the gate level, especially in control logic, since the inter-dependencies among the states of various flip-flops depend on automatically synthesized logic. There is a big gap between block disabling that is driven from the HDL definitions and what can be achieved with knowledge of the flip-flops' activities and how they are correlated with each other.

The research presents an approach to maximize clock disabling at the gate level, where the clock signal driving a flip-flop is disabled (gated) when the flip-flop's state is not subject to change in the next clock cycle.

Clock gating does not come for free. Extra logic and interconnects are required to

generate the clock enabling signals, and the resulting area and power overhead must be

considered. In the extreme case, each clock input of a flip-flop can be disabled

individually, yielding maximum clock separation. This, however, results in high overhead.

Thus, the clock disabling circuit is shared by a group of several flip-flops in an attempt to

reduce the overhead.


On the other hand, such grouping may lower the disabling effectiveness, since the clock will be disabled only when the inputs to all the flip-flops in a group do not change. It is therefore beneficial to group flip-flops whose switching activities are highly correlated and derive a joint enabling signal for them.

This requires gathering statistical information about the flip-flops' activity using simulations and statistical analysis.
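The effect of correlation on a shared gater can be illustrated with a minimal Python sketch. The toggle vectors and the function name `gated_cycles` are hypothetical and serve only to illustrate the idea; they are not part of the project's tools.

```python
from typing import List

def gated_cycles(toggles: List[List[int]]) -> int:
    """Count cycles in which a shared clock gate can block the clock.

    toggles[i][t] is 1 if flip-flop i changes state in cycle t, else 0.
    The group's clock may be disabled only when no member toggles.
    """
    n_cycles = len(toggles[0])
    return sum(1 for t in range(n_cycles)
               if not any(ff[t] for ff in toggles))

# Hypothetical toggle vectors over 8 cycles.
ff_a = [1, 0, 0, 1, 0, 0, 0, 0]
ff_b = [1, 0, 0, 1, 0, 0, 0, 0]   # identical activity to ff_a
ff_c = [0, 1, 0, 0, 1, 0, 1, 0]   # toggles at different times

correlated = gated_cycles([ff_a, ff_b])     # clock blocked in 6 of 8 cycles
uncorrelated = gated_cycles([ff_a, ff_c])   # clock blocked in 3 of 8 cycles
```

Grouping the two correlated flip-flops keeps the clock disabled twice as often as grouping the uncorrelated pair, even though each flip-flop on its own is mostly idle.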

Another issue that influences the effectiveness of the suggested technique is the fan-out of the gater. The theory presents a formula for calculating the optimal fan-out of the gater, referred to as k:

where:

q is the probability that a flip-flop's input is stable.

C_FF, C_latch, C_w are the capacitances of the extra flip-flops, latches, and wires.

In our project, we approached this issue by implementing different fan-outs and estimating the effect of each one on the power consumption reduction.


The graph above shows the normalized net power savings per flip-flop obtained by adaptive gating at the first level of the clock tree, according to the equation above. The savings are relative to the non-gated case. The optimal fan-out is marked for each toggling probability.

Using the gathered statistical information and the optimal fan-out, we could form groups of matching flip-flops for clock gating.


Design Flow

Design:

The design flow begins with a design in RTL. It is important to begin with a design that has been proven to work properly. The design must not include any IP (intellectual property) blocks or RTL sources that are not visible to the user and therefore cannot be edited.

At this point, the design flow supports implementation for a single clock domain only. Moreover, the sequential and combinational logic must be separated in the RTL in order for the scripts to run properly.


Simulation environment:

In this stage, simulations of the RTL are performed and statistical information is gathered for analysis. The simulation must represent typical use of the design, so that the statistical information is realistic. Currently, one simulation per design is supported. The simulation runs with Cadence's SimVision.

The Simulation environment steps:

● Add tracing code to the design. This is done by running the program ftrc.exe. Before running it, the user must modify the file inputs.rti and set the following attributes:

○ The name of the design file list, including its extension (*.vc).

○ Whether the output design is written as one file or as multiple files.

● Add tracing code to the test-bench manually. The code must be added before the DUT instantiation.

● At this point, the simulation can run.

● The simulation produces two output files:

○ Activities.rpt - a report listing the active flip-flops at each time point, with time given in milliseconds.

○ FF_lists.rpt - a report listing the flip-flops in the design.

Iterative Perfect Matching Algorithm (IPM):

This algorithm runs iteratively to select flip-flops whose toggling is highly correlated.

The following steps are to be taken:

● Run PrepareIPM.exe. This program takes activity.rpt and FF_list.rpt and creates separate files for each domain.

● Modify the batch file IPM_Run.bat as follows:

○ The first parameter is the name of the executable and should not be changed.

○ The second is the number of iterations the algorithm will run. The number of iterations determines the fan-out of the gater: 2^x = k (where x is the number of iterations and k is the gater fan-out).

○ The third is the maximal physical distance between flip-flops that will be gated together. It is currently not used.

○ The fourth and fifth parameters are the names of the output files from the previous stage.

○ Batch file example:

"Iterative Perfect Matching.exe" 7 50 "Activity_0.rpt" "FFList_0.rpt" > log.txt

● Run IPM_Run.bat

● The IPM output consists of one file per iteration and domain, each containing the resulting flip-flop groups.
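The relation 2^x = k between iterations and fan-out can be illustrated with a short Python sketch of iterative pairing. This is not the IPM program itself: it swaps in a simple greedy pairing instead of a true minimum-cost perfect matching, and the cost function (cycles in which exactly one of two vectors toggles), the function names and the activity data are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List

def pair_cost(a: List[int], b: List[int]) -> int:
    """Cycles in which exactly one of the two vectors toggles (wasted clock pulses)."""
    return sum(x != y for x, y in zip(a, b))

def merge_once(vectors: List[List[int]], groups: List[List[str]]):
    """One pairing round: greedily merge the cheapest remaining pairs."""
    order = sorted(combinations(range(len(vectors)), 2),
                   key=lambda p: pair_cost(vectors[p[0]], vectors[p[1]]))
    used, new_vecs, new_groups = set(), [], []
    for i, j in order:
        if i in used or j in used:
            continue
        used.update((i, j))
        # The merged group needs a clock whenever either member toggles.
        new_vecs.append([x | y for x, y in zip(vectors[i], vectors[j])])
        new_groups.append(groups[i] + groups[j])
    for i in range(len(vectors)):          # carry over an unpaired leftover
        if i not in used:
            new_vecs.append(vectors[i])
            new_groups.append(groups[i])
    return new_vecs, new_groups

def group_flip_flops(activity: Dict[str, List[int]], iterations: int):
    """After x iterations each full group holds k = 2**x flip-flops."""
    names = sorted(activity)
    vectors = [activity[n] for n in names]
    groups = [[n] for n in names]
    for _ in range(iterations):
        vectors, groups = merge_once(vectors, groups)
    return groups

# Hypothetical toggle vectors for four flip-flops.
activity = {
    "ff0": [1, 0, 0, 1],
    "ff1": [1, 0, 0, 1],
    "ff2": [0, 1, 1, 0],
    "ff3": [0, 1, 0, 0],
}
pairs = group_flip_flops(activity, 1)   # k = 2
quads = group_flip_flops(activity, 2)   # k = 4
```

With one iteration the sketch pairs ff0 with ff1 and ff2 with ff3; a second iteration merges the two pairs into a single group of four.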


Clock gating Implementation:

This step uses an executable file, “rcg.exe”, that creates a clock gate for each group of flip-flops received from the IPM, thus completing the process.

These are the necessary steps:

● Modify the “inputs.rcg” file with the following:

○ The name of the design file list (*.vc).

○ The name of the folder containing the flip-flop lists for grouping.

○ Whether the output design is written as one file or as multiple files.

● Run “rcg.exe”.

Design with Data Driven Clock Gating:

This design has the same logic behavior as the initial design. However, it contains the

gate components according to the implementation process.

It is important to note that the process begins and ends with a design in RTL.


Hardware and Design Components

Direct Memory Access Controller in RTL design:

We implemented our design flow on a Direct Memory Access Controller in RTL. The design does not include any IP (intellectual property) blocks and was developed by Mr. Moshe Doron in Bar-Ilan’s VLSI lab. The design has a single clock domain, and after editing, its sequential and combinational logic are separated.

Altera's FPGA board DE-115:

This is the first board used for implementation of the design (before and after) in order to

measure the power usage.

Altera's Quartus II:

We used Altera’s EDA tool, Quartus II, which is design software for programmable logic devices. The features we used include:

● Synthesis and implementation of Verilog for hardware description.

● Programming the gate-level design onto the DE-115 FPGA board.

Xilinx's FPGA board ML605:

This is the second board used for implementation of the design (before and after) in order to measure the power usage. We switched to this board because it has a 0.005 ohm series resistor, which is needed in order to measure directly the current drawn by the FPGA chip.
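As a rough illustration of how a series (shunt) resistor yields a power figure, the sketch below applies Ohm's law. Only the 0.005 ohm value comes from the text; the voltage readings, the 1.0 V rail and the function name are hypothetical.

```python
R_SHUNT_OHM = 0.005  # series resistor value stated for the ML605 board

def fpga_power_w(v_shunt: float, v_supply: float) -> float:
    """Power drawn through the shunt: I = V_shunt / R, P = V_supply * I."""
    current_a = v_shunt / R_SHUNT_OHM
    return v_supply * current_a

# Hypothetical readings: 2.5 mV across the shunt on a 1.0 V rail.
power = fpga_power_w(v_shunt=0.0025, v_supply=1.0)  # 0.5 A, about 0.5 W
```

Because the resistor is so small, a change of a few milliwatts in the design corresponds to only microvolts across the shunt, which hints at why small differences are hard to resolve.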

Xilinx's ISE:

ISE is a software tool produced by Xilinx that we used for synthesis and analysis of our HDL design. In addition, after we received a gate-level design from ISE’s synthesis, we programmed the design onto the ML605 FPGA board, similar to our usage of Altera’s Quartus II.


Cadence's SimVision:

SimVision is the waveform viewer in the Cadence EDA suite. We mainly used it for behavioral design verification before FPGA implementation. Due to the RTL changes made, it was necessary to verify our design before applying our Data-Driven Clock Gating design flow. After applying the flow, we used it to make sure the design still had exactly the same behavior. We used a test bench with two “OK” signals that indicated proper behavior of the design. It was crucial to use the same test bench on both designs, so that we could be certain that the two designs really had the same behavior.


Problems and Solutions

The evil design problem:

We started our project with the Animation Graphic Engine (AGE) RTL design. The design used Altera's IP to implement some elements. This code could not be edited or viewed. Our attempt to design these components ourselves failed due to a shortage of memory resources on the FPGA board. In other words, Quartus synthesized the design with our components, and the output was too large to be implemented on any of the FPGA boards in the VLSI lab’s possession.

The solution we chose was to switch to an alternative RTL design, the DMA Controller, which uses no IP.

The “power” to measure power problem:

The design was implemented on Altera's FPGA board, but measuring the power savings failed because the ammeter measured the consumption of the whole board rather than that of the FPGA alone, i.e., the measurement lacked resolution.

To try to solve this problem, we implemented the design on Xilinx's ML-605 FPGA board, which has a built-in 0.005 ohm resistor in series with the FPGA power pads, enabling power measurement of the FPGA alone using the ISE software.

Unfortunately, this still did not solve our problem. Due to the high static power consumption of the FPGA chip itself, we were not able to measure the actual power consumption of the design without the static (bias) power masking the result. To try to make the design's contribution stand out against the static power, we instantiated the design 10 times, and then 100 times. The power consumption measured for one instance compared to 100 was extremely similar, showing that our ability to measure power on the FPGA board is limited. In addition, we contacted field application engineers from Xilinx, and they agreed that there is no solution to our problem short of ASIC implementation. This option was not available to us.

The Tri-state Area Problem

Our first flow ran on a gate-level design rather than on RTL. The “tracing” elements were implemented as special flip-flops with a tracer inside them. This meant that it was necessary to synthesize the RTL before beginning the process. For this, we used the RC (RTL Compiler). The RC was instructed to synthesize the RTL using the special flip-flops, so that a simulation gathering toggling information could later be run.

The problem was that the RC synthesized the RTL using tri-state buffers. At the FPGA synthesis stage, Altera Quartus gave a compilation error since it could not synthesize the tri-state buffers.


The problem was solved by moving the design flow to RTL, and allowing the Altera

Quartus to synthesize the design without these limitations.

In addition, this change also shortened the runtime of the entire design flow.

The Logical separation Problem

In the design flow, the clock gating implementation step requires that the design’s sequential and combinational logic be separated in the RTL.

Because of that, we separated the sequential and combinational logic of the DMA controller.

It is not ideal to change the original design in order to perform the flow. However, the tools are still under development and will be more versatile in the future.

An explanation of these types of logic:

Combinational logic - circuits that implement Boolean functions. The outputs of these circuits are functions of the current inputs only. An example of combinational logic:

Sequential logic - like a combinational logic circuit, a sequential logic circuit has inputs and outputs. However, the output depends on stored state (for example, the state of an FSM) as well as on the inputs. Furthermore, it contains a clock.

An example of sequential logic:
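The two kinds of logic can be contrasted with a small behavioral model. The sketch below is written in Python rather than in an HDL, and the full adder and 2-bit counter are illustrative choices that do not come from the DMA controller.

```python
from typing import Tuple

def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Combinational: the outputs depend only on the current inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

class Counter2Bit:
    """Sequential: the output depends on stored state, updated on each clock."""

    def __init__(self) -> None:
        self.state = 0

    def clock(self, enable: int) -> int:
        # Next-state logic (combinational part), kept separate from the
        # register update (sequential part).
        next_state = (self.state + 1) % 4 if enable else self.state
        self.state = next_state
        return self.state

counter = Counter2Bit()
counts = [counter.clock(1) for _ in range(5)]   # counts up and wraps around
held = counter.clock(0)                         # enable low: state is kept
```

Keeping the next-state computation apart from the state update mirrors the separation that the flow's scripts expect in the RTL.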


Direct Memory Access Controller

Background

As mentioned before, the DUT was changed from the AGE to the DMAC. This meant we had to become familiar with the DMAC's logic and behavior, since we needed to create test-benches for it.

Behavior

The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0 Device. The Device is dedicated to the USB Communication Channel. It has the potential of being integrated into the Protocol Engine (PE) Device. The DMAC's function within the GOK Device is to transfer data between the USB2.0 Protocol Engine Receive/Transmit (RX/TX) Packet Buffers and the Device Animation Graphics Engine (AGE) Function Endpoints, in response to PE service requests. The DMAC is the only Bus Master in the system. It is pre-configured to perform the required data transfers to and from the AGE Application Function Core. The DMAC is capable of performing word gather-scatter and supports a system data bus width of up to 48 bits (6 bytes) and an address bus of up to 24 bits (16 Mbyte address range).

Flyby and gather-scatter data transfer modes are supported, but memory-to-memory transfers are not.

System level

The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data

transfers to and from the AGE Application Function Core. The DMAC Configuration

Memory contains the necessary information to access any Endpoint Buffer (Memory

or Register Files), in the AGE Core.


The PE issues a Transaction Request command signal and a Packet Transfer Request signal to the DMAC for a specific AGE Endpoint. The DMAC responds with an Acknowledge signal to the PE and starts data transfer transactions between the PE Packet Buffers and the EP Buffers (Registers or Memory) over the system bus by issuing Endpoint Buffer Address, Read and Write control signals, while monitoring the AGE Wait signal (for slow Memories). Data transfers are performed either as single-bus-cycle transfers of 16-bit words (flyby mode) or in multiple bus cycles (gather-scatter mode), to match different source and destination bus widths. In both single-packet and multiple-packet data transfers, a specific EP Input Transaction (from EP to Host) is terminated by the DMAC monitoring the End-Of-Transaction signal issued by the Function (last EP Buffer address reached). In the case of an Output Transaction (from Host to EP), if the last packet size is smaller than the predefined EP MaxPacketSize, or the packet has data size = 0 (zero), the PE de-asserts its Transfer Request signal. In the case of multiple-packet data transfers, only the Packet Transfer Request signal is de-asserted, and the DMAC carries on with the next packet's data transfers as soon as the Packet Transfer Request signal is asserted again. When both Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle state and is ready to perform the next transfer request.

The DMAC accesses the PE’s RX/TX Buffers (FIFOs) as I/O devices, using dedicated PE read/write signals. Data is transferred over the system data bus as 16-bit words.
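The bus-width figures and the idea of matching 16-bit words to a wider endpoint can be sketched as follows. The packing order (first word in the low bits) and the function names are assumptions made for illustration; the specification excerpt does not state how words are ordered.

```python
from typing import List

ADDR_BITS = 24   # address bus width from the description
DATA_BITS = 48   # maximum system data bus width
WORD_BITS = 16   # PE packet buffer word size

address_range_bytes = 2 ** ADDR_BITS           # 16 Mbytes
words_per_bus_word = DATA_BITS // WORD_BITS    # three 16-bit words

def gather(words: List[int]) -> int:
    """Pack 16-bit words into one wide value (assumed: first word in low bits)."""
    value = 0
    for i, w in enumerate(words):
        value |= (w & 0xFFFF) << (WORD_BITS * i)
    return value

def scatter(value: int, count: int) -> List[int]:
    """Split a wide value back into `count` 16-bit words."""
    return [(value >> (WORD_BITS * i)) & 0xFFFF for i in range(count)]

wide = gather([0x1111, 0x2222, 0x3333])
narrow = scatter(wide, words_per_bus_word)
```

In this model a 48-bit endpoint word takes three 16-bit bus cycles on the PE side, which is the sense in which a transfer needs multiple bus cycles to match different widths.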


The block diagram of the DMA controller's state machine:

Notice that the flow splits left and right for the two directions: Rx path, and Tx path. Inside

each direction there are more splits, for different data sizes.


The Design:

This is a block diagram of the DMAC design. It consists of a Data Interface Unit (DIU), a Finite State Machine unit and a Configuration ROM. On the left is the interface with the protocol engine. On the right is the interface with the System Bus and the function.

The next stage was to create an environment that would allow us to visually verify the design on an FPGA board.


Top design, with Verification diagram:

This block diagram represents the design that was implemented on the FPGA board.

In addition to the DMA controller, we used a stimuli ROM addressed by an 8-bit counter to emulate data from the Protocol Engine or from the function, i.e., the AGE.

To confirm the correctness of the data transfer, we used a 77-bit comparator and a monitor ROM. The comparator compared the transferred data to the expected result stored in the monitor ROM and indicated, using two LEDs, whether the data was transferred correctly and whether the DMA control signals were in the correct state.

The components:

● 8-bit counter: a regular 8-bit counter. On each clock cycle the count is increased by 1. The output returns to 0x00 after reaching 0xFF or upon reset.

● Stimuli ROM: a Read Only Memory component that contains the data that will be driven into the inputs of the DMAC. It is made of 57-bit words. It receives its address from the 8-bit counter.


● Monitor ROM: a Read Only Memory component that contains the data expected at the output of the DMAC for each input address. It receives its address from the 8-bit counter.

● 77-bit Comparator: a unit that compares the expected data (from the monitor ROM) to the collected data (from the DMAC). It splits the comparison into two parts: data and control signals. If the expected and collected values are identical, both LEDs are on.

Thus, if both LEDs are on while the design is running, the design is working properly. It is important to note that during reset, only the data OK LED is on.
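The checking scheme built from these components can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: the DUT is replaced by a hypothetical pure function `toy_dut` (the real DMAC is sequential), the ROM contents are made up, and the two flags accumulate over a full pass instead of showing the instantaneous LED state.

```python
from typing import Callable, List, Tuple

def run_check(stimuli_rom: List[int],
              monitor_rom: List[Tuple[int, int]],
              dut: Callable[[int], Tuple[int, int]]) -> Tuple[bool, bool]:
    """Step an 8-bit counter through both ROMs and compare the DUT outputs.

    Returns (data_ok, control_ok) after one full pass of 256 addresses.
    """
    data_ok = control_ok = True
    addr = 0
    for _ in range(256):
        data, ctrl = dut(stimuli_rom[addr])
        exp_data, exp_ctrl = monitor_rom[addr]
        data_ok = data_ok and data == exp_data
        control_ok = control_ok and ctrl == exp_ctrl
        addr = (addr + 1) & 0xFF   # wraps from 0xFF back to 0x00
    return data_ok, control_ok

def toy_dut(word: int) -> Tuple[int, int]:
    """Hypothetical stand-in for the DUT: returns (data, control)."""
    return word ^ 0xFF, word & 1

stimuli = list(range(256))
monitor = [toy_dut(w) for w in stimuli]      # expected (golden) responses

good = run_check(stimuli, monitor, toy_dut)
bad = run_check(stimuli, monitor, lambda w: (w, w & 1))   # corrupted data path
```

Running the same stimuli and monitor contents against both the original and the gated design is what makes the two "OK" indications comparable.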

After debugging the test bench, we obtained two working designs, with and without Data Driven Clock Gating.


Results

In parallel to our work this year, our flow was run on designs at CEVA.

The VLSI department at CEVA already used clock gating in their design flow. Their gaters are based on control signals: if an entire clock domain is not functioning at a given time, the clock signal is blocked and is not forwarded to that clock domain.

The clock gaters we suggest in the design flow are based on data and statistical information.

The data-driven clock gaters were added to the design in addition to the control-driven clock gaters. This limited the achievable power reduction, because the design's power had already been reduced.

To demonstrate the potential of the design flow, an activity test was performed on the DUT. In this test, flip-flops that did not need a clock signal were sampled:

The table above shows that almost 98% of the flip-flops were active for only 0-5% of the entire test. This means that there is potential for saving power by implementing the technique on the design. However, that is not enough to ensure that saving is possible. It is also necessary to show that many flip-flops have high correlation between their clock-toggle vectors, so that they can be gated together. The following graph shows just that:


The X-axis is the correlation percentage. The Y-axis is the number of flip-flops with the corresponding correlation percentage. As can be seen, a very small percentage of the flip-flops have low correlation, and a very large percentage have high correlation.

Now we can soundly predict high power saving potential.
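The two statistics used in this argument, activity and correlation, can be computed from toggle vectors as in the sketch below. The exact definition of "correlation percentage" used in the test is not given in the text, so the agreement measure here (the fraction of cycles in which two flip-flops behave alike) is an assumption, as are the vectors.

```python
from typing import List

def activity(toggles: List[int]) -> float:
    """Fraction of cycles in which the flip-flop toggles."""
    return sum(toggles) / len(toggles)

def agreement(a: List[int], b: List[int]) -> float:
    """Fraction of cycles in which both flip-flops behave alike
    (both toggle or both stay stable). Assumed definition."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical 20-cycle toggle vectors, each with a single toggle.
ff_x = [0] * 20
ff_x[3] = 1
ff_y = list(ff_x)      # toggles in the same cycle as ff_x
ff_z = [0] * 20
ff_z[10] = 1           # toggles in a different cycle

act_x = activity(ff_x)          # 5% activity
same = agreement(ff_x, ff_y)    # full agreement
diff = agreement(ff_x, ff_z)    # agreement in 18 of 20 cycles
```

Under this assumed measure, flip-flops with very low activity agree in most cycles simply because both are usually idle, which fits the combination of low activity and high correlation reported here.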

After the entire design flow was implemented on three different designs and the power was measured with the simulation program SpyGlass, the results received at CEVA were:

The SpyGlass Results:

The power reduction percentages that SpyGlass measured:

Design A: 22%

Design B: 15%

Design C: 13%


The tables below show the detailed results received with SpyGlass on Design C:

Golden design          Leakage   Internal   Switching   Total
Total Power:           337uW     12.7mW     40.0mW      53.0mW
Combinational Power:   224uW     2.65mW     22.3mW      25.2mW
Sequential Power:      95.2uW    8.96mW     1.05mW      10.1mW
Black Box Power:       0W        0W         10.3uW      10.3uW
Memory Power:          0W        0W         0W          0W
IO PAD Power:          0W        0W         0W          0W
Clock Power:           17.8uW    1.10mW     16.6mW      17.7mW

Above is the power measurement report that was derived from analysis of the golden

design. This means that no data-driven clock gating was performed on the design.

The next table shows the main power consumption data according to a given k. This

means that the data-driven clock gating process ran, and a separate design was created

for each gater fan-in size.


k        Leakage   Internal   Switching   Total
golden   337uW     12.7mW     40.0mW      53.0mW
k=4      415uW     10.6mW     49.6mW      60.6mW
k=8      398uW     9.04mW     42.8mW      52.3mW
k=16     388uW     8.83mW     40.2mW      49.5mW
k=32     387uW     9.52mW     40.3mW      50.2mW
k=64     385uW     10.3mW     41.8mW      52.5mW
k=128    383uW     11.0mW     41.2mW      52.6mW

It is easily noticeable that most values of k reduce the total power consumption relative to the golden design; only k=4 increases it.
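The relative change for each k can be recomputed from the total power column of the table above. The short script below uses only those reported totals; the variable names are illustrative.

```python
GOLDEN_MW = 53.0  # total power of the golden design, from the table

# Total power per gater fan-out k, in mW, from the table.
total_mw = {4: 60.6, 8: 52.3, 16: 49.5, 32: 50.2, 64: 52.5, 128: 52.6}

# Positive values are savings, negative values are increases.
reduction_pct = {k: 100.0 * (GOLDEN_MW - p) / GOLDEN_MW
                 for k, p in total_mw.items()}
best_k = max(reduction_pct, key=reduction_pct.get)

for k, r in reduction_pct.items():
    print(f"k={k:<3d} {r:+.1f}%")
```

The largest saving in this table occurs at k=16, and k=4 is the only fan-out with a negative value.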

The Power reduction vs. K (fan-in):


The following table shows the power consumption data for a fan-in of k=16, with a few variations that were made outside of the design flow.

K=16 (with few variations)   Leakage   Internal   Switching   Total
Total Power:                 378uW     8.88mW     37.1mW      46.3mW
Combinational Power:         255uW     2.82mW     23.2mW      26.3mW
Sequential Power:            100uW     5.14mW     1.62mW      6.86mW
Black Box Power:             0W        0W         10.8uW      10.8uW
Memory Power:                0W        0W         0W          0W
IO PAD Power:                0W        0W         0W          0W
Clock Power:                 22.9uW    917uW      12.2mW      13.2mW

This design consumed about 13% less total power than the original golden design above (46.3mW versus 53.0mW).

Result review:

● With k=16 the power saving is maximal. When the fan-in is too small, as with k=4, the power increases.

● The combinational power increases with Data Driven Clock Gating as a result of the extra logic components, the gaters. However, the sequential power and the clock power decrease more significantly because of the clock disabling.

● Although the design already had control-driven clock gating, the activity test shows that there was still room to save power, because the activity of 98% of the flip-flops was low and the correlation between most of them was high.


Conclusions

The results shown in the previous chapter demonstrate that Professor Shmuel Wimer’s research, “Data-Driven Clock Gating”, is a practical and efficient power reduction technique. The design flow developed during this project turns the research into a practical tool that can transform a given RTL design into a more energy-efficient one.

Design flow review:

● The ability to work at the RTL level saved a lot of design flow runtime and made the flow more effective. This change becomes more relevant, and even crucial, when implementing the design flow on a large design, due to the exponential growth of runtime in every stage of the design flow.

● We added overhead to the design in the form of extra logic components, the gaters. The main source of power saving was the ability to group flip-flops together based on statistical knowledge. Both of these aspects appeared in the results, as an increase and a decrease of power in the final design, respectively.

● Even when a design already has control-driven clock gaters, Data-Driven Clock Gating proves effective. The fact that most of the flip-flops were inactive for most of the run time, together with the high correlation between most of them, made it possible to decrease power despite the control-driven gaters.

● There is still room for improvement in the versatility and user friendliness of the scripts and the design flow. The limitations of the scripts create a need to change the design, because the scripts cannot handle a design in which combinational and sequential logic are mixed. In addition, the scripts will not work on a design that has a synchronous reset. The code added to the test-bench for the tracing stage should be inserted as part of the flow (by one of the programs) and not manually. It would be ideal to create a main program with a graphical user interface (GUI) that combines the entire design flow. That way, the flow would be easier to run and more user friendly.

● There is still a need to obtain results in ASIC to confirm the efficiency of implementing Data-Driven Clock Gating.

● A good simulation that mimics real application use of the design has significant influence on the effectiveness of the design flow. This is because the technique is based on statistics and correlation: the more realistic the simulation, the more accurate the statistical results.

● Our attempts to measure the power consumption on the FPGA boards were not successful. The reason was that the boards have a very high static power consumption level, due to all their BRAMs and LUTs. Even after instantiating the design 100 times and measuring the power consumption with ISE ChipScope using the built-in 0.005 ohm series resistor, the power difference was not apparent. That is probably the reason FPGA boards are used in the industry to check the design integrity of low-power devices, while the actual devices are manufactured as ASICs.


References and Sources

1. Shmuel Wimer and Israel Koren, “The Optimal Fan-Out of Clock Network for Power Minimization by Adaptive Gating”.

2. Shmuel Wimer and Israel Koren, “Optimal Flip-Flop Grouping in Data-Driven Clock Gating for Maximal Power Saving”.


Appendix A DMAC Spec:

USB2.0 aware DMAC Specification

1. Introduction

The document defines a USB2.0 protocol-aware Direct Memory Access Controller (DMAC) Device.

The DMAC is an integral part of the vendor-specific Graphics-On-Key (GOK) USB2.0 Device.

The Device is dedicated to the USB Communication Channel. It has the potential of being integrated into the Protocol Engine (PE) Device. The DMAC's function within the GOK Device is to transfer data between the USB2.0 Protocol Engine Receive/Transmit (RX/TX) Packet Buffers and the Device Animation Graphics Engine (AGE) Function Endpoints, in response to PE service requests. The DMAC is the only Bus Master in the system. It is pre-configured to perform the required data transfers to and from the AGE Application Function Core. The DMAC is capable of performing word gather-scatter and supports a system data bus width of up to 48 bits (6 bytes) and an address bus of up to 24 bits (16 Mbyte address range).

Flyby and gather-scatter data transfer modes are supported, but memory-to-memory transfers are not.

2. System Level Introduction

Fig. 1 - GOK USB2.0 Device System Block Diagram

The USB 2.0 Device DMAC is pre-programmed (ROM), to perform the required data

transfers to and from the AGE Application Function Core. The DMAC Configuration Memory

contains the necessary information to access any Endpoint Buffer (Memory or Register Files),

in the AGE Core.

The PE issues a Transaction Request command signal and a Packet Transfer Request signal to the DMAC for a specific AGE Endpoint. The DMAC responds with an Acknowledge signal to the PE and starts data transfer transactions between the PE Packet Buffers and the EP Buffers (Registers or Memory) over the system bus. It does so by issuing the Endpoint Buffer Address and the Read and Write control signals, while monitoring the AGE Wait signal (for slow Memories). Data transfers are performed either as single-bus-cycle 16bit word transfers (flyby mode) or in multiple bus cycles (gather-scatter mode), to match different source and destination bus widths.

In both single-packet and multiple-packet data transfers, a specific EP Input Transaction (from EP to Host) is terminated by the DMAC, which monitors the End-Of-Transaction signal issued by the Function (last EP Buffer address reached). In an Output Transaction (from Host to EP), if the last packet size is smaller than the pre-defined EP MaxPacketSize, or the packet has data size = 0 (zero), the PE de-asserts its Transfer Request signal. In multiple-packet data transfers, only the Packet Transfer Request signal is de-asserted between packets, and the DMAC carries on with the next packet's data transfers as soon as the Packet Transfer Request signal is asserted again. When both Transfer Request and Packet Request are de-asserted, the DMAC returns to its idle state and is ready to perform the next transfer request.

The DMAC accesses the PE's RX/TX Buffers (FIFOs) as I/O Devices, using dedicated PE read/write signals. Data is transferred over the system data bus as 16bit words.
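The Output Transaction termination rule described above can be illustrated with a short Python sketch. It is not part of the DMAC design: the function name and the example MaxPacketSize of 512 bytes are assumptions made only for this illustration.

```python
def is_last_out_packet(packet_size: int, max_packet_size: int) -> bool:
    """Return True when an OUT packet ends the transaction.

    Per the rule above, the PE de-asserts Transfer Request when the packet
    is shorter than the EP MaxPacketSize; a zero-length packet is the
    extreme case of a short packet.
    """
    if packet_size < 0 or packet_size > max_packet_size:
        raise ValueError("packet size out of range")
    return packet_size < max_packet_size


# Hypothetical endpoint with MaxPacketSize = 512 bytes (assumed value).
sizes = [512, 512, 100]
flags = [is_last_out_packet(size, 512) for size in sizes]
print(flags)
```

Only the third packet ends the transaction here; the first two are full-size, so only the Packet Transfer Request would be de-asserted between them.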

2.1. DMAC Top Level Introduction

The USB2.0-aware DMAC's three main modules are the Control Core (FSM), the Configuration ROM and the Data Interface Unit (DIU).

Fig. 2 - DMAC Block Diagram

2.2. DMAC Modules

The DMAC is partitioned into modules, as shown in the Fig. 2 Block Diagram and described below.

2.2.1. Configuration ROM

The Configuration ROM contains the essential information necessary to access any pre-defined Application Function Endpoint Buffer (Memory or Register Files). The Configuration information enables the DMAC to properly carry out the data transactions requested by the PE. Since the PE supplies the Endpoint's transfer direction (IN/OUT) and the Endpoint number (1-15) at transaction request time, the specific Endpoint Buffer can be selected from the request alone; the EP Buffer data width (DW), however, must reside within the Configuration ROM.

2.2.2. Control Core

The Control Core is the main Finite State Machine (FSM), handling all Device operations.

At system boot time, the DMAC enters its Idle State, ready to carry out data transfers. It operates under PE control.

The Control Core translates PE requests into data transfer actions, according to the information stored in the Configuration ROM. The PE initiates a data transfer operation by asserting the Transfer Request signal and supplying the Endpoint info (4bit EP number + 1bit in/out). PE requests are passed to the Control Core. The Control Core uses the pre-programmed EP Buffer Data Width (DW) information to perform either a flyby transaction or a gather-scatter transaction. With each bus data transfer, the value in the current address counter is driven onto the address bus, and the current address counter is automatically incremented. At transaction completion (single or multiple packets), the address counter is cleared. The address counter increment, its reset at transaction completion, and the reads and writes of the DIU's Gather-Scatter Registers are all performed under FSM control.

When the PE issues the transaction request signal, the DMAC responds with an Acknowledge signal to the PE, and when the PE issues a packet transfer request, the DMAC starts transferring data as requested. The Control Core issues the required control signals for both the EP Buffer and the PE RX/TX FIFOs, in the correct sequence, to perform either a flyby or a gather-scatter data transfer operation: it issues the Read/Write control signals, monitors the Wait signal and increments the address counter for as long as the data transfer is carried on.

When the last data byte has been received from or sent to the PE Packet Buffer, the PE negates the DMA Request signals.
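The request/acknowledge sequence described above can be sketched as a small behavioural model in Python. This is only an illustration under stated assumptions, not the project's Verilog FSM: the state names, the class and method names, the use of plain booleans (True = asserted) instead of active-low signals, and the moment at which the acknowledge is released are all assumptions. The End-Of-Transaction input is left out to keep the sketch short.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    WAIT_PACKET = auto()  # transaction acknowledged, waiting for a packet request
    TRANSFER = auto()


class ControlCoreModel:
    """Behavioural model; inputs are plain booleans (True = asserted)."""

    def __init__(self):
        self.state = State.IDLE
        self.address = 0   # current address counter
        self.ack = False   # models the acknowledge to the PE

    def step(self, treq: bool, preq: bool, wait: bool = False) -> None:
        """Advance the model by one clock cycle."""
        if self.state is State.IDLE:
            if treq:
                self.ack = True
                self.state = State.WAIT_PACKET
        elif self.state is State.WAIT_PACKET:
            if not treq:
                self._finish()
            elif preq:
                self.state = State.TRANSFER
        elif self.state is State.TRANSFER:
            if not preq:
                # Packet done: wait for the next packet or end the transaction.
                if treq:
                    self.state = State.WAIT_PACKET
                else:
                    self._finish()
            elif not wait:
                # One bus transfer: the address is driven, then incremented.
                self.address += 1

    def _finish(self) -> None:
        # Transaction completion: address counter cleared, back to idle.
        self.state = State.IDLE
        self.address = 0
        self.ack = False  # assumed: acknowledge released on completion


core = ControlCoreModel()
core.step(treq=True, preq=False)        # transaction request acknowledged
core.step(treq=True, preq=True)         # packet request seen
for wait in (False, True, False):       # one wait state in the middle
    core.step(treq=True, preq=True, wait=wait)
mid_state = (core.state.name, core.address)
core.step(treq=False, preq=False)       # both requests de-asserted
end_state = (core.state.name, core.address)
print(mid_state, end_state)
```

In this run the wait state stalls the address counter for one cycle, so two words are moved in three transfer cycles before the model returns to idle with the counter cleared.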

2.2.3. Data Interface Unit (DIU)

The DIU contains the temporary Registers for the gather-scatter transfer operation and their control logic. It also contains the 3-state Data Buffers that manage the bi-directional data flow to/from the AGE Buffers and to/from the PE RX/TX FIFOs.

Gather-Scatter Registers are used for data transfers between a source and a destination of different data widths: the PE RX/TX FIFOs (I/O) are always 16bit wide, while Function Endpoint Buffers can be 16/24/32/48bits wide. 24bit transfers must be performed in an even number of transfers.

There are 4 gather-scatter registers: R1 (16bit), R2 and R3 (8bit each), R4 (16bit).

Gather-Scatter operation:

- IN (from EP to PE TX FIFO)
  32bit: Read one 32bit word into R1-R2-R3. Write two 16bit words from R1 and R2+R3.
  24bit: Read two 24bit words into R1+R2 and R3+R4. Write three 16bit words from R1, R2+R3 and R4.
  48bit: Read one 48bit word into R1-R2-R3-R4. Write three 16bit words from R1, R2+R3 and R4.

- OUT (from PE RX FIFO to EP)
  32bit: Read two 16bit words into R1 and R2+R3. Write one 32bit word from R1-R2-R3.
  24bit: Read three 16bit words into R1-R2-R3-R4. Write two 24bit words from R1+R2 and R3+R4.
  48bit: Read three 16bit words into R1-R2-R3-R4. Write one 48bit word from R1-R2-R3-R4.
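The register usage listed above can be sketched in Python as follows. This is a minimal illustration, not the DIU implementation. The function names are hypothetical, and the byte ordering is an assumption: R1 is taken as the least significant 16 bits, R3 as the high byte of the combined R2+R3 word, and R1 as the first word written to the FIFO.

```python
def ep_to_fifo(ep_words, ep_width):
    """IN direction: split EP words of ep_width bits into 16bit FIFO words."""
    out = []
    if ep_width == 32:
        for w in ep_words:
            r1, r2, r3 = w & 0xFFFF, (w >> 16) & 0xFF, (w >> 24) & 0xFF
            out += [r1, (r3 << 8) | r2]
    elif ep_width == 24:
        if len(ep_words) % 2:
            raise ValueError("24bit transfers need an even number of words")
        for a, b in zip(ep_words[0::2], ep_words[1::2]):
            r1, r2 = a & 0xFFFF, (a >> 16) & 0xFF   # first word  -> R1+R2
            r3, r4 = b & 0xFF, (b >> 8) & 0xFFFF    # second word -> R3+R4
            out += [r1, (r3 << 8) | r2, r4]
    elif ep_width == 48:
        for w in ep_words:
            r1, r2 = w & 0xFFFF, (w >> 16) & 0xFF
            r3, r4 = (w >> 24) & 0xFF, (w >> 32) & 0xFFFF
            out += [r1, (r3 << 8) | r2, r4]
    else:
        raise ValueError("unsupported EP width")
    return out


def fifo_to_ep(fifo_words, ep_width):
    """OUT direction: merge 16bit FIFO words into EP words of ep_width bits."""
    if ep_width not in (24, 32, 48):
        raise ValueError("unsupported EP width")
    group = 2 if ep_width == 32 else 3   # FIFO words gathered per unit
    if len(fifo_words) % group:
        raise ValueError("incomplete gather group")
    out = []
    for i in range(0, len(fifo_words), group):
        chunk = fifo_words[i:i + group]
        r1 = chunk[0] & 0xFFFF
        r2, r3 = chunk[1] & 0xFF, (chunk[1] >> 8) & 0xFF
        r4 = chunk[2] & 0xFFFF if group == 3 else 0
        if ep_width == 32:
            out.append((r3 << 24) | (r2 << 16) | r1)
        elif ep_width == 24:
            out += [(r2 << 16) | r1, (r4 << 8) | r3]
        else:
            out.append((r4 << 32) | (r3 << 24) | (r2 << 16) | r1)
    return out


fifo = ep_to_fifo([0xABCDEF, 0x123456], 24)
print([hex(w) for w in fifo])
print([hex(w) for w in fifo_to_ep(fifo, 24)])
```

The 24bit case shows why an even number of transfers is required: two 24bit words fill the 48 bits of R1-R4 exactly and leave as three 16bit words.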

[Figure: Gather-Scatter register bit layout - R4: bits 47-32, R3: bits 31-24, R2: bits 23-16, R1: bits 15-0]

2.3. Interfaces

Signal Name   Signal Type       Description
dbus[47:0]    Bi-directional    Data Bus. These pins serve as the input and output System data bus (for the local µC, PE Packet Buffers and Application Buffers).
abus[23:0]    Bi-directional    Address Bus. Serves as the System Address Bus for the DMAC. The 16 LSBs are used by the µC to access the Control Registers.
nrd           Bi-directional    System Read signal issued by Bus Masters (DMAC or µC).
nwr           Bi-directional    System Write signal issued by Bus Masters (DMAC or µC).
npbrd         Out               Read signal for PE during data transfers. Active low.
npbwr         Out               Write signal for PE during data transfers. Active low.
nwait         In, active low    Used to extend the bus cycle for slow Application Memories.
ndack         In, active low    DMAC Acknowledge to PE Transfer Request.
ntreq         In, active low    DMA Transaction Request signal from the Protocol Engine.
npreq         In, active low    DMA Packet Request signal from the Protocol Engine.
neot          In, active low    End-Of-Transaction signal, issued by the Function.
epn[3:0]      In, active high   Endpoint number (1-15) for the requested data transfer.
ep_dir        In, active high   Endpoint Direction IN (1) or OUT (0) for the requested data transfer.
clk           Input             Oscillator input. Connected to an External Oscillator.
nrst          In, active low    Reset. External asynchronous static reset.
Vcc           Input             Internal Power Source (derived from USB V+, via LDO).
Vss           Input             Internal Power Ground (derived from USB V-, via LDO).

Note: The DMAC uses the Endpoint number (epn[3:0]) and the Transaction direction (ep_dir) as its internal ROM address, to perform the expected data transfer to/from the specific Endpoint at the Function Core. These signals also serve as chip selects for the Buffers within the Function Cores. The DMAC also issues the nrd/nwr and npbrd/npbwr signals and the current EP Buffer address to handle the data transfer. The control signals npbrd and npbwr are used by the PE to drive RX FIFO output data onto the system data bus or to latch data from the system bus into the TX FIFO, depending on the transfer direction.

2.4. Programming Model

2.4.1. Configuration ROM

The ROM holds the configuration data. A single function within a single configuration, having up to 15 OUT and 15 IN Endpoints, is supported.

Information for a set of 15 OUT and 15 IN Endpoints is pre-programmed in the ROM.

Information for each Endpoint includes:

- EP Data Width (DW, in 16bit words) - 2bits/EP [00 - 16bit, 01 - 32bit, 10 - 2x24bit, 11 - 48bit]. Default - 00.

Each EP takes 2 bits, so a 16bit ROM Word holds the DW info of 8 Endpoints.

The information for 15 EPs per direction per Function is stored in two 16bit ROM Words.

The total Configuration ROM size is 2 x 2 = 4 16bit ROM Words.

Individual ROM Words are accessed via an internal 2-bit address bus.

2.4.2. ROM Words Data Formats

Fn_EPn_I/O Registers - EP Data Size (Data Width units)

MSB                                                              LSB
D15 D14 D13 D12 D11 D10 D9 D8 D7 D6 D5 D4 D3 D2 D1 D0
|--EP8--|--EP7--|--EP6--|--EP5--|--EP4--|--EP3--|--EP2--|--EP1--|

D15 D14 D13 D12 D11 D10 D9 D8 D7 D6 D5 D4 D3 D2 D1 D0
|--EP15--|--EP14--|--EP13--|--EP12--|--EP11--|--EP10--|--EP9--|

2.4.3. Configuration ROM Words List

Address   Register Name   Register Description
0h        F1_EP1_8_O      Function #1, Output EPs 1-8 DW
1h        F1_EP9_15_O     Function #1, Output EPs 9-15 DW
2h        F1_EP1_8_I      Function #1, Input EPs 1-8 DW
3h        F1_EP9_15_I     Function #1, Input EPs 9-15 DW
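The ROM word list and the 2-bit DW encoding can be combined into a lookup, sketched below in Python. This is an illustration under assumptions, not the project's Verilog: the function names and the example ROM contents are hypothetical, and the sketch assumes that EP9 occupies bits D1:D0 of the second word of each direction, in the same way that EP1 does in the first word.

```python
# 2-bit DW code -> EP buffer width in bits (code 0b10 means 2 x 24bit).
DW_BITS = {0b00: 16, 0b01: 32, 0b10: 24, 0b11: 48}


def rom_address(epn: int, ep_dir: int) -> int:
    """2-bit ROM word address: OUT words at 0h-1h, IN words at 2h-3h."""
    if not 1 <= epn <= 15:
        raise ValueError("Endpoint number must be 1-15")
    if ep_dir not in (0, 1):
        raise ValueError("ep_dir must be 0 (OUT) or 1 (IN)")
    return (ep_dir << 1) | ((epn - 1) >> 3)


def ep_data_width(rom, epn: int, ep_dir: int) -> int:
    """Return the EP buffer data width in bits for the given Endpoint."""
    word = rom[rom_address(epn, ep_dir)]
    field = (epn - 1) & 0x7              # assumed: EP1/EP9 sit in D1:D0
    code = (word >> (2 * field)) & 0b11
    return DW_BITS[code]


# Hypothetical ROM image: OUT EP2 is 32bit, IN EP9 is 48bit, the rest default.
rom = [0x0004, 0x0000, 0x0000, 0x0003]
widths = (
    ep_data_width(rom, 1, 0),
    ep_data_width(rom, 2, 0),
    ep_data_width(rom, 9, 1),
)
print(widths)
```

With this addressing, ep_dir supplies the upper address bit and the Endpoint number selects both the word (EPs 1-8 or 9-15) and the 2-bit field within it.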

3. Implementation

The DMAC is designed as a Front-End for a near-future ASIC implementation. It is designed in Verilog HDL and simulated and logically verified for correct operation using the Cadence Incisive Simulator.

An intermediate hardware implementation, for proof of concept and correct functionality, is performed on the FPGA Device located on the Altera DE2 Development Board, under the Quartus II Development Environment. The Verilog code logically verified in Incisive is used for the implementation.

The Quartus II MegaFunction Wizard is not used.

There is an option to incorporate the Protocol-Aware DMAC into the USB2.0 Protocol Engine.


4. USB2.0 Device System Diagram

[System diagram. Blocks: USB Connector, USB2.0 PHY, USB2.0 Protocol Engine, USB2.0-Aware DMAC, Function Core 0, XTAL Oscillator, LDO. Buses: Data Bus (dbus[47:0], dbus[15:0], dbus[m:0]), Address Bus (abus[23:0], abus[15:0], abus[n:0]), UTMI. Signals: clk, nrst, ntreq, npreq, ndack, neot, nwait, epn[3:0], ep_dir, nrd, nwr, npbrd, npbwr. USB lines: +V, -V, +D, -D. Supplies: +Vcc, -Vss.]

Fig. 3 USB2.0 GOK Device Controller System Diagram

4.1. DMA Transfer Types and Modes:

4.1.1. Flyby DMA transfer

The fastest DMA transfer type is referred to as a single-cycle, single-address, or flyby transfer.

In a flyby DMA transfer, a single bus operation accomplishes the transfer, with data read from the source and written to the destination simultaneously. In flyby operation, the device requesting service (PE) asserts a DMA request on the appropriate channel request line of the DMAC (specific Function Endpoint). In response, the DMAC issues an acknowledge signal to the requesting device (PE) and starts the data transfer by issuing the appropriate control signals and the Endpoint buffer start address (0). These control signals alert the requesting device to drive the data onto the system data bus or to latch the data from the system bus, depending on the direction of the transfer. In other words, a flyby DMA transfer looks like a memory read or write cycle, with the DMAC supplying the address and the I/O device reading or writing the data. Because flyby DMA transfers involve a single memory cycle per data transfer, these transfers are very efficient; however, memory-to-memory transfers are not possible in this mode.
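A flyby transfer can be modelled in a few lines of Python, as a sketch of the idea rather than of the actual hardware. The function name and the use of a list for the EP buffer and a deque for the PE FIFO are assumptions. The point it illustrates is that the DMAC only supplies the address: each word passes directly between the EP buffer and the FIFO in one bus cycle, without being stored in the DMAC.

```python
from collections import deque


def flyby_transfer(ep_buffer, fifo, count, ep_dir_in):
    """Move `count` 16bit words, one bus cycle per word.

    Returns the number of bus cycles used.
    """
    address = 0   # the EP buffer start address is 0
    cycles = 0
    for _ in range(count):
        if ep_dir_in:
            # IN: the EP buffer drives the bus, the PE TX FIFO latches it.
            fifo.append(ep_buffer[address])
        else:
            # OUT: the PE RX FIFO drives the bus, the EP buffer latches it.
            ep_buffer[address] = fifo.popleft()
        address += 1   # address counter increments with each bus transfer
        cycles += 1
    return cycles


# IN transaction: three words from the EP buffer to the TX FIFO.
tx_fifo = deque()
in_cycles = flyby_transfer([0x1111, 0x2222, 0x3333], tx_fifo, 3, True)

# OUT transaction: two words from the RX FIFO to the EP buffer.
ep_buf = [0, 0]
out_cycles = flyby_transfer(ep_buf, deque([0xAAAA, 0xBBBB]), 2, False)
print(in_cycles, list(tx_fifo), out_cycles, ep_buf)
```

Because there is a single address counter and the other side is always the FIFO, the same loop cannot express a memory-to-memory copy, which matches the restriction stated above.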

4.1.2. Gather-Scatter DMA Transfer

This type of transfer is useful for interfacing devices with different data bus sizes. The DMAC employs multiple-cycle, multiple-address data transfers, called Gather-Scatter transfers.

The data being transferred is first read from the I/O device or memory into temporary DMAC internal data registers. The data is then written to the memory or I/O device in the following cycles.

This device has only a single address counter and hence supports only memory-to-I/O transfers.
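The cost of the extra cycles can be estimated with a rough Python sketch. The per-unit access counts follow the Gather-Scatter operation list in section 2.2.3; the assumptions are that every register read or write takes exactly one bus cycle and that no wait states occur. The names are hypothetical and the figures are estimates, not measured results.

```python
# EP width in bits -> (bus cycles per unit, payload bytes per unit).
# Assumed: one bus cycle per EP access and per 16bit FIFO access.
CYCLES_PER_UNIT = {
    16: (1, 2),       # flyby: one cycle moves one 16bit word
    32: (1 + 2, 4),   # one 32bit EP access + two FIFO accesses
    24: (2 + 3, 6),   # two 24bit EP accesses + three FIFO accesses
    48: (1 + 3, 6),   # one 48bit EP access + three FIFO accesses
}


def bus_cycles(payload_bytes: int, ep_width: int) -> int:
    """Estimate the bus cycles needed to move payload_bytes."""
    cycles, unit_bytes = CYCLES_PER_UNIT[ep_width]
    if payload_bytes % unit_bytes:
        raise ValueError("payload is not a whole number of transfer units")
    return (payload_bytes // unit_bytes) * cycles


estimate = {width: bus_cycles(48, width) for width in (16, 32, 24, 48)}
print(estimate)
```

Under these assumptions a 48-byte payload takes 24 cycles in flyby mode and between 32 and 40 cycles in gather-scatter mode, which is the price paid for matching different bus widths.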