
Page 1: CALIFORNIA STATE UNIVERSITY, NORTHRIDGE High-Throughput

CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

High-Throughput, Lossless Data Compression and Decompression

On FPGAs

A graduate project submitted in partial fulfillment of the requirements

for the degree of Master of Science

in Electrical Engineering.

By

Vikas Udayashekar

in collaboration with

Spoorthi Suresh

May 2012

ii

The graduate project of Vikas Udayashekar is approved:

________________________________________________ ____________

Dr. Somnath Chattopadhyay Date

_________________________________________________ ____________

Dr. Ahmad Sarfaraz Date

_________________________________________________ ____________

Dr. Ramin Roosta, Chair Date

California State University, Northridge

iii

ACKNOWLEDGEMENT

The satisfaction and euphoria that accompanies the successful completion of any

task would be incomplete without the mention of the people who made it possible and

whose constant encouragement and guidance has been a source of inspiration throughout

the course of the project.

We express our sincere gratitude to Dr. Ramin Roosta, our project committee

chairperson. His invaluable assistance is one of the main reasons that this project has been

successfully completed. We also wish to thank the other members of our graduate project

committee, Dr. Somnath Chattopadhyay and Dr. Ahmad Sarfaraz, for their suggestions

and support. We would like to extend our profound gratitude to our Department Chair Dr.

Ali Amini for facilitating and helping us.

It is by God’s grace and the continuous support of our parents and friends that we

have been able to complete our MS program. Our family’s invaluable support in

providing us with a high quality of education has helped us achieve our goals. We want

to also express our appreciation to the Electrical and Computer Engineering Department

at California State University, Northridge, including all the professors whose classes we

had the pleasure to take.

iv

TABLE OF CONTENTS

SIGNATURE PAGE………………………………………………………………...…. ii

ACKNOWLEDGEMENT……...………………………….………………………..…. iii

LIST OF FIGURES…….....……..…………………………………………..…….......vi

ABSTRACT………………………..……………………….…………………………...vii

CHAPTER 1 INTRODUCTION………….………………………………………….....1

1 Introduction and Background………………...…………...………..…..1

1.1 How does compression work? ………...…………………….2

1.2 Text and Signals: lossless and lossy compression…………...2

CHAPTER 2 842B ALGORITHM………………………………………………..…... 4

2.1 Introduction……………………………………………………….…4

CHAPTER 3 842B FPGA DESIGN…………………………………………………....6

3.1 Introduction ………………………………………...………………..6

3.2 FPGA Compression Pipeline……………………………..…………..6

3.3 FPGA Decompression Pipeline ……………………….…………….10

CHAPTER 4 SOFTWARE LANGUAGE/ HARDWARE IMPLEMENTATION…...12

4.1 Programmable Devices………………………………….…………...12

4.1.1 Programmable Logic Devices …………….……………..12

4.1.2 Complex Programmable Devices ……………………….13

4.1.3 Field Programmable Gate Arrays (FPGA)………..……...14

4.1.3.1 Advantages of FPGA……………………….…...16

4.2 Hardware Design and Development………………………………..17

4.2.1 Design Entry………………………………………….....17

v

4.2.2 Synthesis………………………….……………….…...18

4.2.3 Simulation…………………………………….……......18

4.2.4 Implementation………………………….……………..18

4.2.4.1 Translate…………………………….……......18

4.2.4.2 Map…………………………………………..18

4.2.4.3 Place and Route……………………………...19

4.3 Device Programming……………………………..……………... 19

4.4 Verilog HDL……………………………………………………...20

4.4.1 Importance of HDLs…………………..………………20

4.4.2 Why Verilog? ………………………………….……...20

CHAPTER 5 RESULTS AND DISCUSSION………………………………………..22

Verification in ModelSim (Xilinx)………….…………………………….…..22

Compression………….……………………………………………….……..22

Decompression………….……………………………….…………………..30

HDL Synthesis Report- Compression.….……………….…………………..38

HDL Synthesis Report- Decompression.………………….………….……..39

CHAPTER 6 CONCLUSION……………………………………………………………40

REFERENCES..………………………………………………………..…………….41

APPENDIX …………………………………………………………………….…..42

vi

LIST OF FIGURES

Figure 1: 8 Byte Input Split into 7 Phrases ……………….………………………...…5

Figure 2: FPGA Compression Pipeline……………………………………………….. 7

Figure 3: FPGA Decompression Pipeline ……………………………………………. 10

Figure 4: Internal Structure of a CPLD……………………………………………….. 13

Figure 5: Internal Structure of an FPGA ………………...…………………………… 15

Figure 6: Internal architecture of CLB …………………………………...…………... 16

Figure 7: Design Flow ……………………………………………………………..…. 17

Figure 8: Simulation result 1 for Compression................................................................22

Figure 9: Simulation result 2 for Compression …………………….…………..…….. 23

Figure 10: Simulation result 3 for Compression ………………….….………………..24

Figure 11: Simulation result 4 for Compression …………………….………………...25

Figure 12: Simulation result 5 for Compression ….…………………………………..26

Figure 13: Simulation result 6 for Compression..……………………………..……….27

Figure 14: Simulation result 7 for Compression.……………………………………....28

Figure 15: Simulation result 8 for Compression …………………………….…..…….29

Figure 16: Simulation result 1 for Decompression ………………………...………….30

Figure 17: Simulation result 2 for Decompression.………………………..…………..31

Figure 18: Simulation result 3 for Decompression …………………………………....32

vii

Figure 19: Simulation result 4 for Decompression ……………………………………33

Figure 20: Simulation result 5 for Decompression ……..……………….………….....34

Figure 21: Simulation result 6 for Decompression …………………………...…….....35

Figure 22: Simulation result 7 for Decompression …………………………..…….....36

Figure 23: Simulation result 8 for Decompression ………………………..……….....37

viii

ABSTRACT

High-Throughput, Lossless Data Compression and Decompression

on FPGAs

By

Vikas Udayashekar

Master of Science in Electrical Engineering

Lossless compression is often applied before data is written to a storage medium or transmitted across a transmission medium. Compression saves storage space or transmission bandwidth; when the data is subsequently read, a decompression operation is performed.

Though this scheme has clear benefits, the execution time of compression and

decompression is critical to its application in real-time systems. Software compression

utilities are often slow, leading to degraded system performance. Hardware-based

solutions, on the other hand, often have large resource requirements and are not

amenable to supporting future algorithmic changes.

We present a high-throughput, streaming, lossless compression algorithm and its efficient implementation on FPGAs. The proposed solution provides a peak throughput of 1 GB/sec per engine, with a sustained overall measured throughput of 2.66 GB/sec on a PCIe-based FPGA board carrying compression and decompression engines. This result represents an overall speedup of 13.6x over the reference software implementation. With multiple engines running in parallel, the proposed design provides a path to potential speedups of up to two orders of magnitude. In the current implementation, the achievable overall throughput is limited only by the available PCIe bus bandwidth.

1

CHAPTER 1: INTRODUCTION

1. Introduction and Background

Lossless data compression is often used to save storage space or reduce the required transmission bandwidth. Data compression algorithms are frequently implemented in software. Although this approach saves valuable real estate on processor chips and allows later modifications to the algorithm, it can become a performance bottleneck. Most existing work on data compression has concentrated on achieving the best possible compression efficiency, aiming for the entropy of the data. In a number of applications, however, the execution speed of the compression/decompression operation is more important than the compression efficiency. Such applications can use fast hardware-based compression algorithms. Many such algorithms exist, including the custom-hardware-based ALDC [1], MXT [2], and 842 [3], and the FPGA-based XMatchPRO [4]. However, most of these solutions use expensive CAM (content-addressable memory) structures to implement the history windows (dictionaries) and achieve throughputs in the range of 100 MB/sec to 400 MB/sec. In this design, we present the 842B algorithm, a hardware-friendly, lossless compression algorithm derived from the original 842 algorithm [3], and its FPGA implementation. Instead of expensive CAMs, the proposed algorithm uses hashing-based dictionary lookups and offers a throughput of 1 GByte/sec per engine. The low-latency streaming architecture allows compression and decompression of data blocks of arbitrary size and can be placed directly on the transmission channels. Additionally, multiple overlapping sliding compression windows (dictionaries) of different lengths yield better compression efficiency. The compressor and decompressor designs presented here are very lean and require very modest FPGA resources. They are therefore suitable for use as small modules in application areas where FPGA-based systems can be applied, including signal and image processing, network routers, and transmitter-receiver systems, effectively increasing the CPU cycles and bandwidth resources available for other purposes in such systems.

2

1.1 How does compression work?

Compression relies on the fact that data is redundant: to some extent it was generated following rules, and if we can learn those rules we can predict the data accurately. A compressor reduces the size of a file by determining which data is more frequent and assigning it fewer bits than less frequent data. Compression thus has two parts: one guesses which symbols are most frequent, and the other outputs the "decision" of the first one.
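As a minimal illustration of the first, "modeling" half of this split, the C sketch below (illustrative only, not part of the 842B design) gathers byte-frequency statistics from a buffer; a coder would then assign the shortest code word to the most frequent value:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Count how often each byte value occurs in a buffer. A compressor's
 * "model" step amounts to gathering statistics like these; the "coder"
 * step then assigns shorter codes to the more frequent values. */
void byte_histogram(const uint8_t *buf, size_t len, uint32_t hist[256])
{
    memset(hist, 0, 256 * sizeof(uint32_t));
    for (size_t i = 0; i < len; i++)
        hist[buf[i]]++;
}

/* The most frequent byte is the best candidate for the shortest code. */
uint8_t most_frequent(const uint32_t hist[256])
{
    uint8_t best = 0;
    for (int v = 1; v < 256; v++)
        if (hist[v] > hist[best])
            best = (uint8_t)v;
    return best;
}
```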

1.2 Text and signals: lossless and lossy compression

We may want to compress different kinds of data, such as text, databases, binary programs, sound, images, and video. In practice we distinguish between text compression and signal compression. We make this separation because databases and binary programs have the same characteristics as text, while sound, images, and video are signals and thus share properties. Text and image data, on the other hand, have little in common, which is why they do not belong to the same group.

We also make this separation because we use different kinds of compression for the two groups, owing to the nature of the data. Digital signals are an imperfect representation of an analog signal, so when compressing them we can discard some of the information to achieve more compression. This is done with transformation and quantization algorithms.

Say we have a byte from an image whose value is 65, representing the amount of red in a given pixel. If after decompression this byte is 66, we would not notice the difference between red and slightly more red. However, if that were a text file, 65 would be 'A' (assuming ASCII), and there is a big difference if we decompress 66, which would be a 'B' instead of an 'A'. Due to the nature of text, we cannot afford errors.

So we use lossless compression for text, where the reconstructed file must match the original bit for bit, and lossy compression for signals, where some error is acceptable and in most cases is not even detected. Note, however, that signals can be compressed losslessly, though the compression achieved is then far worse than with lossy compression. In most cases, signal compression amounts to discarding as much data as possible while retaining as much quality as possible.

4

CHAPTER 2: 842B ALGORITHM

2.1 Introduction

The 842B algorithm identifies repeating patterns of size 8, 4, and 2 bytes in the input data stream and replaces them with 6- to 8-bit pointers to previously seen data. The algorithm follows the same principles as the original 842 algorithm [3]. Every 8-byte chunk of the input data is divided into 7 phrases (Figure 1), which are compared against previously seen phrases. The input phrases are stored in dictionaries for lookup when processing subsequent inputs. For constant-time phrase lookup, the address of each phrase in the dictionary (a pointer) is stored in a hash table, at the location given by the hash value of the phrase. During compression, the 7 sub-phrases of the 8-byte input are hashed into 7 keys, which are used to read the pointers from the hash tables. Using these pointers, 7 phrases are read from the dictionaries and compared against the input sub-phrases. The compressed output is generated as the smallest possible combination of the pointers and raw data, together with a template indicating the composition of the compressed data. Decompression involves decoding the template, extracting the pointers and raw phrases from the compressed data, reading the remaining phrases from the dictionaries, and reconstructing the uncompressed data. Reading phrases from the dictionaries requires reconstructing the dictionary contents; the dictionary is rebuilt on the fly by simply writing the post-decompression phrases back into the dictionaries, much as in the compression operation. Note that no hashing and no hash tables are required during decompression, since the pointers are already present in the compressed data. Unlike the original 842 algorithm, which uses a single phrase dictionary, the 842B algorithm uses three separate dictionaries, representing three different sliding history windows, one each for 8-, 4-, and 2-byte phrases. The three dictionaries redundantly store the 7 sub-phrases of the current input.

5

Figure 1: 8-Byte Input Split into 7 Phrases
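The split in Figure 1 can be sketched in C as follows. This is an illustrative reading of the figure, assuming the 7 phrases are the full 8-byte word, its two 4-byte halves, and its four 2-byte quarters; the byte ordering used by the actual hardware is an assumption here:

```c
#include <stdint.h>

/* The 7 phrases extracted from one 8-byte input: one 8-byte phrase,
 * two 4-byte phrases, and four 2-byte phrases. */
typedef struct {
    uint64_t p8;      /* one 8-byte phrase   */
    uint32_t p4[2];   /* two 4-byte phrases  */
    uint16_t p2[4];   /* four 2-byte phrases */
} phrases_t;

/* Split an 8-byte input word into its 7 sub-phrases by shifting and
 * truncating; in hardware this is simple wire selection. */
phrases_t split_phrases(uint64_t in)
{
    phrases_t p;
    p.p8 = in;
    for (int i = 0; i < 2; i++)
        p.p4[i] = (uint32_t)(in >> (32 * i));
    for (int i = 0; i < 4; i++)
        p.p2[i] = (uint16_t)(in >> (16 * i));
    return p;
}
```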

While redundant storage of data would seem wasteful, it has two benefits. Multi-port RAM arrays become expensive to implement as the port count increases. A 7-port RAM array is replaced with 1-, 2-, and 4-port RAM arrays for the 8-, 4-, and 2-byte phrases, respectively, trading off port count against RAM capacity. Once we make that tradeoff, the optimal dictionary size can be chosen for each of the 3 phrase lengths independently. In addition to the basic phrase comparisons, the 842B algorithm also incorporates performance enhancers such as detecting multiple consecutive 8-byte repeats and replacing them with just a 5-bit template and a repeat count. Long strings of zeros are also detected as a special case of repeats.
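The repeat detector can be sketched as below. This is a minimal C illustration of the idea only; the actual encoder's template format and repeat-count width are not reproduced here:

```c
#include <stdint.h>
#include <stddef.h>

/* Count how many consecutive 8-byte words at the start of the stream
 * repeat the first word. A run like this can be emitted as a single
 * template plus a repeat count instead of one compressed word per
 * input word. An all-zero run is simply the special case where the
 * repeated word is 0. */
size_t count_word_repeats(const uint64_t *words, size_t n)
{
    size_t run = 1;
    while (run < n && words[run] == words[0])
        run++;
    return run;
}
```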

6

CHAPTER 3: 842B FPGA DESIGN

3.1 Introduction

We present the FPGA design for the 842B compressor and decompressor. Both the compressor and the decompressor are designed as multi-stage pipelines that stream the input data, processing one block of input per cycle. The compression operation is feed-forward and lends itself well to pipelining. The dictionary write-back in the decompression pipeline, on the other hand, introduces a feedback loop and results in multiple data hazard conditions.

As stated earlier, the 842B algorithm operates on 7 different phrases for each 8-byte input. Although these phrases are independent of each other and can be processed in parallel, software implementations of the 842B algorithm process them sequentially. To exploit this parallelism and achieve improved performance, our FPGA pipelines include seven parallel data-paths, one for each phrase.

3.2 FPGA Compression Pipeline

Figure 2 shows the different stages of the compression pipeline as implemented in the FPGA. The compression pipeline takes 8 bytes of input per cycle and outputs one compressed data word and a template.

Every 8-byte input is broken into 7 phrases; the hashing, hash-table look-up, dictionary look-up, and phrase comparison for all of these phrases are performed in parallel, and a 7-bit match/mismatch status is generated. Based on the match/mismatch status, the encoder encodes the pointers and the raw input phrases into the smallest possible output. A template indicating the composition of the compressed output is also generated. The mapping from the match/mismatch status to the corresponding smallest output combination (and thus the template) is determined statically and is implemented on the FPGA using look-up tables. Given a match/mismatch status, the 5-bit template can be read directly from the look-up table.

7

Figure 2: FPGA Compression Pipeline

In addition to the above operations, the compression pipeline also performs the hash-table and dictionary writes. During the hash-table read/write cycle, the next dictionary address, generated sequentially by a counter, is stored in the hash table at the location given by the hash value of the phrase. Since many input phrases may hash to the same value, pointers may be overwritten by the latest pointer hashing to the same location. The implementation is analogous to a direct-mapped cache memory. Because this dictionary- and hash-table-based design uses regular RAM arrays instead of the CAMs typically used in other hardware compressors, it is more area-efficient and simpler to implement.

During the dictionary read/write cycle, the input phrase is written into the dictionary at the location given by the output of the counter. Note that since all 7 phrases are processed simultaneously, up to 4 simultaneous reads and 4 writes from and to the memory banks are needed. The FPGA block RAMs used to implement the dictionaries and the hash tables provide only two read/write ports, so we duplicate the dictionaries to support multiple read ports. This increases the memory requirements four-fold for a 4-port dictionary. Multiple writes are supported simply by performing a single wide write operation and thus do not demand further dictionary replication.
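A software analogy of the hash-table and dictionary writes might look like the C sketch below. The sizes are hypothetical placeholders (the real design's dictionary and table sizes are discussed later), and the code models one phrase stream rather than the 7 parallel data-paths:

```c
#include <stdint.h>

#define DICT_ENTRIES  256   /* hypothetical sizes, for illustration   */
#define HASH_ENTRIES 1024   /* ~4x the dictionary, as in the text     */

static uint64_t dict[DICT_ENTRIES];      /* phrase storage            */
static uint16_t hash_tab[HASH_ENTRIES];  /* phrase -> dict pointer    */
static uint16_t next_slot;               /* sequential write counter  */

/* Store a phrase: write it at the next sequential dictionary slot and
 * record that slot's address in the hash table at the phrase's hash.
 * A later phrase hashing to the same location simply overwrites the
 * pointer, exactly like a direct-mapped cache replacing a line. */
void dict_insert(uint64_t phrase, uint32_t hash)
{
    dict[next_slot] = phrase;
    hash_tab[hash % HASH_ENTRIES] = next_slot;
    next_slot = (next_slot + 1) % DICT_ENTRIES;  /* sliding-window wrap */
}

/* Look up a phrase: follow the pointer and verify the stored phrase
 * really matches, since different phrases can share a hash value. */
int dict_lookup(uint64_t phrase, uint32_t hash, uint16_t *ptr_out)
{
    uint16_t ptr = hash_tab[hash % HASH_ENTRIES];
    if (dict[ptr] == phrase) {
        *ptr_out = ptr;
        return 1;   /* match: emit the pointer            */
    }
    return 0;       /* miss: emit the raw phrase instead  */
}
```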

The compressor’s performance depends on many factors. The first is the dictionary size, which represents the amount of data that is “remembered”. The larger the dictionary, the more phrases are remembered, and hence the higher the probability of finding a phrase match. On the other hand, a larger dictionary requires longer pointers, which in turn increases the size of the compressed data. Larger dictionaries also require more FPGA resources. Thus, there is a tradeoff between allocated hardware resources and algorithm performance. Our simulations indicate a dictionary-size sweet spot that yields the best average compression ratio. This sweet spot occurs at different dictionary sizes for different phrase sizes. In our FPGA design, the dictionary sizes are 2KB, 2KB, and 512B for the 8-byte, 4-byte, and 2-byte phrase dictionaries, respectively.

The dictionaries also include a wrap-around mechanism: they wrap to the beginning when they fill up. This behavior represents windows, of sizes equal to the dictionary sizes, sliding over the data to be compressed. This wrap-around approach results in better compression than flushing the dictionaries once they are filled.

Another factor affecting the compression efficiency is the hashing scheme and the sizes of the hash tables. Efficient and effective hashing is one of the keys to achieving good compression performance. A hashing function generates the address of the hash-table location where the pointer to a dictionary entry is stored. Since hashing is a many-to-one function, i.e., many different phrases hash to the same value, dictionary pointers may be overwritten. It is therefore important to have a hashing scheme that spreads the hashes evenly across the entire hash table.

For an efficient hardware implementation, the hashing scheme must be (i) lightweight, requiring few resources, and (ii) simple, enabling a high operating frequency. The simplest hashing scheme is the modulo operation, which simply selects the low-order bits of the input phrase as the hash. This scheme, however, results in poor hash quality, creating multiple conflicts at certain table locations while leaving the rest untouched.

Our scheme instead builds XOR trees: the input phrase is bitwise ANDed with a constant, and the bits of the result are XORed together to generate one bit of the hash. Generating an N-bit hash from an M-bit input requires N M-bit constants. Hashing requires a total of M×N 2-input AND operations and N×(M−1) 2-input XOR operations. The optimal values of the constants are determined experimentally and hard-coded into the FPGA, reducing the AND operations to simple bit selection. The selected bits are then XORed using an XOR tree.
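A C sketch of this AND-then-XOR-fold scheme is below. The masks passed in are made-up placeholders for illustration; the real design hard-codes RIBM-derived constants, which reduces the AND stage to wire selection:

```c
#include <stdint.h>

/* Bit i of the hash is the parity (XOR of all bits) of (phrase & mask[i]).
 * The shift-and-XOR folds below implement the XOR tree: each step halves
 * the number of bits still carrying parity information. */
uint32_t xor_tree_hash(uint64_t phrase, const uint64_t *mask, int nbits)
{
    uint32_t hash = 0;
    for (int i = 0; i < nbits; i++) {
        uint64_t x = phrase & mask[i];   /* select bits (AND stage)  */
        x ^= x >> 32;                    /* fold 64 -> 32 bits       */
        x ^= x >> 16;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;                     /* parity now in bit 0      */
        hash |= (uint32_t)(x & 1) << i;
    }
    return hash;
}
```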

The selection of appropriate hash constants is critical for effective hashing. We use the Random Invertible Binary Matrix (RIBM) approach to generate the XOR tree’s hash constants [5, 6]. The random invertible binary matrix is produced off-line by filling a matrix randomly with 1s and 0s and checking it for invertibility, which ensures maximum dispersion in the output bits.

Even with very good hashing techniques, hash conflicts occur, overwriting a current pointer with a new one. Conflicts result in “forgetting” a previously written phrase, as the pointer to that phrase is lost. Although an N-way set-associative hash table could be used to increase the hash hit rate, we chose a large direct-mapped organization for its simplicity. Larger hash tables reduce the probability of pointers being overwritten and hence increase the chances of finding a previously written phrase in the dictionary. Larger hash tables, however, require more FPGA resources. Like regular caches, hash-table sizes can be optimized against performance, and increasing the table size beyond a certain point yields diminishing performance gains. For our design, a hash table with roughly 4 times the number of entries in the corresponding dictionary achieves good performance.

The pattern encoding for our design is shown in the following table, adapted from the C-Pack compression and decompression algorithm [7].

Table: Pattern encoding table for compression and decompression

10

3.3 FPGA Decompression Pipeline

Figure 3 shows the 842B decompression pipeline in the FPGA. One set of compressed data and its template is fed into the pipeline per cycle. The data decoder decodes the template and extracts the various pointers and raw phrases from the compressed data. The extracted pointers are used to read phrases from the three dictionaries.

Figure 3: FPGA Decompression pipeline

The data generator produces the 8-byte uncompressed output by selecting each 2-byte phrase from one of four sources, namely, the extracted raw phrases or data read from one of the three dictionaries. This module thus simply contains four 4:1 multiplexors. The select lines for these multiplexors are read directly from a look-up table using the 5-bit template, much as in the compressor design.
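The data generator’s selection step can be modeled in C roughly as follows. The source ordering and the per-lane 2-bit select encoding here are illustrative assumptions, not the real 5-bit template format:

```c
#include <stdint.h>

/* Rebuild the 8-byte output word: each of the four 2-byte lanes comes
 * from one of four sources (e.g., raw phrase from the compressed word,
 * or data read from the 8-, 4-, or 2-byte dictionary). src[s][lane] is
 * the candidate for lane `lane` from source `s`; sel[lane] is the 2-bit
 * select that the template look-up table would provide. */
uint64_t data_gen(const uint16_t src[4][4], const uint8_t sel[4])
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++)      /* four 4:1 multiplexors */
        out |= (uint64_t)src[sel[lane] & 3][lane] << (16 * lane);
    return out;
}
```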

To reconstruct the dictionaries on the fly, the decompressed output is written back into the three dictionaries. This dictionary write-back introduces a feedback path in the pipeline, which leads to possible data hazards: compressed data might contain a pointer to a dictionary location that has not yet been written. Four scenarios can lead to a data hazard. The first is where a pointer points to a phrase that arrived one cycle earlier (1-ahead). In this case, the dictionary data being read is still in the data-gen stage and has not yet been written into the dictionary. This data, however, is required in the data-gen stage in the next cycle, and hence can be forwarded. A hazard detection and data forwarding unit added to the data-gen stage detects this situation and forwards the data appropriately. Since there are three separate dictionaries, three forwarding units are required.

The other three data hazards occur when the read and write requests for the same address arrive during the same cycle (4-ahead), or when a read request arrives one or two cycles earlier than the write request (3-ahead and 2-ahead, respectively). The 4-ahead hazard is a true read-during-write condition, whereas the 2-ahead and 3-ahead hazards arise from pipelining, as a consequence of the read/write operations requiring more than one cycle. To address these hazards, a dictionary bypass unit is added at the output of each dictionary; it bypasses the dictionary and forwards the dictionary write data as the response to the read request.

The hazard detection logic in the decompressor could have been avoided by disallowing near pointers, i.e., pointers between phrases fewer than or equal to 4 cycles apart, during compression. This approach would have yielded simpler pipeline logic but reduced compression efficiency.

12

CHAPTER 4: SOFTWARE LANGUAGE/HARDWARE IMPLEMENTATION

This chapter gives details of programmable logic devices and Verilog HDL. Programmable devices such as PLDs, CPLDs, and FPGAs are explained. At the end, the history of Verilog HDL, the importance of HDLs, and the advantages of Verilog HDL are discussed.

4.1 Programmable Devices

Programmable devices are devices that can be configured by the user. Common programmable devices include PLDs, CPLDs, ASICs, and FPGAs.

4.1.1 Programmable Logic Devices

At the low end of the spectrum are the original Programmable Logic Devices

(PLDs). A programmable logic device is an IC that is user configurable and is capable of

implementing logic functions. These were the first chips that could be used to implement

a flexible digital logic design in hardware. In other words, one could remove a couple of

the 7400-series TTL parts (ANDs, ORs, and NOTs) from the board and replace them

with a single PLD. Other names for this class of device are Programmable Logic Array

(PLA), Programmable Array Logic (PAL), and Generic Array Logic (GAL).

PLDs have several clear advantages over the 7400-series TTL parts that they

replaced. First, of course, is that a single chip requires less board area, power, and wiring.

Another advantage is that the design inside the chip is flexible, so a change in the logic

doesn't require any rewiring of the board. Rather, simply replacing that one PLD with

another part that has been programmed with the new design can alter the decoding logic.

Inside each PLD is a set of fully connected macro cells. These macro cells typically consist of some amount of combinatorial logic (AND and OR gates) and a flip-flop.

In other words, a small Boolean logic equation can be built within each macro cell.

Hardware designs for these simple PLDs are generally written in languages like ABEL or

PALASM (the hardware equivalents of assembly) or drawn with the help of a schematic

capture tool.

13

4.1.2 Complex Programmable Devices

As chip densities increased, it was natural for the PLD manufacturers to evolve their products into larger (logically, though not necessarily physically) parts called Complex Programmable Logic Devices (CPLDs). For most practical purposes, CPLDs can be thought of as multiple PLDs (plus some programmable interconnect) in a single chip. The larger size of a CPLD allows it to implement either more logic equations or a more complicated design. In fact, these chips are large enough to replace dozens of 7400-series parts.

Figure 4 contains a block diagram of a CPLD. Each of the four logic blocks

shown is equivalent to one PLD. However, in an actual CPLD there may be more or less

than four logic blocks. These logic blocks are themselves comprised of macro cells and

interconnect wiring, just like an ordinary PLD.

Figure 4: Internal Structure of a CPLD

Unlike the programmable interconnect within a PLD, the switch matrix within a

CPLD may or may not be fully connected. In other words, some of the theoretically

possible connections between logic block outputs and inputs may not actually be

supported within a given CPLD. The effect of this is most often to make 100% utilization of the macro cells very difficult to achieve. Some hardware designs simply won't fit

within a given CPLD, even though there are sufficient logic gates and flip-flops

available.

Because CPLDs can hold larger designs than PLDs, their potential uses are more

varied. They are still sometimes used for simple applications like address decoding, but

more often contain high-performance control-logic or complex finite state machines. At

the high-end (in terms of numbers of gates), there is also a lot of overlap in potential

applications with FPGAs. Traditionally, CPLDs have been chosen over FPGAs whenever high-performance logic is required. Because of its less flexible internal architecture, the

delay through a CPLD (measured in nanoseconds) is more predictable and usually

shorter.

4.1.3 Field Programmable Gate Arrays (FPGA)

'Field Programmable' means that the FPGA's function is defined by a user's

program rather than by the manufacturer of the device. A typical integrated circuit

performs a particular function defined at the time of manufacture. In contrast, a program

written by someone other than the device manufacturer defines the FPGA’s function.

Depending on the particular device, the program is either ’burned’ in permanently or

semi-permanently as part of a board assembly process, or is loaded from an external

memory each time the device is powered up. This user programmability gives the user

access to complex integrated designs without the high engineering costs associated with

application specific integrated circuits (ASIC). The FPGA is an integrated circuit that

contains many (64 to over 10,000) identical logic cells that can be viewed as standard

components. The individual cells are interconnected by a matrix of wires and

programmable switches.

The logic cell architecture varies between different device families. Generally

speaking, each logic cell combines a few binary inputs (typically between 3 and 10) to

one or two outputs according to a Boolean logic function specified in the user program.

The cell's combinatorial logic may be physically implemented as a small look-up table

memory (LUT) or as a set of multiplexers and gates. LUT devices tend to be a bit more

flexible and provide more inputs per cell than multiplexer cells at the expense of

propagation delay.
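As an illustration of the point above (a Python sketch, not part of the project), a k-input LUT is simply a 2^k-entry truth table indexed by the packed input bits; the function name `make_lut` and the majority example are hypothetical:

```python
# Hypothetical software model of a k-input FPGA look-up table (LUT).
# A k-input LUT is just a 2**k-entry truth table indexed by the input bits.

def make_lut(truth_table):
    """Build a LUT from a list of 2**k output bits, indexed by input value."""
    def lut(*inputs):
        index = 0
        for bit in inputs:            # pack the input bits into a table index
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut

# Example: program a 3-input LUT as a majority-vote function.
majority = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
```

Any Boolean function of the inputs can be realized this way by changing only the table contents, which is what makes LUT-based cells so flexible.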


Figure 5: Internal Structure of an FPGA

The development of the FPGA was distinct from the PLD/CPLD evolution. There

are three key parts of its structure: logic blocks, interconnect, and I/O blocks. The I/O

blocks form a ring around the outer edge of the part. Each of these provides individually

selectable input, output, or bi-directional access to one of the general-purpose I/O pins on

the exterior of the FPGA package. Inside the ring of I/O blocks lies a rectangular array of

logic blocks. The wire connecting logic block to logic blocks and I/O to logic block is

called as programmable inter connect.


Figure 6: Internal Architecture of CLB

The logic blocks within an FPGA can be as small and simple as the macro cells in

a PLD (a so-called fine-grained architecture) or larger and more complex (coarse-grained).

However, they are never as large as an entire PLD, as the logic blocks of a CPLD are.

The logic blocks of a CPLD contain multiple macro cells, but the logic blocks in an

FPGA are generally nothing more than a couple of logic gates or a look-up table and a

flip-flop.

4.1.3.1 Advantages of FPGA

Because of all the extra flip-flops, FPGA density is higher (ranging from several

thousand gates to a few million gates), and the architecture of an FPGA is much more

flexible than that of a CPLD. This makes FPGAs better suited to register-heavy

applications. They are also often used where input data streams must be processed at a

very fast pace. In addition, FPGAs are usually denser (more gates in a given area) and

cost less than CPLDs, so they are the best choice for larger logic designs. FPGAs use

static memory, so they are reprogrammable.


4.2 Hardware Design and Development

A description of the hardware's structure and behavior is written in a high-level

hardware description language (usually VHDL or Verilog) and that code is then compiled

and downloaded prior to execution. Of course, schematic capture is also an option for

design entry, but it has become less popular as designs have become more complex and

the language-based tools have improved. The overall process of hardware development

for programmable logic is shown in Figure 7.

Figure 7: Design Flow

4.2.1 Design Entry

In the design entry process, the behavior of the circuit is described in a hardware

description language such as VHDL or Verilog.


4.2.2 Synthesis

First, an intermediate representation of the hardware design is produced. This step

is called synthesis, and the result is a representation called a netlist. In this step the

design is checked for syntax and semantic errors, and a synthesis report is created that

lists any errors and warnings. The netlist is device independent, so its contents do

not depend on the particulars of the FPGA or CPLD; it is usually stored in a standard

format called the Electronic Design Interchange Format (EDIF).

4.2.3 Simulation

A simulator is a software program that verifies the functionality of a circuit:

inputs are applied and the corresponding outputs are checked. If the expected outputs

are obtained, the circuit design is functionally correct. Simulation presents the output

waveforms as patterns of zeros and ones. Although problems

with the size or timing of the hardware may still crop up later, the designer can at least be

sure that his logic is functionally correct before going on to the next stage of

development.

4.2.4 Implementation

Device implementation maps the verified design onto the FPGA. The various steps

in design implementation are:

Translate

Map

Place and route

4.2.4.1 Translate

Translate converts the EDIF file to an NGD (Native Generic Description) file;

that is, the code is converted to gates, or netlists. The translate process generates a

translate report that lists any errors and warnings raised during translation. This report

also gives device and I/O utilization figures, which help the designer select the best

device.

4.2.4.2 Map

Mapping converts the NGD (Native Generic Description) file obtained from the

translate process to an NCD (Native Circuit Description) file; that is, the gates are

mapped onto physical components such as flip-flops and multiplexers.


4.2.4.3 Place and Route

Place is the process of selecting specific logic blocks in the FPGAs where design gates

will reside. Route is the physical routing of interconnect between logic blocks. This

means that logic blocks, CLB, I/O blocks are assigned to specific locations on die and

interconnections are made between them. This step involves mapping the logical

structures described in the net list onto actual macro cells, interconnections, and input and

output pins. This process is similar to the equivalent step in the development of a printed

circuit board, and it may likewise allow for either automatic or manual layout

optimizations. The result of the place & route process is a bitstream. This name is used

generically, despite the fact that each CPLD or FPGA (or family) has its own, usually

proprietary, bitstream format. The bitstream is the binary data that must be loaded into the

FPGA or CPLD to make the chip implement a particular hardware design.

4.3 Device Programming

Once the bitstream file is created for a particular FPGA or CPLD, it is downloaded

onto the device. The details of this process are dependent upon the chip's underlying

process technology. Programming technologies used are PROM (for one-time

programmable), EPROM, EEPROM, and Flash. Just like their memory counterparts,

PROM and EPROM based logic devices can only be programmed with the help of a

separate piece of lab equipment called a device programmer. On the other hand, many of

the devices based on EEPROM or Flash technology are in-circuit programmable. In other

words, the additional circuitry that's required to perform device reprogramming is

provided within the FPGA or CPLD silicon itself. This makes it possible to erase and

reprogram the device internals via a JTAG interface or from an on-board embedded

processor. In addition to non-volatile technologies, there are also programmable logic

devices based on SRAM technology. In such cases, the contents of the device are

volatile. This has both advantages and disadvantages. The obvious disadvantage is that

the internal logic must be reloaded after every system or chip reset. That means some

sort of additional memory chip is needed to hold the bitstream. But it also means that

the contents of the logic device can be changed.


4.4 Verilog HDL

The history of the Verilog HDL[8] goes back to the 1980s, when a company

called Gateway Design Automation developed a logic simulator, Verilog-XL, and with it

a hardware description language. Cadence Design Systems acquired Gateway in 1989

and with it the rights to the language and the simulator. In 1990, Cadence put the

language (but not the simulator) into the public domain, with the intention that it should

become a standard, nonproprietary language. The Verilog HDL is now maintained by a

nonprofit organization, Accellera, which was formed from the merger of Open

Verilog International (OVI) and VHDL International. OVI had the task of taking the

language through the IEEE standardization procedure. In December 1995 Verilog HDL

became IEEE Std. 1364-1995. A significantly revised version was published in 2001:

IEEE Std. 1364-2001. There was a further revision in 2005 but this only added a few

minor changes. Accellera has also developed a new standard, SystemVerilog, which

extends Verilog. SystemVerilog became an IEEE standard (1800-2005) in 2005. There is

also a draft standard for analog and mixed-signal extensions to Verilog, Verilog-AMS.

4.4.1 Importance of HDLs

HDLs have many advantages compared to traditional schematic based design.

Designs can be described at a very abstract level by use of HDLs. Designers can write

their RTL description without choosing a specific fabrication technology. Logic synthesis

tools can automatically convert the design to any fabrication technology. If a new

technology emerges, designers do not need to redesign their circuit.

Functional verification of the design can be done early in the design cycle.

HDLs give a better representation of the design, due to their simplicity when compared

to gate-level schematics.

Modification and optimization of the design become easy with HDLs.

HDLs cut down design cycle time significantly because the chance of a functional bug at a

later stage in the design flow is minimal [8].

4.4.2 Why Verilog?

Verilog HDL has evolved as a standard hardware description language. Verilog

HDL offers many useful features for hardware design.

Easy to learn and easy to use, due to its similarity in syntax to that of the C

programming language.

Different levels of abstraction can be mixed in the same design.

Availability of Verilog HDL libraries for post-logic synthesis simulation.

Most of the synthesis tools support Verilog HDL.

The Programming Language Interface (PLI) is a powerful feature that allows the

user to write custom C code to interact with the internal data structures of Verilog.

Designers can customize a Verilog HDL simulator to their needs with the PLI [8].


CHAPTER 5: RESULT AND DISCUSSION

Verification in ModelSim (Xilinx)

Compression

Figure 8: Simulation result 1 for Compression


Figure 9: Simulation result 2 for Compression


Figure 10: Simulation result 3 for Compression


Figure 11: Simulation result 4 for Compression


Figure 12: Simulation result 5 for Compression


Figure 13: Simulation result 6 for Compression


Figure 14: Simulation result 7 for Compression


Figure 15: Simulation result 8 for Compression


Decompression

Figure 16: Simulation result 1 for Decompression


Figure 17: Simulation result 2 for Decompression


Figure 18: Simulation result 3 for Decompression


Figure 19: Simulation result 4 for Decompression


Figure 20: Simulation result 5 for Decompression


Figure 21: Simulation result 6 for Decompression


Figure 22: Simulation result 7 for Decompression


Figure 23: Simulation result 8 for Decompression


HDL Synthesis Report - Compression


HDL Synthesis Report - Decompression


CHAPTER 6: CONCLUSION

Compression saved storage space and transmission bandwidth; a decompression

operation was performed when the data was subsequently read. A high-throughput,

streaming, lossless compression algorithm and its efficient implementation on FPGAs

were achieved. The proposed solution provided peak throughput on a PCIe-based FPGA

board carrying both compression and decompression engines, and this result represented

an overall speedup over the reference software implementation. With multiple engines

running in parallel, the proposed design provided a path to potential speedups of up to

two orders of magnitude. In the current implementation, the achievable overall

throughput was limited only by the available PCIe bus bandwidth. Our FPGA pipelines

included seven parallel datapaths, one for each phrase, to exploit parallelism and achieve

improved performance. The wraparound approach resulted in better compression than

the one that flushes the dictionaries once they fill.
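As a rough illustration of the dictionary-replacement policy compared above (a Python sketch under our own naming, not the project's RTL), a wraparound dictionary overwrites its oldest entry once full, so recent phrases stay available instead of being discarded wholesale:

```python
# Illustrative sketch (not the project's RTL) of the wraparound replacement
# policy: once the fixed-size dictionary fills, the oldest entry is
# overwritten, whereas a flushing policy would clear the whole dictionary.

class WraparoundDict:
    def __init__(self, size):
        self.entries = [None] * size   # fixed-size dictionary storage
        self.next = 0                  # next slot to (over)write

    def insert(self, phrase):
        self.entries[self.next] = phrase
        self.next = (self.next + 1) % len(self.entries)  # wrap around

    def contains(self, phrase):
        return phrase in self.entries

d = WraparoundDict(2)
for phrase in ("ab", "cd", "ef"):   # the third insert overwrites "ab"
    d.insert(phrase)
```

With flushing, filling the dictionary would also evict "cd", losing a phrase that may still be a useful match.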


REFERENCES

[1] Craft, D. J. “A fast hardware data compression algorithm and some algorithmic

extensions”, IBM Journal of Research and Development, 42(6), 733 – 745, November

1998.

[2] Tremaine, R.B., et al., "IBM Memory Expansion Technology (MXT)", IBM Journal

of Research and Development, 45(2), 271-285, March 2001.

[3] Franaszek, P. A., Lastras, L. A., Peng, S., and Robinson, J. T., “Data Compression

with Restricted Parsings", Data Compression Conference (DCC'06), 203-212, 2006.

[4] Núñez, J. L., et al., "X-MatchPRO: A ProASIC-Based 200 Mbytes/s Full-Duplex

Lossless Data Compressor”, Lecture Notes in Computer Science, 2147/2001, 613-617,

January 2001.

[5] Qureshi, M. K., et al., "Enhancing Lifetime and Security of PCM-Based Main

Memory with Start-Gap Wear Leveling", 42nd International Symposium on

Microarchitecture (MICRO 2009), December 2009.

[6] Vandierendonck, H. and De Bosschere, K. “XOR-based hash functions”, IEEE

Transactions on Computers, 54(7), 800- 812, July 2005.

[7] Xi Chen, Lei Yang, Robert P. Dick, Li Shang, and Haris Lekatsas, "C-Pack: A

High-Performance Microprocessor Cache Compression Algorithm", IEEE, August 2010.

[8] Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, 3rd Edition,

SunSoft Press, 1996.


APPENDIX

VERILOG HDL FOR COMPRESSION AND DECOMPRESSION,

VERIFICATION AND REPORT

Compression

Top module:

module top_comp_new(sel,clk,reset,dataout);

input [3:0] sel;

input clk;

input reset;

output [7:0] dataout;

wire [3:0] sel;

wire clk;

wire reset;

wire [63:0] datain;

wire [7:0] dataout;

wire [67:0] en_data_out;

wire [63:0] key8;

wire [31:0] key4_1;

wire [31:0] key4_2;

wire [15:0] key2_1;

wire [15:0] key2_2;

wire [15:0] key2_3;

wire [15:0] key2_4;

wire [3:0] addr8;

wire [3:0] addr4_1;

wire [3:0] addr4_2;

wire [3:0] addr2_1;

wire [3:0] addr2_2;

wire [3:0] addr2_3;

wire [3:0] addr2_4;

wire mis8;

wire mis4_1;

wire mis4_2;

wire mis2_1;

wire mis2_2;


wire mis2_3;

wire mis2_4;

wire [63:0] data8;

wire [31:0] data4_1;

wire [31:0] data4_2;

wire [15:0] data2_1;

wire [15:0] data2_2;

wire [15:0] data2_3;

wire [15:0] data2_4;

wire pout8;

wire pout4_1;

wire pout4_2;

wire pout2_1;

wire pout2_2;

wire pout2_3;

wire pout2_4;

assign key8 = datain;

assign key4_1 = datain[63:32];

assign key4_2 = datain[31:0];

assign key2_1 = datain[63:48];

assign key2_2 = datain[47:32];

assign key2_3 = datain[31:16];

assign key2_4 = datain[15:0];

assign datain = 64'haabbccdd12345678;

Hash8 c0(key8,clk,reset,addr8);

Hash4 c1(key4_1,key4_2,clk,reset,addr4_1,addr4_2);

Hash2 c2(key2_1,key2_2,key2_3,key2_4,clk,reset,addr2_1,addr2_2,addr2_3,addr2_4);

dict8 c3(addr8,key8,mis8,clk,reset,data8);

dict4 c4(addr4_1,addr4_2,key4_1,key4_2,mis4_1,mis4_2,clk,reset,data4_1,data4_2);

dict2 c5(addr2_1,addr2_2,addr2_3,addr2_4,key2_1,key2_2,key2_3,key2_4,mis2_1,mis2_2,mis2_3,mis2_4,clk,reset,data2_1,data2_2,data2_3,data2_4);

phase_comp c6(datain,data8,data4_1,data4_2,data2_1,data2_2,data2_3,data2_4,clk,reset,mis8,mis4_1,mis4_2,mis2_1,mis2_2,mis2_3,mis2_4,pout8,pout4_1,pout4_2,pout2_1,pout2_2,pout2_3,pout2_4);


encoder c7(clk,reset,datain,pout8,pout4_1,pout4_2,pout2_1,pout2_2,pout2_3,pout2_4,addr8,

addr4_1,addr4_2,addr2_1,addr2_2,addr2_3,addr2_4,en_data_out);

assign dataout = (sel == 4'b0000) ? en_data_out[7:0]:

(sel == 4'b0001) ? en_data_out[15:8]:

(sel == 4'b0010) ? en_data_out[23:16]:

(sel == 4'b0011) ? en_data_out[31:24]:

(sel == 4'b0100) ? en_data_out[39:32]:

(sel == 4'b0101) ? en_data_out[47:40]:

(sel == 4'b0110) ? en_data_out[55:48]:

(sel == 4'b0111) ? en_data_out[63:56]:

(sel == 4'b1000) ? en_data_out[67:64]: 8'b0;

endmodule
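The sel multiplexer above streams the 68-bit encoded word out one byte at a time. Behaviorally (a hypothetical Python model for illustration only, with our own function name `select_byte`), it is a shift-and-mask:

```python
# Hypothetical behavioral model of the sel multiplexer in top_comp_new:
# sel values 0-7 select successive bytes of the 68-bit encoded word
# en_data_out, and sel == 8 selects its top four bits (zero-extended).

def select_byte(en_data_out, sel):
    if 0 <= sel <= 7:
        return (en_data_out >> (8 * sel)) & 0xFF   # en_data_out[8*sel+7 : 8*sel]
    if sel == 8:
        return (en_data_out >> 64) & 0xF           # en_data_out[67:64]
    return 0                                       # default branch of the mux
```

Nine applications of this selection (sel = 0 through 8) recover the full 68-bit encoder output over an 8-bit port.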

2 byte hash:

module Hash2(key2_1,key2_2,key2_3,key2_4,clk,reset,addr2_1,addr2_2,addr2_3,addr2_4);

input [15:0] key2_1;

input [15:0] key2_2;

input [15:0] key2_3;

input [15:0] key2_4;

input clk;

input reset;

output [3:0] addr2_1;

output [3:0] addr2_2;

output [3:0] addr2_3;

output [3:0] addr2_4;

wire [15:0] key2_1;

wire [15:0] key2_2;

wire [15:0] key2_3;

wire [15:0] key2_4;

wire clk;

wire reset;

reg [3:0] addr2_1;

reg [3:0] addr2_2;

reg [3:0] addr2_3;

reg [3:0] addr2_4;

reg [19:0] hashtable2_1 [7:0];

reg [19:0] hashtable2_2 [7:0];

reg [19:0] hashtable2_3 [7:0];


reg [19:0] hashtable2_4 [7:0];

reg match_1;

reg match_2;

reg match_3;

reg match_4;

reg [3:0] count_1;

reg [3:0] count_2;

reg [3:0] count_3;

reg [3:0] count_4;

reg [2:0] i_1;

reg [2:0] i_2;

reg [2:0] i_3;

reg [2:0] i_4;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable2_1[0] <= 20'd0;

hashtable2_1[1] <= 20'd0;

hashtable2_1[2] <= 20'd0;

hashtable2_1[3] <= 20'd0;

hashtable2_1[4] <= 20'd0;

hashtable2_1[5] <= 20'd0;

hashtable2_1[6] <= 20'd0;

hashtable2_1[7] <= 20'd0;

count_1 <= 4'd0;

i_1 <= 3'd0;

end

else if(match_1 == 1'b0)

begin

hashtable2_1[i_1] <= {key2_1,count_1};

count_1 <= count_1 + 1;

i_1 <= i_1 + 1;

end

end

always@( reset or key2_1)

begin

if(reset == 1'b1)

begin

addr2_1 = 4'd0;


match_1 = 1'd0;

end

else if(key2_1 == hashtable2_1[0][19:4])

begin

addr2_1 = hashtable2_1[0][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[1][19:4])

begin

addr2_1 = hashtable2_1[1][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[2][19:4])

begin

addr2_1 = hashtable2_1[2][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[3][19:4])

begin

addr2_1 = hashtable2_1[3][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[4][19:4])

begin

addr2_1 = hashtable2_1[4][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[5][19:4])

begin

addr2_1 = hashtable2_1[5][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[6][19:4])

begin

addr2_1 = hashtable2_1[6][3:0];

match_1 = 1'd1;

end

else if(key2_1 == hashtable2_1[7][19:4])


begin

addr2_1 = hashtable2_1[7][3:0];

match_1 = 1'd1;

end

else

begin

addr2_1 = addr2_1;

match_1 = 1'd0;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable2_2[0] <= 20'd0;

hashtable2_2[1] <= 20'd0;

hashtable2_2[2] <= 20'd0;

hashtable2_2[3] <= 20'd0;

hashtable2_2[4] <= 20'd0;

hashtable2_2[5] <= 20'd0;

hashtable2_2[6] <= 20'd0;

hashtable2_2[7] <= 20'd0;

count_2 <= 4'd0;

i_2 <= 3'd0;

end

else if(match_2 == 1'b0)

begin

hashtable2_2[i_2] <= {key2_2,count_2};

count_2 <= count_2 + 1;

i_2 <= i_2 + 1;

end

end

always@( reset or key2_2 )

begin

if(reset == 1'b1)

begin

addr2_2 = 4'd0;

match_2 = 1'd0;

end

else if(key2_2 == hashtable2_2[0][19:4])

begin


addr2_2 = hashtable2_2[0][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[1][19:4])

begin

addr2_2 = hashtable2_2[1][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[2][19:4])

begin

addr2_2 = hashtable2_2[2][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[3][19:4])

begin

addr2_2 = hashtable2_2[3][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[4][19:4])

begin

addr2_2 = hashtable2_2[4][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[5][19:4])

begin

addr2_2 = hashtable2_2[5][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[6][19:4])

begin

addr2_2 = hashtable2_2[6][3:0];

match_2 = 1'd1;

end

else if(key2_2 == hashtable2_2[7][19:4])

begin

addr2_2 = hashtable2_2[7][3:0];

match_2 = 1'd1;

end


else

begin

addr2_2 = addr2_2;

match_2 = 1'd0;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable2_3[0] <= 20'd0;

hashtable2_3[1] <= 20'd0;

hashtable2_3[2] <= 20'd0;

hashtable2_3[3] <= 20'd0;

hashtable2_3[4] <= 20'd0;

hashtable2_3[5] <= 20'd0;

hashtable2_3[6] <= 20'd0;

hashtable2_3[7] <= 20'd0;

count_3 <= 4'd0;

i_3 <= 3'd0;

end

else if(match_3 == 1'b0)

begin

hashtable2_3[i_3] <= {key2_3,count_3};

count_3 <= count_3 + 1;

i_3 <= i_3 + 1;

end

end

always@( reset or key2_3)

begin

if(reset == 1'b1)

begin

addr2_3 = 4'd0;

match_3 = 1'd0;

end

else if(key2_3 == hashtable2_3[0][19:4])

begin

addr2_3 = hashtable2_3[0][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[1][19:4])


begin

addr2_3 = hashtable2_3[1][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[2][19:4])

begin

addr2_3 = hashtable2_3[2][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[3][19:4])

begin

addr2_3 = hashtable2_3[3][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[4][19:4])

begin

addr2_3 = hashtable2_3[4][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[5][19:4])

begin

addr2_3 = hashtable2_3[5][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[6][19:4])

begin

addr2_3 = hashtable2_3[6][3:0];

match_3 = 1'd1;

end

else if(key2_3 == hashtable2_3[7][19:4])

begin

addr2_3 = hashtable2_3[7][3:0];

match_3 = 1'd1;

end

else

begin

addr2_3 = addr2_3;

match_3 = 1'd0;

end


end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable2_4[0] <= 20'd0;

hashtable2_4[1] <= 20'd0;

hashtable2_4[2] <= 20'd0;

hashtable2_4[3] <= 20'd0;

hashtable2_4[4] <= 20'd0;

hashtable2_4[5] <= 20'd0;

hashtable2_4[6] <= 20'd0;

hashtable2_4[7] <= 20'd0;

count_4 <= 4'd0;

i_4 <= 3'd0;

end

else if(match_4 == 1'b0)

begin

hashtable2_4[i_4] <= {key2_4,count_4};

count_4 <= count_4 + 1;

i_4 <= i_4 + 1;

end

end

always@( reset or key2_4 )

begin

if(reset == 1'b1)

begin

addr2_4 = 4'd0;

match_4 = 1'd0;

end

else if(key2_4 == hashtable2_4[0][19:4])

begin

addr2_4 = hashtable2_4[0][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[1][19:4])

begin

addr2_4 = hashtable2_4[1][3:0];

match_4 = 1'd1;

end


else if(key2_4 == hashtable2_4[2][19:4])

begin

addr2_4 = hashtable2_4[2][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[3][19:4])

begin

addr2_4 = hashtable2_4[3][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[4][19:4])

begin

addr2_4 = hashtable2_4[4][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[5][19:4])

begin

addr2_4 = hashtable2_4[5][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[6][19:4])

begin

addr2_4 = hashtable2_4[6][3:0];

match_4 = 1'd1;

end

else if(key2_4 == hashtable2_4[7][19:4])

begin

addr2_4 = hashtable2_4[7][3:0];

match_4 = 1'd1;

end

else

begin

addr2_4 = addr2_4;

match_4 = 1'd0;

end

end

endmodule
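Each hash lane above searches all eight table entries in parallel for the incoming key and, on a miss, appends {key, count} at the next free slot. A compact software model of one lane's behavior (an illustrative sketch with our own class name `HashLane`, not the synthesized logic) is:

```python
# Illustrative software model of one hash lane (Hash2/Hash4/Hash8): an
# eight-entry table of (key, addr) pairs searched associatively. On a miss
# the key is recorded together with the next dictionary address (count).

class HashLane:
    def __init__(self):
        self.table = []     # up to eight (key, addr) pairs
        self.count = 0      # next dictionary address to assign

    def lookup(self, key):
        """Return (addr, matched); on a miss, record the key at a new addr."""
        for k, addr in self.table:
            if k == key:
                return addr, True
        if len(self.table) < 8:
            self.table.append((key, self.count & 0xF))
            self.count += 1
        return None, False

lane = HashLane()
lane.lookup(0xAABB)          # first occurrence: miss, recorded at address 0
```

The Verilog realizes the linear search as a chain of eight parallel comparators, so the lookup completes in a single combinational evaluation rather than a loop.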

4 byte hash:


module Hash4(key4_1,key4_2,clk,reset,addr4_1,addr4_2);

input [31:0] key4_1;

input [31:0] key4_2;

input clk;

input reset;

output [3:0] addr4_1;

output [3:0] addr4_2;

wire [31:0] key4_1;

wire [31:0] key4_2;

wire clk;

wire reset;

reg [3:0] addr4_1;

reg [3:0] addr4_2;

reg [35:0] hashtable4_1 [7:0];

reg match_1;

reg [3:0] count_1;

reg [2:0] i_1;

reg [35:0] hashtable4_2 [7:0];

reg match_2;

reg [3:0] count_2;

reg [2:0] i_2;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable4_1[0] <= 36'd0;

hashtable4_1[1] <= 36'd0;

hashtable4_1[2] <= 36'd0;

hashtable4_1[3] <= 36'd0;

hashtable4_1[4] <= 36'd0;

hashtable4_1[5] <= 36'd0;

hashtable4_1[6] <= 36'd0;

hashtable4_1[7] <= 36'd0;

count_1 <= 4'd0;

i_1 <= 3'd0;

end

else if(match_1 == 1'b0)

begin

hashtable4_1[i_1] <= {key4_1,count_1};


count_1 <= count_1 + 1;

i_1 <= i_1 + 1;

end

end

always@( reset or key4_1)

begin

if(reset == 1'b1)

begin

addr4_1 = 4'd0;

match_1 = 1'd0;

end

else if(key4_1 == hashtable4_1[0][35:4])

begin

addr4_1 = hashtable4_1[0][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[1][35:4])

begin

addr4_1 = hashtable4_1[1][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[2][35:4])

begin

addr4_1 = hashtable4_1[2][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[3][35:4])

begin

addr4_1 = hashtable4_1[3][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[4][35:4])

begin

addr4_1 = hashtable4_1[4][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[5][35:4])

begin

addr4_1 = hashtable4_1[5][3:0];


match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[6][35:4])

begin

addr4_1 = hashtable4_1[6][3:0];

match_1 = 1'd1;

end

else if(key4_1 == hashtable4_1[7][35:4])

begin

addr4_1 = hashtable4_1[7][3:0];

match_1 = 1'd1;

end

else

begin

addr4_1 = addr4_1;

match_1 = 1'd0;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable4_2[0] <= 36'd0;

hashtable4_2[1] <= 36'd0;

hashtable4_2[2] <= 36'd0;

hashtable4_2[3] <= 36'd0;

hashtable4_2[4] <= 36'd0;

hashtable4_2[5] <= 36'd0;

hashtable4_2[6] <= 36'd0;

hashtable4_2[7] <= 36'd0;

count_2 <= 4'd0;

i_2 <= 3'd0;

end

else if(match_2 == 1'b0)

begin

hashtable4_2[i_2] <= {key4_2,count_2};

count_2 <= count_2 + 1;

i_2 <= i_2 + 1;

end

end


always@( reset or key4_2 )

begin

if(reset == 1'b1)

begin

addr4_2 = 4'd0;

match_2 = 1'd0;

end

else if(key4_2 == hashtable4_2[0][35:4])

begin

addr4_2 = hashtable4_2[0][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[1][35:4])

begin

addr4_2 = hashtable4_2[1][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[2][35:4])

begin

addr4_2 = hashtable4_2[2][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[3][35:4])

begin

addr4_2 = hashtable4_2[3][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[4][35:4])

begin

addr4_2 = hashtable4_2[4][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[5][35:4])

begin

addr4_2 = hashtable4_2[5][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[6][35:4])

begin


addr4_2 = hashtable4_2[6][3:0];

match_2 = 1'd1;

end

else if(key4_2 == hashtable4_2[7][35:4])

begin

addr4_2 = hashtable4_2[7][3:0];

match_2 = 1'd1;

end

else

begin

addr4_2 = addr4_2;

match_2 = 1'd0;

end

end

endmodule

8 byte hash:

module Hash8(key8,clk,reset,addr8);

input [63:0] key8;

input clk;

input reset;

output [3:0] addr8;

wire [63:0] key8;

wire clk;

wire reset;

reg [3:0] addr8;

reg [67:0] hashtable8 [7:0];

reg match;

reg [3:0] count;

reg [2:0] i;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

hashtable8[0] <= 68'd0;

hashtable8[1] <= 68'd0;

hashtable8[2] <= 68'd0;

hashtable8[3] <= 68'd0;

hashtable8[4] <= 68'd0;


hashtable8[5] <= 68'd0;

hashtable8[6] <= 68'd0;

hashtable8[7] <= 68'd0;

count <= 4'd0;

i <= 3'd0;

end

else if(match == 1'b0)

begin

hashtable8[i] <= {key8,count};

count <= count + 1;

i <= i + 1;

end

end

always@( reset or key8 )

begin

if(reset == 1'b1)

begin

addr8 = 4'd0;

match = 1'd0;

end

else if(key8 == hashtable8[0][67:4])

begin

addr8 = hashtable8[0][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[1][67:4])

begin

addr8 = hashtable8[1][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[2][67:4])

begin

addr8 = hashtable8[2][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[3][67:4])

begin

addr8 = hashtable8[3][3:0];

match = 1'd1;

end


else if(key8 == hashtable8[4][67:4])

begin

addr8 = hashtable8[4][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[5][67:4])

begin

addr8 = hashtable8[5][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[6][67:4])

begin

addr8 = hashtable8[6][3:0];

match = 1'd1;

end

else if(key8 == hashtable8[7][67:4])

begin

addr8 = hashtable8[7][3:0];

match = 1'd1;

end

else

begin

addr8 = addr8;

match = 1'd0;

end

end

endmodule

2 byte dictionary:

module dict2(addr2_1,addr2_2,addr2_3,addr2_4,key2_1,key2_2,key2_3,key2_4,mis2_1,mis2_2,mis2_3,mis2_4,clk,reset,data2_1,data2_2,data2_3,data2_4);

input [3:0] addr2_1;

input [3:0] addr2_2;

input [3:0] addr2_3;

input [3:0] addr2_4;

input clk;

input reset;

input mis2_1;


input mis2_2;

input mis2_3;

input mis2_4;

input [15:0] key2_1;

input [15:0] key2_2;

input [15:0] key2_3;

input [15:0] key2_4;

output [15:0] data2_1;

output [15:0] data2_2;

output [15:0] data2_3;

output [15:0] data2_4;

wire [3:0] addr2_1;

wire [3:0] addr2_2;

wire [3:0] addr2_3;

wire [3:0] addr2_4;

wire [15:0] key2_1;

wire [15:0] key2_2;

wire [15:0] key2_3;

wire [15:0] key2_4;

wire clk;

wire reset;

wire mis2_1;

wire mis2_2;

wire mis2_3;

wire mis2_4;

wire [15:0] data2_1;

wire [15:0] data2_2;

wire [15:0] data2_3;

wire [15:0] data2_4;

reg [15:0] dictionary2_1 [15:0];

reg [15:0] dictionary2_2 [15:0];

reg [15:0] dictionary2_3 [15:0];

reg [15:0] dictionary2_4 [15:0];

reg [3:0] count2_1;

reg [3:0] count2_2;

reg [3:0] count2_3;


reg [3:0] count2_4;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_1[0] <= 16'd0;

dictionary2_1[1] <= 16'd0;

dictionary2_1[2] <= 16'd0;

dictionary2_1[3] <= 16'd0;

dictionary2_1[4] <= 16'd0;

dictionary2_1[5] <= 16'd0;

dictionary2_1[6] <= 16'd0;

dictionary2_1[7] <= 16'd0;

dictionary2_1[8] <= 16'd0;

dictionary2_1[9] <= 16'd0;

dictionary2_1[10] <= 16'd0;

dictionary2_1[11] <= 16'd0;

dictionary2_1[12] <= 16'd0;

dictionary2_1[13] <= 16'd0;

dictionary2_1[14] <= 16'd0;

dictionary2_1[15] <= 16'd0;

count2_1 <= 4'd0;

end

else if(mis2_1 == 1'b1)

begin

dictionary2_1[count2_1] <= key2_1;

count2_1 <= count2_1 + 1;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_2[0] <= 16'd0;

dictionary2_2[1] <= 16'd0;

dictionary2_2[2] <= 16'd0;

dictionary2_2[3] <= 16'd0;

dictionary2_2[4] <= 16'd0;

dictionary2_2[5] <= 16'd0;

dictionary2_2[6] <= 16'd0;

dictionary2_2[7] <= 16'd0;

dictionary2_2[8] <= 16'd0;

dictionary2_2[9] <= 16'd0;


dictionary2_2[10] <= 16'd0;

dictionary2_2[11] <= 16'd0;

dictionary2_2[12] <= 16'd0;

dictionary2_2[13] <= 16'd0;

dictionary2_2[14] <= 16'd0;

dictionary2_2[15] <= 16'd0;

count2_2 <= 4'd0;

end

else if(mis2_2 == 1'b1)

begin

dictionary2_2[count2_2] <= key2_2;

count2_2 <= count2_2 + 1;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_3[0] <= 16'd0;

dictionary2_3[1] <= 16'd0;

dictionary2_3[2] <= 16'd0;

dictionary2_3[3] <= 16'd0;

dictionary2_3[4] <= 16'd0;

dictionary2_3[5] <= 16'd0;

dictionary2_3[6] <= 16'd0;

dictionary2_3[7] <= 16'd0;

dictionary2_3[8] <= 16'd0;

dictionary2_3[9] <= 16'd0;

dictionary2_3[10] <= 16'd0;

dictionary2_3[11] <= 16'd0;

dictionary2_3[12] <= 16'd0;

dictionary2_3[13] <= 16'd0;

dictionary2_3[14] <= 16'd0;

dictionary2_3[15] <= 16'd0;

count2_3 <= 4'd0;

end

else if(mis2_3 == 1'b1)

begin

dictionary2_3[count2_3] <= key2_3;

count2_3 <= count2_3 + 1;

end

end


always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_4[0] <= 16'd0;

dictionary2_4[1] <= 16'd0;

dictionary2_4[2] <= 16'd0;

dictionary2_4[3] <= 16'd0;

dictionary2_4[4] <= 16'd0;

dictionary2_4[5] <= 16'd0;

dictionary2_4[6] <= 16'd0;

dictionary2_4[7] <= 16'd0;

dictionary2_4[8] <= 16'd0;

dictionary2_4[9] <= 16'd0;

dictionary2_4[10] <= 16'd0;

dictionary2_4[11] <= 16'd0;

dictionary2_4[12] <= 16'd0;

dictionary2_4[13] <= 16'd0;

dictionary2_4[14] <= 16'd0;

dictionary2_4[15] <= 16'd0;

count2_4 <= 4'd0;

end

else if(mis2_4 == 1'b1)

begin

dictionary2_4[count2_4] <= key2_4; // index with count2_4, not count2_1

count2_4 <= count2_4 + 1;

end

end

assign data2_1 = dictionary2_1[addr2_1];

assign data2_2 = dictionary2_2[addr2_2];

assign data2_3 = dictionary2_3[addr2_3];

assign data2_4 = dictionary2_4[addr2_4];

endmodule
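
The update policy shared by all of the compression dictionaries above is simple: on a miss, the key is written at the current counter position and the 4-bit counter advances, so after 16 misses it wraps and the oldest entry is overwritten. The Python below is an illustrative software model only, not part of the project's Verilog; the class name `Dict16` is invented here.

```python
class Dict16:
    """Software model of one 16-entry compression dictionary.

    Mirrors the Verilog blocks above: reset clears every entry and the
    counter; a miss stores the key at index `count` and increments the
    4-bit counter, which wraps modulo 16 (FIFO-style replacement).
    """

    SIZE = 16

    def __init__(self):
        self.reset()

    def reset(self):
        self.entries = [0] * self.SIZE
        self.count = 0

    def update_on_miss(self, key):
        self.entries[self.count] = key
        self.count = (self.count + 1) % self.SIZE  # 4-bit wrap-around

    def lookup(self, addr):
        # Combinational read, like `assign data = dictionary[addr];`
        return self.entries[addr]
```

Because the counter simply wraps, new keys begin overwriting the oldest entries after 16 misses, which is why the hardware needs no explicit eviction logic.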

4 byte dictionary:

module dict4(addr4_1,addr4_2,key4_1,key4_2,mis4_1,mis4_2,clk,reset,data4_1,data4_2);

input [3:0] addr4_1;

input [3:0] addr4_2;

input [31:0] key4_1;


input [31:0] key4_2;

input clk;

input reset;

input mis4_1;

input mis4_2;

output [31:0] data4_1;

output [31:0] data4_2;

wire [3:0] addr4_1;

wire [3:0] addr4_2;

wire [31:0] key4_1;

wire [31:0] key4_2;

wire clk;

wire reset;

wire mis4_1;

wire mis4_2;

wire [31:0] data4_1;

wire [31:0] data4_2;

reg [31:0] dictionary4_1 [15:0];

reg [31:0] dictionary4_2 [15:0];

reg [3:0] count1;

reg [3:0] count2;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary4_1[0] <= 32'd0;

dictionary4_1[1] <= 32'd0;

dictionary4_1[2] <= 32'd0;

dictionary4_1[3] <= 32'd0;

dictionary4_1[4] <= 32'd0;

dictionary4_1[5] <= 32'd0;

dictionary4_1[6] <= 32'd0;

dictionary4_1[7] <= 32'd0;

dictionary4_1[8] <= 32'd0;

dictionary4_1[9] <= 32'd0;

dictionary4_1[10] <= 32'd0;

dictionary4_1[11] <= 32'd0;

dictionary4_1[12] <= 32'd0;

dictionary4_1[13] <= 32'd0;

dictionary4_1[14] <= 32'd0;

dictionary4_1[15] <= 32'd0;

count1 <= 4'd0;

end


else if(mis4_1 == 1'b1)

begin

dictionary4_1[count1] <= key4_1;

count1 <= count1 + 1;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary4_2[0] <= 32'd0;

dictionary4_2[1] <= 32'd0;

dictionary4_2[2] <= 32'd0;

dictionary4_2[3] <= 32'd0;

dictionary4_2[4] <= 32'd0;

dictionary4_2[5] <= 32'd0;

dictionary4_2[6] <= 32'd0;

dictionary4_2[7] <= 32'd0;

dictionary4_2[8] <= 32'd0;

dictionary4_2[9] <= 32'd0;

dictionary4_2[10] <= 32'd0;

dictionary4_2[11] <= 32'd0;

dictionary4_2[12] <= 32'd0;

dictionary4_2[13] <= 32'd0;

dictionary4_2[14] <= 32'd0;

dictionary4_2[15] <= 32'd0;

count2 <= 4'd0;

end

else if(mis4_2 == 1'b1)

begin

dictionary4_2[count2] <= key4_2;

count2 <= count2 + 1;

end

end

assign data4_1 = dictionary4_1[addr4_1];

assign data4_2 = dictionary4_2[addr4_2];

endmodule

8 byte dictionary:

module dict8(addr8,key8,mis8,clk,reset,data8);

input [3:0] addr8;


input [63:0] key8;

input clk;

input reset;

input mis8;

output [63:0] data8;

wire [3:0] addr8;

wire [63:0] key8;

wire clk;

wire reset;

wire mis8;

wire [63:0] data8;

reg [63:0] dictionary8 [15:0];

reg [3:0] count;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary8[0] <= 64'd0;

dictionary8[1] <= 64'd0;

dictionary8[2] <= 64'd0;

dictionary8[3] <= 64'd0;

dictionary8[4] <= 64'd0;

dictionary8[5] <= 64'd0;

dictionary8[6] <= 64'd0;

dictionary8[7] <= 64'd0;

dictionary8[8] <= 64'd0;

dictionary8[9] <= 64'd0;

dictionary8[10] <= 64'd0;

dictionary8[11] <= 64'd0;

dictionary8[12] <= 64'd0;

dictionary8[13] <= 64'd0;

dictionary8[14] <= 64'd0;

dictionary8[15] <= 64'd0;

count <= 4'd0;

end

else if(mis8 == 1'b1)

begin

dictionary8[count] <= key8;

count <= count + 1;

end

end


assign data8 = dictionary8[addr8];

endmodule

Phase comparator:

module phase_comp(data8,

d8,

d4_1,

d4_2,

d2_1,

d2_2,

d2_3,

d2_4,

clk,

reset,

mis8,

mis4_1,

mis4_2,

mis2_1,

mis2_2,

mis2_3,

mis2_4,

pout8,

pout4_1,

pout4_2,

pout2_1,

pout2_2,

pout2_3,

pout2_4);

input [63:0] data8;

input [63:0] d8;

input [31:0] d4_1;

input [31:0] d4_2;

input [15:0] d2_1;

input [15:0] d2_2;

input [15:0] d2_3;

input [15:0] d2_4;

input clk;

input reset;

output mis8;

output mis4_1;

output mis4_2;


output mis2_1;

output mis2_2;

output mis2_3;

output mis2_4;

output pout8;

output pout4_1;

output pout4_2;

output pout2_1;

output pout2_2;

output pout2_3;

output pout2_4;

wire [63:0] data8;

wire [63:0] d8;

wire [31:0] d4_1;

wire [31:0] d4_2;

wire [15:0] d2_1;

wire [15:0] d2_2;

wire [15:0] d2_3;

wire [15:0] d2_4;

wire clk;

wire reset;

wire mis8;

wire mis4_1;

wire mis4_2;

wire mis2_1;

wire mis2_2;

wire mis2_3;

wire mis2_4;

wire pout8;

wire pout4_1;

wire pout4_2;

wire pout2_1;

wire pout2_2;

wire pout2_3;

wire pout2_4;

wire [63:0] data8_temp;

assign data8_temp = data8;


assign mis8 = (reset == 1'd1)?1'd0:

(data8_temp == d8)?1'd0:1'd1;

assign pout8 = (reset == 1'd1)?1'd0:

(data8_temp == d8)?1'd1:1'd0;

assign mis4_1 = (reset == 1'd1)?1'd0:

(data8_temp[63:32] == d4_1)?1'd0:1'd1;

assign pout4_1 = (reset == 1'd1)?1'd0:

(data8_temp[63:32] == d4_1)?1'd1:1'd0;

assign mis4_2 = (reset == 1'd1)?1'd0:

(data8_temp[31:0] == d4_2)?1'd0:1'd1;

assign pout4_2 = (reset == 1'd1)?1'd0:

(data8_temp[31:0] == d4_2)?1'd1:1'd0;

assign mis2_1 = (reset == 1'd1)?1'd0:

(data8_temp[63:48] == d2_1)?1'd0:1'd1;

assign pout2_1 = (reset == 1'd1)?1'd0:

(data8_temp[63:48] == d2_1)?1'd1:1'd0;

assign mis2_2 = (reset == 1'd1)?1'd0:

(data8_temp[47:32] == d2_2)?1'd0:1'd1;

assign pout2_2 = (reset == 1'd1)?1'd0:

(data8_temp[47:32] == d2_2)?1'd1:1'd0;

assign mis2_3 = (reset == 1'd1)?1'd0:

(data8_temp[31:16] == d2_3)?1'd0:1'd1;

assign pout2_3 = (reset == 1'd1)?1'd0:

(data8_temp[31:16] == d2_3)?1'd1:1'd0;

assign mis2_4 = (reset == 1'd1)?1'd0:

(data8_temp[15:0] == d2_4)?1'd0:1'd1;

assign pout2_4 = (reset == 1'd1)?1'd0:

(data8_temp[15:0] == d2_4)?1'd1:1'd0;

endmodule
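
The comparator above reduces to seven equality checks on overlapping slices of the incoming 64-bit word, each producing a hit flag (`pout*`) and its complement (`mis*`), with every flag forced low during reset. The Python sketch below models that behaviour; the function name `phase_compare` and the tuple argument layout are inventions of this sketch, not part of the thesis code.

```python
def phase_compare(data8, d8, d4, d2, reset=False):
    """Software model of the phase comparator (illustrative only).

    data8: incoming 64-bit word; d8: 64-bit dictionary word;
    d4: (d4_1, d4_2) = (upper, lower) 32-bit dictionary halves;
    d2: (d2_1, d2_2, d2_3, d2_4) 16-bit quarters, MSB-first.
    Returns (pout, mis) dicts of hit/miss flags; all zero in reset.
    """
    keys = ("p8", "p4_1", "p4_2", "p2_1", "p2_2", "p2_3", "p2_4")
    if reset:
        zero = {k: 0 for k in keys}
        return zero, dict(zero)
    hits = {
        "p8":   data8 == d8,
        "p4_1": ((data8 >> 32) & 0xFFFFFFFF) == d4[0],  # data8[63:32]
        "p4_2": (data8 & 0xFFFFFFFF) == d4[1],          # data8[31:0]
        "p2_1": ((data8 >> 48) & 0xFFFF) == d2[0],      # data8[63:48]
        "p2_2": ((data8 >> 32) & 0xFFFF) == d2[1],      # data8[47:32]
        "p2_3": ((data8 >> 16) & 0xFFFF) == d2[2],      # data8[31:16]
        "p2_4": (data8 & 0xFFFF) == d2[3],              # data8[15:0]
    }
    pout = {k: int(v) for k, v in hits.items()}
    mis = {k: 1 - v for k, v in pout.items()}
    return pout, mis
```

All seven comparisons are independent, which is what lets the hardware evaluate them in parallel in a single cycle.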

Encoder:


module encoder(clk,

reset,

data8,

pout8,

pout4_1,

pout4_2,

pout2_1,

pout2_2,

pout2_3,

pout2_4,

addr8,

addr4_1,

addr4_2,

addr2_1,

addr2_2,

addr2_3,

addr2_4,

dataout);

input clk;

input reset;

input [63:0] data8;

input pout8;

input pout4_1;

input pout4_2;

input pout2_1;

input pout2_2;

input pout2_3;

input pout2_4;

input [3:0] addr8;

input [3:0] addr4_1;

input [3:0] addr4_2;

input [3:0] addr2_1;

input [3:0] addr2_2;

input [3:0] addr2_3;

input [3:0] addr2_4;

output [67:0] dataout;

wire [63:0] data8;

wire clk;

wire reset;


wire pout8;

wire pout4_1;

wire pout4_2;

wire pout2_1;

wire pout2_2;

wire pout2_3;

wire pout2_4;

wire [3:0] addr8;

wire [3:0] addr4_1;

wire [3:0] addr4_2;

wire [3:0] addr2_1;

wire [3:0] addr2_2;

wire [3:0] addr2_3;

wire [3:0] addr2_4;

reg [67:0] dataout;

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dataout <= 68'd0;

end

else if(data8 == 64'd0)

begin

dataout <= 68'd0;

end

else if(pout8 == 1'd1)

begin

dataout <= {4'b0001,addr8};

end

else if(pout4_1 == 1'd1)

begin

dataout <= {data8[31:0],4'b0010,addr4_1};

end

else if(pout4_2 == 1'd1)

begin

dataout <= {data8[63:32],4'b0011,addr4_2};

end

else if(pout2_1 == 1'd1)

begin

dataout <= {data8[47:0],4'b0111,addr2_1};

end


else if(pout2_2 == 1'd1)

begin

dataout <= {data8[63:48],data8[31:0],4'b0110,addr2_2};

end

else if(pout2_3 == 1'd1)

begin

dataout <= {data8[63:32],data8[15:0],4'b0101,addr2_3};

end

else if(pout2_4 == 1'd1)

begin

dataout <= {data8[63:16],4'b0100,addr2_4};

end

else

begin

dataout <= {data8,4'b1000};

end

end

endmodule
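
The encoder is a priority selector: a full 8-byte dictionary hit wins over a 4-byte half hit, which wins over a 2-byte hit, and an unmatched word is emitted raw with code 1000. In every case the output word is the unmatched literal bits, then the 4-bit match code, then the 4-bit dictionary address in the low nibble. The Python sketch below models that packing; the function name and dictionary keys are invented for illustration and are not part of the thesis code.

```python
def encode_word(data8, pout, addrs):
    """Software model of the encoder's priority packing (illustrative).

    pout/addrs are dicts keyed after the Verilog pout*/addr* signals.
    Returns the 68-bit integer the always-block would register.
    """
    def bits(msb, lsb):
        # extract data8[msb:lsb] as an unsigned integer
        return (data8 >> lsb) & ((1 << (msb - lsb + 1)) - 1)

    if data8 == 0:
        return 0
    if pout["p8"]:                       # whole word matched
        return (0b0001 << 4) | addrs["a8"]
    if pout["p4_1"]:                     # upper half matched; keep lower 32 bits
        return (bits(31, 0) << 8) | (0b0010 << 4) | addrs["a4_1"]
    if pout["p4_2"]:                     # lower half matched; keep upper 32 bits
        return (bits(63, 32) << 8) | (0b0011 << 4) | addrs["a4_2"]
    if pout["p2_1"]:                     # quarter [63:48] matched
        return (bits(47, 0) << 8) | (0b0111 << 4) | addrs["a2_1"]
    if pout["p2_2"]:                     # quarter [47:32] matched
        return (((bits(63, 48) << 32) | bits(31, 0)) << 8) | (0b0110 << 4) | addrs["a2_2"]
    if pout["p2_3"]:                     # quarter [31:16] matched
        return (((bits(63, 32) << 16) | bits(15, 0)) << 8) | (0b0101 << 4) | addrs["a2_3"]
    if pout["p2_4"]:                     # quarter [15:0] matched
        return (bits(63, 16) << 8) | (0b0100 << 4) | addrs["a2_4"]
    return (data8 << 4) | 0b1000         # no match: raw word plus code 1000
```

The priority order matters: larger matches remove more literal bits, so checking them first maximizes compression for each word.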

Decompression

Top module:

module top_dcomp(datain,clk,reset,dataout);

input clk;

input reset;

input [67:0] datain;

output [63:0] dataout;

wire clk;

wire reset;

wire [67:0] datain;

wire [63:0] dict8_out; // word read from the 8-byte dictionary

wire [63:0] dict4_out; // word read from the 4-byte dictionaries

wire [63:0] dict2_out; // word read from the 2-byte dictionaries

wire [67:0] data_out;

wire [63:0] dataout;

wire [3:0] addr8;

wire [3:0] addr4;

wire [3:0] addr2;

decoder d1(clk,reset,datain,data_out,addr8,addr4,addr2);

dict8 d2(addr8,dict8_out,clk,reset,dataout);

dict4 d3(addr4,dict4_out,clk,reset,dataout);

dict2 d4(addr2,dict2_out,clk,reset,dataout);

data_gen d5(clk,reset,data_out,dict8_out,dataout);

endmodule

Decoder:

module decoder( clk,

reset,

datain,

dataout,

addr8,

addr4,

addr2);

input clk;

input reset;

input [67:0] datain;

output [67:0] dataout;

output [3:0] addr8;


output [3:0] addr4;

output [3:0] addr2;

wire clk;

wire reset;

wire [67:0] datain;

wire [67:0] dataout;

wire [3:0] addr8;

wire [3:0] addr4;

wire [3:0] addr2;

assign dataout = datain;

assign addr8 = datain[3:0]; // dictionary address field; the 4-bit match code is datain[7:4]

assign addr4 = datain[3:0];

assign addr2 = datain[3:0];

endmodule

2 byte dictionary:

module dict2(addr8,dataout8,clk,reset,datain8);

input [3:0] addr8;

input clk;

input reset;

input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;

wire clk;

wire reset;

wire [63:0] datain8;

wire [63:0] dataout8;

wire [15:0] datain2_1;

wire [15:0] datain2_2;

wire [15:0] datain2_3;

wire [15:0] datain2_4;

reg [15:0] dictionary2_1 [15:0];

reg [15:0] dictionary2_2 [15:0];


reg [15:0] dictionary2_3 [15:0];

reg [15:0] dictionary2_4 [15:0];

reg [3:0] count2_1;

reg [3:0] count2_2;

reg [3:0] count2_3;

reg [3:0] count2_4;

assign datain2_1 = datain8[15:0];

assign datain2_2 = datain8[31:16];

assign datain2_3 = datain8[47:32];

assign datain2_4 = datain8[63:48];

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_1[0] <= 16'd0;

dictionary2_1[1] <= 16'd0;

dictionary2_1[2] <= 16'd0;

dictionary2_1[3] <= 16'd0;

dictionary2_1[4] <= 16'd0;

dictionary2_1[5] <= 16'd0;

dictionary2_1[6] <= 16'd0;

dictionary2_1[7] <= 16'd0;

dictionary2_1[8] <= 16'd0;

dictionary2_1[9] <= 16'd0;

dictionary2_1[10] <= 16'd0;

dictionary2_1[11] <= 16'd0;

dictionary2_1[12] <= 16'd0;

dictionary2_1[13] <= 16'd0;

dictionary2_1[14] <= 16'd0;

dictionary2_1[15] <= 16'd0;

count2_1 <= 4'd0;

end

else if(datain2_1 != dictionary2_1[0]  && datain2_1 != dictionary2_1[1]  &&
        datain2_1 != dictionary2_1[2]  && datain2_1 != dictionary2_1[3]  &&
        datain2_1 != dictionary2_1[4]  && datain2_1 != dictionary2_1[5]  &&
        datain2_1 != dictionary2_1[6]  && datain2_1 != dictionary2_1[7]  &&
        datain2_1 != dictionary2_1[8]  && datain2_1 != dictionary2_1[9]  &&
        datain2_1 != dictionary2_1[10] && datain2_1 != dictionary2_1[11] &&
        datain2_1 != dictionary2_1[12] && datain2_1 != dictionary2_1[13] &&
        datain2_1 != dictionary2_1[14] && datain2_1 != dictionary2_1[15])

begin

dictionary2_1[count2_1] <= datain2_1;

count2_1 <= count2_1 + 1;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_2[0] <= 16'd0;

dictionary2_2[1] <= 16'd0;

dictionary2_2[2] <= 16'd0;

dictionary2_2[3] <= 16'd0;

dictionary2_2[4] <= 16'd0;

dictionary2_2[5] <= 16'd0;

dictionary2_2[6] <= 16'd0;

dictionary2_2[7] <= 16'd0;

dictionary2_2[8] <= 16'd0;

dictionary2_2[9] <= 16'd0;

dictionary2_2[10] <= 16'd0;

dictionary2_2[11] <= 16'd0;

dictionary2_2[12] <= 16'd0;

dictionary2_2[13] <= 16'd0;

dictionary2_2[14] <= 16'd0;

dictionary2_2[15] <= 16'd0;

count2_2 <= 4'd0;

end

else if(datain2_2 != dictionary2_2[0]  && datain2_2 != dictionary2_2[1]  &&
        datain2_2 != dictionary2_2[2]  && datain2_2 != dictionary2_2[3]  &&
        datain2_2 != dictionary2_2[4]  && datain2_2 != dictionary2_2[5]  &&
        datain2_2 != dictionary2_2[6]  && datain2_2 != dictionary2_2[7]  &&
        datain2_2 != dictionary2_2[8]  && datain2_2 != dictionary2_2[9]  &&
        datain2_2 != dictionary2_2[10] && datain2_2 != dictionary2_2[11] &&
        datain2_2 != dictionary2_2[12] && datain2_2 != dictionary2_2[13] &&
        datain2_2 != dictionary2_2[14] && datain2_2 != dictionary2_2[15])

begin

dictionary2_2[count2_2] <= datain2_2;

count2_2 <= count2_2 + 1;


end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary2_3[0] <= 16'd0;

dictionary2_3[1] <= 16'd0;

dictionary2_3[2] <= 16'd0;

dictionary2_3[3] <= 16'd0;

dictionary2_3[4] <= 16'd0;

dictionary2_3[5] <= 16'd0;

dictionary2_3[6] <= 16'd0;

dictionary2_3[7] <= 16'd0;

dictionary2_3[8] <= 16'd0;

dictionary2_3[9] <= 16'd0;

dictionary2_3[10] <= 16'd0;

dictionary2_3[11] <= 16'd0;

dictionary2_3[12] <= 16'd0;

dictionary2_3[13] <= 16'd0;

dictionary2_3[14] <= 16'd0;

dictionary2_3[15] <= 16'd0;

count2_3 <= 4'd0;

end

else if(datain2_3 != dictionary2_3[0]  && datain2_3 != dictionary2_3[1]  &&
        datain2_3 != dictionary2_3[2]  && datain2_3 != dictionary2_3[3]  &&
        datain2_3 != dictionary2_3[4]  && datain2_3 != dictionary2_3[5]  &&
        datain2_3 != dictionary2_3[6]  && datain2_3 != dictionary2_3[7]  &&
        datain2_3 != dictionary2_3[8]  && datain2_3 != dictionary2_3[9]  &&
        datain2_3 != dictionary2_3[10] && datain2_3 != dictionary2_3[11] &&
        datain2_3 != dictionary2_3[12] && datain2_3 != dictionary2_3[13] &&
        datain2_3 != dictionary2_3[14] && datain2_3 != dictionary2_3[15])

begin

dictionary2_3[count2_3] <= datain2_3;

count2_3 <= count2_3 + 1;

end

end

always@(posedge clk)

begin


if(reset == 1'b1)

begin

dictionary2_4[0] <= 16'd0;

dictionary2_4[1] <= 16'd0;

dictionary2_4[2] <= 16'd0;

dictionary2_4[3] <= 16'd0;

dictionary2_4[4] <= 16'd0;

dictionary2_4[5] <= 16'd0;

dictionary2_4[6] <= 16'd0;

dictionary2_4[7] <= 16'd0;

dictionary2_4[8] <= 16'd0;

dictionary2_4[9] <= 16'd0;

dictionary2_4[10] <= 16'd0;

dictionary2_4[11] <= 16'd0;

dictionary2_4[12] <= 16'd0;

dictionary2_4[13] <= 16'd0;

dictionary2_4[14] <= 16'd0;

dictionary2_4[15] <= 16'd0;

count2_4 <= 4'd0;

end

else if(datain2_4 != dictionary2_4[0]  && datain2_4 != dictionary2_4[1]  &&
        datain2_4 != dictionary2_4[2]  && datain2_4 != dictionary2_4[3]  &&
        datain2_4 != dictionary2_4[4]  && datain2_4 != dictionary2_4[5]  &&
        datain2_4 != dictionary2_4[6]  && datain2_4 != dictionary2_4[7]  &&
        datain2_4 != dictionary2_4[8]  && datain2_4 != dictionary2_4[9]  &&
        datain2_4 != dictionary2_4[10] && datain2_4 != dictionary2_4[11] &&
        datain2_4 != dictionary2_4[12] && datain2_4 != dictionary2_4[13] &&
        datain2_4 != dictionary2_4[14] && datain2_4 != dictionary2_4[15])

begin

dictionary2_4[count2_4] <= datain2_4;

count2_4 <= count2_4 + 1;

end

end

assign dataout8 =

{dictionary2_4[addr8],dictionary2_3[addr8],dictionary2_2[addr8],dictionary2_1[addr8]};

endmodule
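
Unlike the compression-side dictionaries, which learn of a miss through the `mis*` inputs, the decompression dictionaries above search all 16 entries themselves and insert a value only when it is absent. Because the same decompressed words drive the updates on both sides, the two sets of dictionaries stay synchronized without the dictionary contents ever being transmitted. A small Python sketch of that update rule (the function name is invented here; this is not the thesis code):

```python
def update_if_absent(entries, count, value):
    """Decompression-side dictionary update (software sketch).

    Searches the whole 16-entry list and inserts `value` at `count`
    only when it is absent, then advances the wrapping counter.
    Note: a value already present anywhere (including the initial
    zeros after reset) is never re-inserted, mirroring the hardware.
    """
    if value not in entries:
        entries[count] = value
        count = (count + 1) % len(entries)
    return count
```

Running the compression-side update (insert on miss) and this rule over the same word stream leaves both dictionaries with identical contents, which is the invariant the decoder relies on.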

4 byte dictionary:


module dict4(addr8,dataout8,clk,reset,datain8);

input [3:0] addr8;

input clk;

input reset;

input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;

wire clk;

wire reset;

wire [63:0] datain8;

wire [63:0] dataout8;

wire [31:0] datain4_1;

wire [31:0] datain4_2;

reg [31:0] dictionary4_1 [15:0];

reg [31:0] dictionary4_2 [15:0];

reg [3:0] count4_1;

reg [3:0] count4_2;

assign datain4_1 = datain8[31:0];

assign datain4_2 = datain8[63:32];

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary4_1[0] <= 32'd0;

dictionary4_1[1] <= 32'd0;

dictionary4_1[2] <= 32'd0;

dictionary4_1[3] <= 32'd0;

dictionary4_1[4] <= 32'd0;

dictionary4_1[5] <= 32'd0;

dictionary4_1[6] <= 32'd0;

dictionary4_1[7] <= 32'd0;

dictionary4_1[8] <= 32'd0;

dictionary4_1[9] <= 32'd0;

dictionary4_1[10] <= 32'd0;

dictionary4_1[11] <= 32'd0;

dictionary4_1[12] <= 32'd0;


dictionary4_1[13] <= 32'd0;

dictionary4_1[14] <= 32'd0;

dictionary4_1[15] <= 32'd0;

count4_1 <= 4'd0;

end

else if(datain4_1 != dictionary4_1[0]  && datain4_1 != dictionary4_1[1]  &&
        datain4_1 != dictionary4_1[2]  && datain4_1 != dictionary4_1[3]  &&
        datain4_1 != dictionary4_1[4]  && datain4_1 != dictionary4_1[5]  &&
        datain4_1 != dictionary4_1[6]  && datain4_1 != dictionary4_1[7]  &&
        datain4_1 != dictionary4_1[8]  && datain4_1 != dictionary4_1[9]  &&
        datain4_1 != dictionary4_1[10] && datain4_1 != dictionary4_1[11] &&
        datain4_1 != dictionary4_1[12] && datain4_1 != dictionary4_1[13] &&
        datain4_1 != dictionary4_1[14] && datain4_1 != dictionary4_1[15])

begin

dictionary4_1[count4_1] <= datain4_1;

count4_1 <= count4_1 + 1;

end

end

always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary4_2[0] <= 32'd0;

dictionary4_2[1] <= 32'd0;

dictionary4_2[2] <= 32'd0;

dictionary4_2[3] <= 32'd0;

dictionary4_2[4] <= 32'd0;

dictionary4_2[5] <= 32'd0;

dictionary4_2[6] <= 32'd0;

dictionary4_2[7] <= 32'd0;

dictionary4_2[8] <= 32'd0;

dictionary4_2[9] <= 32'd0;

dictionary4_2[10] <= 32'd0;

dictionary4_2[11] <= 32'd0;

dictionary4_2[12] <= 32'd0;

dictionary4_2[13] <= 32'd0;

dictionary4_2[14] <= 32'd0;

dictionary4_2[15] <= 32'd0;

count4_2 <= 4'd0;

end


else if(datain4_2 != dictionary4_2[0]  && datain4_2 != dictionary4_2[1]  &&
        datain4_2 != dictionary4_2[2]  && datain4_2 != dictionary4_2[3]  &&
        datain4_2 != dictionary4_2[4]  && datain4_2 != dictionary4_2[5]  &&
        datain4_2 != dictionary4_2[6]  && datain4_2 != dictionary4_2[7]  &&
        datain4_2 != dictionary4_2[8]  && datain4_2 != dictionary4_2[9]  &&
        datain4_2 != dictionary4_2[10] && datain4_2 != dictionary4_2[11] &&
        datain4_2 != dictionary4_2[12] && datain4_2 != dictionary4_2[13] &&
        datain4_2 != dictionary4_2[14] && datain4_2 != dictionary4_2[15])

begin

dictionary4_2[count4_2] <= datain4_2;

count4_2 <= count4_2 + 1;

end

end

assign dataout8 = {dictionary4_2[addr8],dictionary4_1[addr8]}; // upper half from dictionary4_2, lower half from dictionary4_1

endmodule

8 byte dictionary:

module dict8(addr8,dataout8,clk,reset,datain8);

input [3:0] addr8;

input clk;

input reset;

input [63:0] datain8;

output [63:0] dataout8;

wire [3:0] addr8;

wire clk;

wire reset;

wire [63:0] datain8;

wire [63:0] dataout8;

reg [63:0] dictionary8 [15:0];

reg [3:0] count;


always@(posedge clk)

begin

if(reset == 1'b1)

begin

dictionary8[0] <= 64'd0;

dictionary8[1] <= 64'd0;

dictionary8[2] <= 64'd0;

dictionary8[3] <= 64'd0;

dictionary8[4] <= 64'd0;

dictionary8[5] <= 64'd0;

dictionary8[6] <= 64'd0;

dictionary8[7] <= 64'd0;

dictionary8[8] <= 64'd0;

dictionary8[9] <= 64'd0;

dictionary8[10] <= 64'd0;

dictionary8[11] <= 64'd0;

dictionary8[12] <= 64'd0;

dictionary8[13] <= 64'd0;

dictionary8[14] <= 64'd0;

dictionary8[15] <= 64'd0;

count <= 4'd0;

end

else if(datain8 != dictionary8[0]  && datain8 != dictionary8[1]  &&
        datain8 != dictionary8[2]  && datain8 != dictionary8[3]  &&
        datain8 != dictionary8[4]  && datain8 != dictionary8[5]  &&
        datain8 != dictionary8[6]  && datain8 != dictionary8[7]  &&
        datain8 != dictionary8[8]  && datain8 != dictionary8[9]  &&
        datain8 != dictionary8[10] && datain8 != dictionary8[11] &&
        datain8 != dictionary8[12] && datain8 != dictionary8[13] &&
        datain8 != dictionary8[14] && datain8 != dictionary8[15])

begin

dictionary8[count] <= datain8;

count <= count + 1;

end

end

assign dataout8 = dictionary8[addr8];

endmodule

Data generator:


module data_gen( clk,

reset,

datain,

datain8,

dataout);

input clk;

input reset;

input [67:0] datain;

input [63:0] datain8;

output [63:0] dataout;

wire clk;

wire reset;

wire [67:0] datain;

wire [63:0] datain8;

wire [63:0] dataout;

assign dataout = (datain[7:4] == 4'b0001)? datain8:

(datain[7:4] == 4'b0010)? {datain8[63:32],datain[39:8]}:

(datain[7:4] == 4'b0011)? {datain[39:8],datain8[31:0]}:

(datain[7:4] == 4'b0100)? {datain[55:8],datain8[15:0]}:

(datain[7:4] == 4'b0101)? {datain[55:24],datain8[31:16],datain[23:8]}:

(datain[7:4] == 4'b0110)? {datain[55:40],datain8[47:32],datain[39:8]}:

(datain[7:4] == 4'b0111)? {datain8[63:48],datain[55:8]}:

(datain[7:4] == 4'b1000)? {datain[67:4]}:64'd0;

endmodule
