


A High-Performance FPGA-Based Implementation of the LZSS Compression Algorithm

Ivan Shcherbakov, Christian Weis, Norbert Wehn
Microelectronic systems design research group

TU Kaiserslautern, Germany

{shcherbakov, weis, wehn}@eit.uni-kl.de

Abstract—The increasing growth of embedded networking applications has created a demand for high-performance logging systems capable of storing huge amounts of high-bandwidth, typically redundant data. An efficient way of maximizing the logger performance is doing a real-time compression of the logged stream. In this paper we present a flexible high-performance implementation of the LZSS compression algorithm capable of processing up to 50 MB/s on a Virtex-5 FPGA chip. We exploit the independently addressable dual-port block RAMs inside the FPGA chip to achieve an average performance of 2 clock cycles per byte. To make the compressed stream compatible with the ZLib library [1] we encode the LZSS algorithm output using a fixed Huffman table defined by the Deflate specification [2]. We also demonstrate how changing the amount of memory allocated to various internal tables impacts the performance and compression ratio. Finally, we provide a cycle-accurate estimation tool that allows finding a trade-off between FPGA resource utilization, compression ratio and performance for a specific data sample.

I. INTRODUCTION

Compression is a way of representing data so that it requires fewer bits. All compression algorithms can be divided into lossy and lossless ones. Lossy compression algorithms discard certain “less important” parts of the information and are highly tied to the structure of the data. In many applications (e.g. video coding) the trade-off between the quality drop and significant size reduction is more than acceptable.

Lossless algorithms exploit the redundancy and predictability of the compressed information: highly redundant data needs fewer bits to be stored. In the worst case, when no redundancies are found, the “compressed” block will actually be bigger than the uncompressed one. However, in any case, the decompression will restore the original data, bit-by-bit.

Many lossless compression algorithms are widely used in modern servers and workstations for a variety of tasks. Such algorithms, e.g. LZMA [3], provide high compression ratios, but require tens to hundreds of megabytes of RAM and a fast CPU. This suits most use cases (e.g. backups), where compression ratio is more important than speed.

Another use for lossless compression is the rapidly growing world of embedded networking systems. Keeping a log of inter-node communications significantly simplifies profiling and debugging tasks. Compressing the logged stream in real time would relax the size and bandwidth requirements for the underlying storage media. Unlike in workstation/server applications, compression throughput becomes one of the most important constraints.

Modern FPGAs allow building powerful embedded systems on one chip. A typical high-end FPGA contains tens to hundreds of independent dual-port block RAMs (several kilobytes each), one or more built-in CPUs and a lot of reconfigurable logic. The logic operates at lower frequencies than the CPU, but allows exploiting massive algorithmic parallelism. This defines the layout of a typical FPGA-based system-on-chip: a central CPU handling high-level tasks and several accelerators performing highly parallelizable computations. Making a high-performance FPGA-based compressor requires selecting an algorithm capable of exploiting the advantages of the above-mentioned FPGA architecture. Having considered several compression algorithms, we have chosen a subset of the Deflate specification [2] (LZSS [4] + fixed-table Huffman encoding) as it can be efficiently implemented using FPGA logic and block RAMs, keeping the on-chip CPU available for higher-level tasks.

The rest of this paper is organized as follows. Section II overviews work related to LZSS-like algorithms and their hardware implementations. Section III briefly describes the data format. Section IV describes the presented hardware implementation. Section V provides a comparison with a software compressor and shows various trade-offs between speed, compression ratio and memory use. Section VI summarizes the results.

II. RELATED WORK

Since the publication of the LZSS algorithm [4], based on LZ77 [5], there have been many improvements to both algorithmic and implementation aspects. They can be categorized in the following way:

• Further algorithmic improvements that increase the compression ratio at a cost of more operations and memory, e.g. the LZMA algorithm [3] used in the 7-Zip program.

• Algorithm variations (e.g. [6]) simplifying random access to the compressed data.

• Hardware implementations that rely on content-addressable memories [7] and systolic arrays [8], [9].

• Applications of fast hardware decompression for dynamic FPGA reconfiguration [10].

• FPGA/ASIC-based implementations: [11], [12].

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. 978-0-7695-4676-6/12 $26.00 © 2012 IEEE. DOI 10.1109/IPDPSW.2012.58



The presented implementation falls in the last category. We have used the approach presented in [11] (an FSM with several independent memories) and significantly optimized the design performance by decomposing and parallelizing several processes (employing the dual-port block RAM architecture), making use of wide internal data buses and advanced caching/prefetching techniques.

As we are targeting embedded logging applications, we have optimized for compression speed while keeping a feasible compression ratio, taking the minimum ZLib compression level as a reference point. Nevertheless, the memory sizes and the algorithm parameters are generics that can be easily adjusted to increase the compression ratio at a cost of additional clock cycles and/or extra block RAMs.

III. THE DATA FORMAT

Before describing the hardware architecture, we will give an overview of the LZSS algorithm (the ZLib-based implementation, which has minor differences from the original LZSS [4]).

The algorithm consumes a stream of literals (i.e. bytes) and produces a stream of decompressor commands. There are 2 command types: “output 1 literal” and “copy-paste L literals encountered D literals ago”. E.g. compressing the string “snowy snow” will result in 7 commands: 6 describing each byte of “snowy ” and 1 command copying 4 bytes (“snow”) from distance 6.
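The command stream above can be illustrated with a minimal greedy LZSS-style tokenizer in Python. This is only a behavioral sketch under our own naming: the brute-force dictionary scan stands in for the hash-chain search the hardware actually uses, and `MIN_MATCH = 3` mirrors the minimum match length described below.

```python
# Behavioral sketch of LZSS tokenization (not the paper's hardware design):
# emits ("literal", byte) or ("copy", distance, length) commands.
MIN_MATCH = 3  # shorter matches are emitted as literals

def lzss_commands(data: bytes):
    commands = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the already-processed prefix (the "dictionary").
        for start in range(pos):
            length = 0
            while (pos + length < len(data)
                   and start + length < pos          # stay inside the dictionary
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - start
        if best_len >= MIN_MATCH:
            commands.append(("copy", best_dist, best_len))
            pos += best_len
        else:
            commands.append(("literal", data[pos]))
            pos += 1
    return commands
```

For `b"snowy snow"` this produces the 7 commands of the example: 6 literals for “snowy ” followed by `("copy", 6, 4)`.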

To detect that a string has been encountered in the past, the compressor has to store the last N bytes of the input stream. N is referred to as the dictionary size or sliding window size.

On the bit level, every command has 2 fields: D (log2 N bits) and L (8 bits). If D is 0, the command is “output byte” and L contains the byte. Otherwise, D contains the copying distance and L contains the copying length minus 3. If the length is less than 3, normal “output byte” commands are emitted instead.
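The decoding side of this D/L format can be sketched in a few lines of Python (a hypothetical helper of ours, not the paper's code; it consumes (D, L) pairs directly rather than the Huffman-encoded bit stream):

```python
def lzss_decode(pairs):
    """Apply a sequence of (D, L) decompressor commands.
    D == 0: L holds a literal byte.
    D > 0:  copy L + 3 bytes starting D bytes back; copying
            byte-by-byte correctly handles overlapping runs."""
    out = bytearray()
    for d, l in pairs:
        if d == 0:
            out.append(l)
        else:
            for _ in range(l + 3):
                out.append(out[-d])
    return bytes(out)
```

E.g. the “snowy snow” example decodes from 6 literal commands plus (6, 1), since the copy length 4 is stored as L = 4 − 3 = 1.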

IV. HARDWARE ARCHITECTURE

The LZSS compressor uses handshake interfaces for both input and output streams. The compressor consumes 32-bit words (LSBF/MSBF format can be selected) and produces D/L pairs (see Section III) used by a fixed-table Huffman coder that produces a stream of packed 32-bit words.

The use of stream interfaces allows connecting to high-performance interfaces (e.g. LocalLink [13]) and compressing real-time streaming data on-the-fly without separate buffering and compressing stages.

The LZSS compressor consists of the main finite state machine, 5 independently addressable dual-port memories and several auxiliary FSMs. Figure 1 illustrates the overall structure.

The main task of the compressor is finding previous occurrences of a string (matching). To determine if a string S has been encountered before, the compressor computes a hash value from its first 3 bytes and looks through the list of all strings in the dictionary with the same hash value.

Fig. 1. Overall structure of the LZSS compressor (main FSM, filling logic, comparer, and the head, next, lookahead buffer, hash cache and dictionary memories)

Matching requires comparing the front of the uncompressed stream with several offsets inside the dictionary to find the longest match. To speed up the comparison we have placed the compared data into 2 independently addressable ring buffers:

• Lookahead buffer. Contains the front of the input stream (up to 512 bytes).

• Dictionary (a.k.a. sliding window). Contains the last N bytes of the input stream that have just been processed.

Having the compared memories in independent block RAMs allows performing one comparison iteration every clock cycle. Furthermore, the data bus width for both memories is 32 bits, which allows comparing 1 to 4 bytes during the first clock cycle and exactly 4 bytes during each following one. E.g. comparing two 50-byte strings would take not more than ⌈(50 − 1)/4⌉ + 1 = 14 clock cycles instead of 100 if both buffers resided in a single byte-addressed memory [11].
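The cycle count above generalizes to any string length; a small Python helper (our notation, assuming the worst-case alignment described in the text) makes the arithmetic explicit:

```python
import math

def compare_cycles(n_bytes, bus_bytes=4):
    """Worst-case clock cycles to compare two n-byte strings when the
    lookahead buffer and the dictionary sit in separate block RAMs with
    a 32-bit (4-byte) bus: 1 to 4 bytes in the first cycle (alignment),
    then exactly 4 bytes in each following cycle."""
    return math.ceil((n_bytes - 1) / bus_bytes) + 1
```

`compare_cycles(50)` gives ⌈49/4⌉ + 1 = 14, versus 100 cycles with a single byte-addressed memory as in [11].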

Moreover, both ring buffers reside in dual-port block RAMs and are filled in the background, requiring no extra clock cycles of the main FSM. If hash caching is enabled, hash values for every offset of the source stream are computed during background filling and stored in a separate memory.

The hash table structure is similar to that of ZLib. The following independently addressable tables are maintained:

• Head table. For each value of the hash function it contains the offset in the dictionary buffer of the last string having it.

• Next table. For each offset of the dictionary it contains the relative offset of the previous string having the same hash value.

Matching means finding the longest string in the dictionary that is equivalent to the beginning of the lookahead buffer. E.g. if the dictionary contains “1234 123” and the lookahead buffer contains “123456”, then “1234” and “123” will be the candidates and “1234” will be the longest one. If a hash collision occurs (e.g. hash(“123”) = hash(“34 ”)), then “34 123” will also be considered a candidate. The addresses of the candidate strings are obtained by reading the head/next tables.
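The head/next chain traversal can be modeled in software as follows. This is a behavioral sketch only: Python dictionaries stand in for the fixed-size block RAM tables, absolute offsets are stored instead of the relative ones described later, and the deliberately weak toy hash is ours (the real hash function is a compile-time parameter of the design):

```python
def hash3(b):
    # Toy 3-byte hash, deliberately small so collisions can occur.
    return (b[0] * 33 + b[1] * 7 + b[2]) % 16

def build_chains(dictionary):
    head, nxt = {}, {}          # head: hash -> latest offset; nxt: offset -> older offset
    for i in range(len(dictionary) - 2):
        h = hash3(dictionary[i:i + 3])
        if h in head:
            nxt[i] = head[h]    # chain to the previous string with this hash
        head[h] = i             # head now points at the newest occurrence
    return head, nxt

def longest_match(dictionary, lookahead, head, nxt):
    best_len, best_off = 0, -1
    i = head.get(hash3(lookahead[:3]), -1)
    while i >= 0:               # walk the candidate chain, newest first
        length = 0
        while (i + length < len(dictionary) and length < len(lookahead)
               and dictionary[i + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_len, best_off = length, i
        i = nxt.get(i, -1)
    return best_off, best_len
```

With the dictionary `b"1234 123"` and lookahead `b"123456"`, the chain for hash(“123”) contains offsets 5 and 0, and the candidate at offset 0 wins with match length 4 (“1234”), as in the example above.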

Once the matching is done, a decompressor command is produced and the lookahead buffer/dictionary pointers are moved. An optional hash table update takes place afterwards.



To illustrate how the high compression performance is achieved, we will describe a typical state flow of the main FSM:

• Initially, the compressor waits until the lookahead buffer contains at least 262 bytes and the hash value of its front is available. As the filling runs in the background, this state typically takes only 1 clock cycle. The hash value from the lookahead buffer is routed to the head table address.

• As soon as the data is available, the matching preparation occurs. The value from the head table is used as the first string address. It is also routed to the next table to get the address of the next string with the same hash value. The head and next tables are updated in this cycle to allow finding the currently processed string in the future.

• At the next clock cycle the matching begins. The next table is read in parallel, so the bottleneck here is the actual comparison of the strings (accelerated by the 32-bit buses).

• When the matching is complete, the output is produced. The values for D and L, depending on the matching results, are output to the compressed stream interface. If the sink requests a delay, the main FSM is stalled.

• If a full hash table update can be performed (decided based on the match length), the FSM updates the head/next tables for every byte of the matched string. Every update iteration takes 1 clock cycle.

• When the hash table update is done (or was skipped), the FSM enters the initial waiting stage.

Depending on the properties of the input data, 30-85% of the matching operations will be unsuccessful and end up producing “output literal” commands, requiring at least 3 clock cycles (+ matching). We have implemented a special hash prefetching mechanism accelerating this scenario. A separate FSM is active during the match preparation and matching. It buffers the data from the lookahead buffer and the hash cache and uses the available clock cycles to prefetch (or precompute) the hash value at offset 1 in the lookahead buffer. If no match was found (i.e. the lookahead buffer is going to be advanced by 1 byte), the prefetched value is routed to the head table address and the FSM goes directly to the match preparation state, skipping the waiting state and requiring only 2 “non-matching” cycles instead of 3.

The concept of head/next tables was introduced in ZLib [1] and mentioned in [11]. Originally, both the head and next tables contain absolute string offsets inside the dictionary. Every 32K (dictionary size) bytes, ZLib rotates the dictionary: the last 32K bytes are moved up (a total of 64 Kbytes is allocated) and each head/next value is adjusted accordingly (the ones pointing outside the buffer are zeroed). The time overhead is negligible in the slow software implementation; however, it would consume 25-75% of the clock cycles (depending on hash/dictionary sizes) in the fast hardware implementation. We have made 3 improvements that reduce the clock cycle overhead to 1-2%:

• The next table contains relative addresses. This requires 1 extra adder to compute the absolute address, but eliminates the need to rotate the next table.

• Every record inside the head table contains k extra “generation bits”, as if the dictionary were 2^k times bigger. The real dictionary size is still used to detect whether a record points outside the dictionary, but the rotation has to be performed 2^k times more rarely.

• The head table memory is internally split into M sub-memories, each having the size of a single block RAM inside the FPGA. The rotation happens in parallel and requires M times fewer cycles.
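The generation-bit idea above can be illustrated with a short Python sketch (hypothetical names and example values of ours; in hardware this is a comparison of the extended position counters rather than a modulo operation):

```python
K = 2                           # generation bits
N = 4096                        # real dictionary (sliding window) size
COUNTER_RANGE = (2 ** K) * N    # positions counted as if the dictionary were 2^k bigger

def entry_is_valid(entry_pos, current_pos):
    """A head/next entry is usable only while the string it points to is
    still inside the last N bytes; older entries are treated as pointing
    outside the dictionary."""
    age = (current_pos - entry_pos) % COUNTER_RANGE
    return 0 < age <= N

# Wholesale rotation is now needed only every 2^k * N bytes, when the
# extended counter wraps, instead of every N bytes.
```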

The output interface of the LZSS compressor is connected to a fixed-table pipelined Huffman encoder that produces a ZLib-compatible stream. As the table is fixed, no additional clock cycles or memories are required to build it and the encoder does not introduce any delays into the stream produced by the LZSS compressor. The cost of the high performance is less efficient compression compared to dynamic Huffman coders; however, this can be compensated by increasing the LZSS compression level.

Our implementation is generic. Various compile-time parameters can be customized to find an optimal trade-off between FPGA resource utilization, compression ratio and speed. Dictionary size, hash bit count, the exact hash function, generation bit count and the head table division factor can be customized at compile time. Run-time parameters (e.g. the matching iteration limit) can also be changed. We have provided an interactive estimation tool that compresses a given file using several presets and produces reports regarding the block RAM amount, compression ratio and clock cycle usage.

To maintain high design modularity and decouple the architecture from the low-level details (e.g. hash function, data types and bus sizes), we have used the policy-class-based design approach and the THDL++ language [14] that extends VHDL semantics with object-oriented features. THDL++ code can be compiled to VHDL-93 using the freely available compiler and IDE [15].

V. RESULTS

In this section we evaluate the LZSS compressor design by comparing its performance to a software implementation, provide the FPGA utilization information and show the impact of various design settings on the design size and performance.

Our test system is the ML-507 development board based on a Virtex-5 FPGA. We have developed a testbench that receives a data block from the PC over Ethernet, stores it in the DDR2 memory, compresses it and sends the result back. The compression time includes the DMA [13] setup times, but excludes the Ethernet transmission time.

We have compared a software implementation (ZLib 1.5.2 [1] running on the PowerPC processor inside the XC5VFX70T-FF1136-1 FPGA) and the hardware implementation with parameters optimized for speed (4KB dictionary, 15-bit hash). The clock frequency of the PowerPC was 400 MHz while the compressor was connected to a 100 MHz clock (post-route analysis reported a maximum clock frequency of 100.5 MHz). We have used 2 data sets: a fragment of a Wikipedia text snapshot [16] (referred to as Wiki) and sample



data obtained from an automotive CAN logger (referred to as X2E). We have run the tests with 10MB and 50MB fragments to factor out the DMA setup time. Table I shows the performance comparison (parameters, input and output streams were equal).

TABLE I
PERFORMANCE EVALUATION

Data sample   SW speed (MB/s)   HW speed (MB/s)   Speedup   Compression ratio
Wiki 50MB     3.15              47.57             15.06x    1.69
Wiki 10MB     3.11              44.05             14.15x    1.68
X2E 50MB      2.55              51.86             20.32x    1.7
X2E 10MB      2.54              47.84             18.78x    1.7

In addition to the 15-20x performance increase, the use of the DMA engine to transfer the data between the DRAM and the hardware compressor allows running high-level tasks on the CPU in parallel with the compression.

Table II shows that the FPGA utilization in terms of lookup tables (LZSS + fixed-table Huffman) remains insignificant and almost the same (5.2 ± 0.6% of the Virtex-5 FPGA) for all reasonable dictionary and hash sizes.

TABLE II
FPGA UTILIZATION

Hash size   Dictionary size   LUTs    Registers
15 bits     32KB              2620    864
12 bits     8KB               2263    817
9 bits      4KB               2105    775
Available in XC5VFX70T FPGA   44800   44800

To simplify design space exploration we have developed a software estimator tool [17]. The tool consists of a flexible cycle-accurate C++ model and a C# front-end. The C++ model accepts various design parameters (e.g. window size), compresses reference data blocks and produces various cycle-accurate statistics. The C# front-end allows constructing series of parameter sets (e.g. iterating an arbitrary parameter over a given range), iteratively runs the C++ model and visualizes the obtained results.

The rest of this section describes several trade-offs explored by running a 100MB Wikipedia snapshot [16] through the software estimator. First of all, increasing the dictionary size improves the compression ratio (Fig. 2). Moreover, the improvement is more significant for larger hash sizes.

Fig. 2. Compressed size in MB of a 100MB Wiki fragment [16] (dictionary sizes 1K-32K; hash sizes 8, 9, 12 and 15 bits; compressed sizes range from 52.3 MB to 73.2 MB)

Increasing the dictionary size slightly slows down the compression. This can be compensated by increasing the hash size (Fig. 3), thus lowering the hash collision probability and reducing the amount of matching iterations. However, increasing the hash size raises the memory requirements exponentially: the head table requires 2^H × (⌈log2 D⌉ + G) bits, where H is the hash bit count, D is the dictionary size and G is the amount of generation bits.
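The exponential growth of the head table is easy to see with a one-line helper (our notation, evaluating the formula above):

```python
import math

def head_table_bits(hash_bits, dict_size, gen_bits):
    """Head table storage: 2^H entries, each holding a dictionary offset
    (ceil(log2 D) bits) plus G generation bits."""
    return (2 ** hash_bits) * (math.ceil(math.log2(dict_size)) + gen_bits)

# Each extra hash bit doubles the table. E.g. a 15-bit hash with a 32KB
# dictionary and 2 generation bits needs 2^15 * (15 + 2) = 557056 bits,
# while a 9-bit hash with a 4KB dictionary and no generation bits needs
# only 2^9 * 12 = 6144 bits.
```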

Fig. 3. Compression speed (MB/s) for a 100MB Wiki fragment [16] (hash sizes 10-15 bits; dictionary sizes 2K-32K; speeds range from 35.6 MB/s to 49.3 MB/s)

Another way of improving the compression efficiency is adjusting the algorithm parameters (e.g. the amount of matching attempts before giving up). This can improve the compression by 20% at a cost of an 82% performance decrease (Fig. 4).

Fig. 4. Compressed size and speed for a 100MB Wiki fragment [16] for min/max compression levels and 2 hash size options (9 and 15 bits; dictionary sizes 1K-32K; sizes range from 45 MB to 73 MB, speeds from 8 MB/s to 49 MB/s)

As the other hardware implementations [11], [12] do not provide exact performance results, we have analyzed the impact of the 3 main optimization techniques compared to the design described in [11] by temporarily disabling them and measuring the performance impact. Table III summarizes the results.

TABLE III
COMPRESSION SPEED FOR A 100MB WIKI FRAGMENT WITHOUT OPTIMIZATIONS

Configuration                              4KB window   32KB window
A) Original (15-bit hash; 32-bit data)     49.0 MB/s    46.2 MB/s
B) 8-bit data bus as in [11]               30.3 MB/s    25.9 MB/s
C) Disabled hash prefetching               45.2 MB/s    45.0 MB/s
D) Reduced “generation bits” to 1          11.9 MB/s    33.8 MB/s
Disabled all 3 optimizations over [11]     10.2 MB/s    21.2 MB/s

The most efficient optimization for small window sizes is the introduction of “generation bits”, as using k generation bits makes the next table rotation occur 2^k times more rarely (if k is 1, rotation happens every D bytes, where D is the dictionary size). Using wide data buses provides a 63-78% performance increase and hash prefetching increases the performance by an additional 8%. The overall performance increase due to the



described optimizations is 2.2x-4.8x depending on the window size.

As an indirect metric of LZSS compressor efficiency, we have measured the amount of clock cycles spent on actually comparing the data from the dictionary with the lookahead buffer (compared to the clock cycles spent on updating hash tables, computing read addresses, etc.). Figure 5 shows the state distribution for the 100MB Wiki fragment with a 32KB dictionary and 15-bit hash.

Fig. 5. Time spent on different operations (100MB Wiki fragment): finding match 68.5%, updating hash table 11.6%, producing output 11.0%, waiting for data 8.4%, rotating hash 0.3%, fetching data 0.2%

Most of the time (68.5%) is spent on reading and comparing the data (up to 4 bytes per cycle from each of the dictionary and the lookahead buffer). Producing the output and prefetching the next hash value in parallel takes 11% of the time. Another 11.6% of the time is spent on inserting every byte of a short match (up to 4 bytes) into the hash table. Finally, 8.4% of the time is spent waiting for the head table to be read when the prefetched hash value is not useful (i.e. when a valid match is found and several bytes are skipped).

VI. CONCLUSION

In this paper we have presented a high-performance flexible implementation of the LZSS algorithm on a Virtex-5 FPGA. We have exploited the independently addressable dual-port block RAMs and performed several specific FSM and data structure optimizations, resulting in a 15-20x performance increase compared to the optimized software implementation [1]. The compressor design is flexible and allows tuning various parameters to achieve trade-offs between speed, compression ratio and block RAM utilization. An estimation tool available online [17] allows performing design space exploration and finding optimal parameters based on real data samples. We have verified the quality of our design by compressing more than 1 TB of data on the FPGA and comparing the results to the software reference model.

REFERENCES

[1] ZLib compression library, Sep. 2011. [Online]. Available: http://zlib.net/
[2] N. W. Group, “Deflate compressed data format specification,” May 1996. [Online]. Available: http://www.gzip.org/zlib/rfc1951.pdf
[3] LZMA SDK, Sep. 2011. [Online]. Available: http://www.7-zip.org/sdk.html
[4] LZSS algorithm, Sep. 2011. [Online]. Available: http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/lz/lzss.html
[5] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[6] S. Kreft and G. Navarro, “LZ77-like compression with fast random access,” in Data Compression Conference (DCC), March 2010, pp. 239-248.
[7] P. Rauschert, Y. Klimets, J. Velten, and A. Kummert, “Very fast gzip compression by means of content addressable memories,” in TENCON 2004, IEEE Region 10 Conference, vol. D, Nov. 2004, pp. 391-394.
[8] J.-M. Chen and C.-H. Wei, “VLSI design for high-speed LZ-based data compression,” IEE Proceedings - Circuits, Devices and Systems, vol. 146, no. 5, pp. 268-278, Oct. 1999.
[9] B. Jung and W. Burleson, “A VLSI systolic array architecture for Lempel-Ziv-based data compression,” in IEEE International Symposium on Circuits and Systems (ISCAS '94), vol. 3, May 1994, pp. 65-68.
[10] M. Huebner, M. Ullmann, F. Weissel, and J. Becker, “Real-time configuration code decompression for dynamic FPGA self-reconfiguration,” in 18th International Parallel and Distributed Processing Symposium, April 2004, p. 138.
[11] S. Rigler, W. Bishop, and A. Kennings, “FPGA-based lossless data compression using Huffman and LZ77 algorithms,” in Canadian Conference on Electrical and Computer Engineering (CCECE 2007), April 2007, pp. 1235-1238.
[12] GZIP compression/GUNZIP decompression core, Sep. 2011. [Online]. Available: http://www.cebatech.com/products/altracores/gzip-compression.html
[13] LocalLink interface, Sep. 2011. [Online]. Available: http://www.xilinx.com/products/intellectual-property/LocalLink_UserInterface.htm
[14] I. Shcherbakov, C. Weis, and N. Wehn, “Bringing C++ productivity to VHDL world: from language definition to a case study,” in Forum on Specification & Design Languages, Sept. 2011, pp. 76-82.
[15] VisualHDL website, Sep. 2011. [Online]. Available: http://visualhdl.sysprogs.org/
[16] Large text compression benchmark, Sep. 2011. [Online]. Available: http://mattmahoney.net/dc/text.html
[17] Compression performance analyzer, Sep. 2011. [Online]. Available: http://visualhdl.sysprogs.org/LZSS
