Parallel Hashing,

Compression and Encryption

with OpenCL under OS X

Vasileios Xouris

Master of Science

Computer Science

School of Informatics

University of Edinburgh

2010

Abstract

In this dissertation we examine the efficiency of GPUs with a limited number of stream processors (up to 32), found in desktops and laptops, when executing algorithms such as hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78). The implementation uses the OpenCL framework under OS X. The graphics cards tested were the NVIDIA GeForce 9400m and GeForce 9600m GT. We identified an efficient block size for each algorithm that yields optimal GPU performance. The results show that encryption and hashing algorithms can be executed on these GPUs very efficiently, replacing or assisting CPU computations. We achieved a throughput of 159 MB/s for Salsa20, 107.5 MB/s for MD5 and 123.5 MB/s for SHA1. Compression showed a reduced compression ratio due to GPU memory limitations and reduced speed due to divergent code paths. Combining encryption and compression on the GPU can improve execution times by reducing the latency caused by data transfers between the CPU and the GPU. In general, a GPU with 32 stream processors provides enough computational power to replace the CPU in the execution of data-parallel, computation-intensive algorithms.

Acknowledgements

I would like to thank my supervisor, Paul Anderson, for his invaluable help and guidance. I would also like to thank Dr. Zhang Le for his helpful remarks.

I would like to thank my family, who always support me in everything I do.

Finally, I would like to thank Stefania for being patient and supportive during this year.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Xouris Vasileios)

Table of Contents

Chapter 1 Introduction
Chapter 2 GPU and OpenCL
2.1 GPU architecture
2.2 Open Computing Language (OpenCL)
2.2.1 Memory model
2.2.2 Memory access patterns
2.2.3 OpenCL execution model
Chapter 3 Encryption on GPU
3.1 Background
3.2 GPU advantages and disadvantages
3.3 Relevant work
3.4 Implementation of Salsa20 & Results
Chapter 4 Hashing on GPU
4.1 Background
4.2 GPU advantages and disadvantages
4.3 Relevant work
4.4 Implementation of MD5 and SHA1 & Results
Chapter 5 Compression on GPU
5.1 Background
5.2 GPU advantages and disadvantages
5.3 Relevant Work
5.4 Implementation of LZ78 & Results
Chapter 6 Putting it all together
Chapter 7 Discussion
7.1 Project difficulties
7.2 Future Work
Chapter 8 Conclusion
Bibliography

Chapter 1 Introduction

During the last few years, there has been a lot of research focused on efficient implementations of well-known algorithms optimized for execution on Graphics Processing Units (GPUs). GPUs offer an architecture that can exploit data parallelism very effectively. A mid-range GPU device can have around 64 stream processors, a number that offers considerable computational power. Entry-level and mid-range GPUs can be found in the laptops and desktops used every day. Of course, there are more specialized, high-end GPUs that offer a much larger number of stream processors and enormous computational power. In this dissertation, we investigate whether the GPUs found in desktops and laptops can efficiently execute computation-heavy operations such as hashing, encryption and compression. Until now, every published work related to these operations has used powerful GPUs with hundreds of stream processors.

The motivation for this dissertation was recent research on a fast and secure backup system for Mac laptops [1]. The main idea is to use the GPU for the computations involved in a backup system, such as data hashing, encryption and compression. We implement specific algorithms for each field and examine whether we can obtain a speedup over a CPU implementation, or at least execution times that are acceptable and can assist the CPU where possible. For testing, entry-level and mid-range GPUs with up to 32 stream processors are used, a number much smaller than the hundreds of stream processors found in high-end GPUs; GPUs of this kind are found in most laptops. The framework used for our implementations is the OpenCL (Open Computing Language) framework [16]. The advantage of OpenCL is that it gives the programmer the ability to control all available processing units in a system, including CPUs and GPUs.

This dissertation is organized as follows. We start with a general background section on GPU architecture and a brief description of the OpenCL framework, its capabilities and restrictions. Then we examine the operations mentioned above (hashing, encryption and compression) one by one. For each of them, a brief background and relevant work on GPU implementations is given. We also discuss how each one fits the GPU architecture and mention its advantages and disadvantages. An implementation and results section with the outputs of our research is also provided for each case. After looking at each operation in isolation, we present some conclusions from the combined execution of encryption and compression on the GPU. The final chapters contain a discussion of the difficulties that we faced during our research and implementation, and some ideas for future work.

Chapter 2 GPU and OpenCL

2.1 GPU architecture

Graphics Processing Units (GPUs) are specialized processors originally designed to render 3-dimensional graphics. The main difference between a CPU core and a GPU is that a CPU is designed to execute a single stream of instructions as fast as it can, while a GPU executes the same stream of instructions over multiple data in parallel. GPUs contain a number of streaming multiprocessors (SMs); each SM contains 8 stream processor cores (SPs), 2 special function units, an instruction cache, a constant cache, a multithreaded instruction unit and a shared memory. GPUs have a parallel throughput architecture that allows many threads to execute concurrently. They are designed to handle the complex computations of computer graphics quickly and efficiently, and they can operate on vectors of data very fast. Because of this, programmers started to use them to execute more general computation-intensive algorithms by taking advantage of data parallelism. With the introduction of frameworks such as OpenCL and CUDA, the development of GPU versions of general algorithms became easier.

Until recently, the main problem of general-purpose computing on GPUs was that only floating-point arithmetic was supported inside pixel shaders. Fortunately, with the introduction of NVIDIA's G80 architecture, integer data types and bitwise operations are now available [21].

Page 10: Parallel Hashing,

4

Figure 2.1.1 - CPU versus GPU design (source: [4])

In figure 2.1.1 we can see why GPUs are so powerful. GPUs sacrifice sophisticated control flow in order to fit many stream processors on the chip. Their caches are also much smaller, because GPUs hide memory latency by performing calculations while waiting for memory accesses instead of relying on large cache memories.

2.2 Open Computing Language (OpenCL)

OpenCL was created originally by Apple Inc. and is now developed by the Khronos Group. OpenCL is a framework that lets applications execute code on GPU devices. In this section, we discuss how OpenCL maps to the GPU architecture.

In OpenCL, a data-parallel routine (kernel) is written in a specific language similar to C99. To create parallelism, OpenCL divides the total amount of work into workgroups, and each workgroup is further divided into work items (threads). Workgroups are executed on SMs, and each work item is executed by an SP. The total amount of work, called the N-D Range in the OpenCL world, is a collection of workgroups that will be executed in parallel. The distribution of workgroups across the available SMs is handled dynamically by OpenCL itself. Threads are further organized by the SM into groups of 32 called warps, and all threads within a warp are executed in parallel. When a warp is delayed for some reason, another warp is selected for execution in order to hide latency. Because of the SIMT (Single Instruction, Multiple Thread) nature of the SPs, all threads within a warp must execute the same instruction in order to take full advantage of parallelism [4].

GPUs hide memory latency by switching between groups of threads. Thousands of threads are ready for execution at any time, and every time a group of threads needs to read data from memory, another group immediately takes its place. Unlike CPUs, where switching between threads costs hundreds of clock cycles, on the GPU switching costs essentially nothing: GPU threads are very lightweight.

2.2.1 Memory model

There are several different memory spaces in the OpenCL architecture; a diagram of their locations can be found in figure 2.2.1.1. The main and biggest memory in the GPU architecture is the global memory, which is off-chip and ranges from 128 MB up to several GB on high-end GPUs. Accessing global memory takes 400 to 600 cycles, which is why we must be careful when accessing it. Global memory can be accessed by all work items of all workgroups. A region of global memory is reserved as constant read-only memory and is called constant memory.

In contrast, OpenCL's local memory (referred to as shared memory in the CUDA world) is on-chip, which makes it extremely fast: accessing it usually takes 4 to 6 cycles. However, the size of shared memory is very small, usually 16 KB, so it should be used to store data that are frequently read and updated. Shared memory can be accessed by all work items of a workgroup, so it can also be used for communication between work items of the same workgroup. This feature is ideal when data needs to be shared among work items.

Another useful memory space is the read-only constant cache, which is usually 64 KB and located on-chip. When many threads within a workgroup read the same constant-cache address, the read takes just one transaction; reads of different addresses are serialized. This memory space speeds up reads from constant memory by caching frequently used data. A similar cache, the texture cache, is also available and is used to speed up reads of image objects.

Finally, private memory (registers) is the fastest memory and is distributed privately among the work items of a workgroup by the SM. The total register file on each multiprocessor is limited, between 8192 and 16384 32-bit registers (32 KB to 64 KB), and is partitioned among threads. If a workgroup needs more registers than are available, performance suffers from a problem known as "register pressure". Registers are the best place to store small amounts of frequently used data [12].

Figure 2.2.1.1 - The different memory spaces of GPU (source: [4])

2.2.2 Memory access patterns

The way a group of threads accesses global GPU memory is very important. As mentioned before, each global memory transaction can take 400 to 600 cycles, so it is important to group the memory transactions requested by different work items. GPU devices can read 4, 8 or 16 bytes in a single transaction, with the restriction that the data must be aligned to a multiple of the element size being read: data of type X must be stored at an address that is a multiple of sizeof(X) [4].

Half warps (groups of 16 threads) that execute in parallel can be programmed to read global memory in a coalesced way. This happens when all threads of the half warp access different elements in an aligned segment of global memory (4-, 8- or 16-byte words), which can result in a single 64-byte transaction, a single 128-byte transaction or two 128-byte transactions. For NVIDIA GPU devices of compute capability 1.0 or 1.1, the accesses must also be ordered: the first thread of the half warp must access the first element of the segment, the second thread the second element, and so on. For GPU devices of compute capability 1.2 or higher this restriction does not apply; threads within a half warp can access addresses within a segment in any order and still produce a single transaction.

Accessing shared memory requires slightly different behavior in order to achieve high bandwidth. Shared memory is split into 16 memory banks, and to achieve a single transaction each thread of a half warp (a group of 16 threads) must access a different memory bank. If two or more threads of a half warp request a transaction with the same memory bank, the accesses take place sequentially. Only when all threads of a half warp read the same word do we get a broadcast that takes place in just one transaction.

At this point we should note that the graphics cards used for this dissertation (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) have a compute capability lower than 1.2.

2.2.3 OpenCL execution model

The OpenCL framework is responsible for the optimal execution of a data-parallel algorithm. The total amount of work to be executed is called the NDRange in OpenCL terminology. The NDRange is a grid of thread blocks (workgroups), and each workgroup contains a number of work items (threads) which are executed in parallel. The OpenCL framework discovers how many SMs are available on the current GPU and assigns workgroups to all available SMs, where they execute in turns, so algorithms can scale to a large number of SMs without problems.

The NDRange should be large enough, because the bigger it is, the easier it becomes to hide memory latency. Each SM executes one warp at a time in parallel, and all workgroups are divided into warps for execution. To keep track of different workgroups and work items during execution, each workgroup has a unique group ID, and each work item has:

- a unique local ID, which distinguishes the current work item from the other work items of the same workgroup;
- a unique global ID, which distinguishes the current work item from all other work items in the NDRange.

Warps are formed from threads with consecutive local and global IDs. A representation of the NDRange appears in figure 2.2.3.1.

Figure 2.2.3.1 - Representation of the NDRange (grid) of OpenCL (source: [4])

Chapter 3 Encryption on GPU

Traditionally, since the appearance of General-Purpose computing on Graphics Processing Units (GPGPU), GPUs were mostly used for algorithms dominated by floating-point computations. Until recently there was no integer support on GPUs, which made encryption algorithms very bad candidates for GPU execution, since these algorithms consist of complex operations on integer data types. In the last few years this has ceased to be a problem: with the introduction of the G80 architecture, encryption algorithms were ready to take a "crash test" on GPUs [21].

3.1 Background

Encryption algorithms are used when a message must be transferred through an unsafe communication channel. The output of the encryption process, usually the same size as the input, is called the ciphertext. There are two kinds of encryption: symmetric and asymmetric. In symmetric encryption, a secret key possessed by both communicating parties is used for both encryption and decryption. In asymmetric encryption, each user possesses a secret key and a public key: if user A wants to send a message to user B, A encrypts it with B's public key, and B decrypts it with B's secret key, which only B knows.

In general, encryption algorithms break the original message into blocks of equal size and process each block through a function that applies bitwise operations to it; this function is usually repeated for several rounds on each block. The problem is that if each block is encrypted independently with the same key, the ciphertext of a given block will always be the same, and this can be a serious security problem: it can lead to replay attacks, where someone reuses a captured encrypted message to claim to be someone they are not, or to request a valid operation using the same valid encrypted message. Malicious users can also use a large number of blocks encrypted with the same key to find patterns that reveal information about the original message.

For this reason, block ciphers have different modes of operation. A mode of operation is responsible for mixing each block's ciphertext with some additional information, in order to prevent replay attacks and keep the encrypted data consistent. For example, Cipher Block Chaining (CBC) XORs each plaintext block with the ciphertext of the previous block (the first block is XORed with an initialization vector) before encrypting it. The problem with CBC and similar modes of operation is that the original message must be processed sequentially.

Fortunately, there exists a mode of operation that allows us to take advantage of data parallelism in encryption algorithms: Counter (CTR) mode. CTR uses a nonce (initialization values that are different for each execution of the encryption algorithm) and a counter; it combines them in some way (usually by XOR) and then encrypts the result using the secret key. The output of the encryption process (the keystream) is then XORed with the original message block, and the result of this operation is the ciphertext. So in this mode we do not actually encrypt the message; instead we add to it the "noise" that comes out of encrypting the counter and the nonce. The counter is simply a value that is guaranteed to be unique over a large number of blocks, so the most popular option is an actual counter that starts at 0 and increases by 1 for each block. The nonce introduces randomness into the output of the XOR with the counter and prevents replay attacks; it must be unique for each encryption process. To decrypt the encrypted data, the key and the nonce must be known.

Every encryption algorithm that wants to operate on multiple blocks in parallel has to run in CTR mode. The information needed for parallel execution is therefore the block number, the nonce, the key, and the block of data. A demonstration of CTR mode appears in figure 3.1.1 below. CTR is the mode we will use for our implementation.

Figure 3.1.1 - The CTR mode, which can process blocks in parallel for encryption (source: [9])

3.2 GPU advantages and disadvantages

First of all, we need to present the advantages and disadvantages of GPU implementations of encryption algorithms compared to the CPU.

The main disadvantage of a GPU implementation is that keystream data need to be repeatedly transferred from the GPU device to the host. To get good results, the communication bandwidth over the PCI Express bus between the two devices must be sufficiently high. The transfer operation is the bottleneck of many GPU algorithms because it is very costly. The initialization latency of a transfer is usually small, and the general trend is that transfer time grows linearly with the size of the data. So moving data in very large amounts has no real benefit, and it is also not possible because of the limited memory on GPUs [15]. Transferring data in very small amounts is also inefficient, because of the initialization latency mentioned above.

In the previous paragraph we discussed the problem of transferring data between the host and the GPU. Fortunately, when the encryption algorithm is executed in CTR mode, the only things we have to transfer from the host to the GPU are the secret key, the nonce and a counter offset. This is because in CTR mode we do not encrypt the original message but the combination of the nonce with the counter offset, so the time needed to transfer this data is insignificant. We just need to transfer back to the host an encrypted sequence (keystream) for each block, which is then combined with the original text on the CPU, usually by XOR.

GPUs are designed for fast, parallel operations on vectors of floating-point data, and this is where they are unbeatable. With the introduction of GPGPU-capable GPUs, the benefits of graphics hardware could be applied to more general operations, including integer arithmetic. The computational power of GPUs far exceeds that of CPUs. For example, the NVIDIA GeForce 9400m graphics card found in most Mac Minis delivers 54 GFLOPS (billions of floating-point operations per second), an extremely large number. The Intel Core Duo processor that accompanies the GeForce 9400m in the Mac Mini delivers 25 GFLOPS. GPUs clearly outperform CPUs in raw computational power, and this is their biggest advantage.

Another advantage is that encryption algorithms are very straightforward: they contain no branches, which makes them ideal for execution on GPU devices. As mentioned in previous chapters, all threads executed in parallel within a warp must execute the same instruction in order to take full advantage of the available parallelism. Since encryption algorithms contain no branches, we can rest assured that at any given time all threads execute the same instruction on different data, and as a result no thread needs to wait for other threads to finish executing a different code path.

3.3 Relevant work

Several encryption algorithms have been tested on GPUs in recent years, with various speedups. The results of these studies are very encouraging, and the GPU seems to be an ideal platform for the execution of encryption algorithms.

Before the appearance of OpenCL and CUDA, the traditional OpenGL graphics pipeline was used to tap the GPU's computational power. With the introduction of these frameworks, things became easier: the GPU can now be treated as a device similar to the CPU, and developers can offload the encryption process without needing very low-level graphics knowledge. We will look briefly at some traditional graphics-pipeline implementations, and at some CUDA/OpenCL implementations in more detail, since the latter is our approach in this dissertation. Most GPU encryption implementations choose the AES algorithm.

In [10], the Advanced Encryption Standard (AES) is implemented and tested on a GeForce 7900 GT, resulting in a 5-6x speedup over a CPU implementation running on an Intel Core 2 Duo (1.86 GHz); the encryption rate achieved was 12 Mb/s. In [14], an AES implementation built on the graphics pipeline and the Raster Operations Unit (ROP) achieves 108.86 Mb/s. Because fragment processors on pre-DirectX 10 hardware lack XOR support, the XOR operation in this case takes place in the ROP.

These implementations follow the traditional way of programming the graphics pipeline and use the vertex and fragment processors for parallel computations. Data are passed as texture elements to each fragment processor for independent processing, and results are stored in the screen frame buffer or in other textures. The OpenGL API is used for these operations.

In [11], NVIDIA's Compute Unified Device Architecture (CUDA) is used to create an implementation of AES-256 that gives a peak performance of 8.28 Gbit/s on a GeForce 8800 GTX, which contains 128 stream processors. The authors identified the bottleneck of their implementation as the transfer of data between the GPU and the host, due to the limited bandwidth of PCI Express. They chose a block size of 1024 bytes, and each processed block is loaded into shared memory for parallel processing. Their implementation is also faster when a large number of blocks is transferred to the GPU at a time. A very similar AES implementation appears in [17], this time using the OpenCL framework on a GeForce 8600 GT and an ATI FireStream 9270 (800 stream cores); the results show a speedup by a factor of 11 over a sequential implementation on a dual-core Intel E8500.

In this dissertation, a cipher comparable to AES, Salsa20, is going to be implemented. Unfortunately there are no relevant academic papers on GPU implementations of the Salsa20 algorithm, but the work presented in this section can be used as a starting point.

3.4 Implementation of Salsa20 & Results

For the purposes of this dissertation, we decided to use the Salsa20 encryption algorithm in CTR mode, optimized for execution on the GPU [5]. Salsa20 is a stream cipher developed by Daniel Bernstein. The reason we chose Salsa20 is that it is faster than AES: the 20 rounds of Salsa20 are faster than the 14 rounds of AES. For example, Salsa20 requires 3.93 cycles/byte while AES requires 9.2 cycles/byte at its best reported performance [27]. This makes Salsa20 ideal for systems that require high throughput, like backup systems. Also, Salsa20 is a stream cipher, which means it produces encrypted output of exactly the same size as the input. AES is a block cipher, meaning that the input size must be a multiple of the block size; to satisfy this condition, we usually need to pad the last block. This can be a problem because the encrypted output will be slightly bigger than the input, which can cause problems in systems that need to process a large number of files (like backup systems).

Salsa20's basic operations are XOR, rotation and 32-bit addition. To operate, it needs a 128- or 256-bit key, a 64-bit initialization vector and a 64-bit counter, and it consists of 20 rounds of mixing operations. Running in the CTR mode described in the previous section, it can operate on different 64-byte blocks in parallel. This feature, together with the fact that it contains many arithmetic and bitwise operations and no branches, makes it ideal for execution on the GPU.

The first step is to split the Salsa20 code in two parts. The first part is the encryption process that creates a block of keystream; the second is the actual mixing of the keystream with the original data (by XOR). The keystream is independent of the original data, so we can calculate it on the GPU and then transfer it back to the host to be XORed there. In this way we keep the original data in CPU memory and reduce the transfer time by moving only the generated keystream back to the host device.

Because each work item needs to know its counter number, the kernel takes, apart from the nonce and the secret key, the following parameters:

- a byte offset, which records how many bytes have been processed so far;
- a block size, which tells each work item how many bytes it is responsible for, so that it can create an appropriately sized keystream;
- the total number of bytes transferred to the GPU in this call. This is used when the total work is not exactly divisible by the block size, in which case the last work item must produce a smaller keystream.

To calculate its block number, a work item first determines its position among all work items of all work groups and then, taking the block size into account, adds the block offset. A demonstration of this method appears below:

uint myGroupId = get_group_id(0);
uint myLocalId = get_local_id(0);
uint gsize = get_global_size(0);   /* queried but unused below */
uint lsize = get_local_size(0);

/* byte range this work item is responsible for */
uint groupBlockSize = lsize * blocksize;
uint from = myGroupId * groupBlockSize + myLocalId * blocksize;
uint to   = from + blocksize - 1;

if (from >= totalbytes) return;
if (to >= totalbytes) to = totalbytes - 1;

/* 64-byte counter this work item starts from */
ulong myBlock = (bytesOffset / 64) +
                ((myGroupId * groupBlockSize) / 64) +
                ((myLocalId * blocksize) / 64);

Figure 3.4.1 - The calculation of the counter offset (“myBlock”) for each work item’s block
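The same index arithmetic can be checked on the host. The sketch below mirrors the kernel’s computation; the parameter values in the test are illustrative, not the values used in our experiments.

```c
#include <stdint.h>

/* Host-side mirror of the kernel's counter computation: given a work
 * item's group id, local id, the workgroup size, the per-item block
 * size in bytes and the byte offset of this batch, return the 64-byte
 * keystream block counter the item should start from. */
static uint64_t first_counter(uint32_t groupId, uint32_t localId,
                              uint32_t lsize, uint32_t blocksize,
                              uint64_t bytesOffset) {
    uint32_t groupBlockSize = lsize * blocksize;
    return bytesOffset / 64
         + ((uint64_t)groupId * groupBlockSize) / 64
         + ((uint64_t)localId * blocksize) / 64;
}
```

Consecutive work items thus receive contiguous, non-overlapping counter ranges, which is what keeps the parallel keystream identical to the sequential one.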


The nonce, byte offset and block size are passed in global memory and are used by all work items. All work items of a workgroup can read the nonce from the same memory address, which results in just one transaction. Based on previous work on other encryption algorithms, which found a relatively small optimal block size of 1024 bytes [11], we chose to process data through registers rather than shared memory for better performance. The results are written to global memory in chunks of 16 bytes (128 bits).

A very important issue is that we need to define an optimal block size. By “block size” we mean the amount of data that is distributed to each work item. A large block size will not cause private memory problems, since the keystream is generated in fixed-size blocks of 64 bytes, each of which is written to global memory before the same private memory is reused for the next keystream block. A large block size will, however, strain global memory because of the size of the generated keystream. We need to try different block sizes and find the optimal one for this method. Note that the optimal block size also depends on the hardware, so it has to be decided at runtime after querying the GPU device for the maximum number of work items within a workgroup; for different GPUs this value may vary, but not significantly. For example, suppose that our GPU device supports a total of X parallel threads and we choose a block size of Y; X·Y is then the amount of data that can be processed in parallel. To hide latency, we need to pass to the GPU a multiple Z of this size. The product X·Y·Z must be less than the total amount of available GPU memory, or the buffer allocation will fail.
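As a concrete instance of this constraint, the check below multiplies out X, Y and Z and compares the product against the device’s global memory; the numbers in the test are made up for illustration.

```c
#include <stdint.h>

/* Check whether a keystream buffer for X parallel threads, per-item
 * block size Y and latency-hiding multiplier Z fits in the device's
 * global memory. Returns 1 if the allocation would fit, 0 otherwise. */
static int fits_in_global_memory(uint64_t x_threads, uint64_t y_blocksize,
                                 uint64_t z_multiple, uint64_t mem_bytes) {
    return x_threads * y_blocksize * z_multiple < mem_bytes;
}
```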

Finally, for the transfer of data between the host device and GPU global

memory, pinned memory was used. Pinned memory can provide higher transfer

rates between the two devices which can reach 5GB/s on PCIe x16 Gen2 cards [12].

To test our implementation of Salsa20 we used two different graphics cards, the NVIDIA GeForce 9400m and the GeForce 9600m GT. We should note that these cards can be found in the laptops and desktops of typical users. The results from these two cards were compared to a single-threaded and a multithreaded implementation on an Intel Core 2 Duo at 2.26 GHz. The specifications of the GeForce 9400m and 9600m GT appear below:


Model                  GeForce 9400m    GeForce 9600m GT
Streaming Processors   16               32
Memory                 256 MB           256 MB
Clock                  1100 MHz         1250 MHz

Table 3.4.2 - Technical specifications of the graphic cards used for testing

In the next figure we present the resulting times of the encryption of a 200 MB

file. The times that appear include tests that used different block sizes. The block

size refers to the amount of data given to each work item for processing. Execution

times were averaged over 10 executions. We tried different block

sizes in the range of 64 bytes to 16 KB. Bigger block sizes were not tested because of

the limited GPU memory. In the next figure, and generally in all figures from now

on, we will use the following abbreviations:

- GF 9600m GT - for the execution on the NVIDIA GeForce 9600m GT using OpenCL.
- GF 9400m - for the execution on the NVIDIA GeForce 9400m using OpenCL.
- CPU 1-thread - for the sequential execution of the algorithm on the CPU (Intel Core 2 Duo).
- CPU OpenCL - for the multi-threaded parallel execution on the CPU (Intel Core 2 Duo) using the OpenCL framework. As mentioned before, OpenCL can be used to handle parallel execution on heterogeneous devices, including CPUs, by distributing work to the available cores. So, by using OpenCL for CPU execution we can take advantage of all available CPU cores and gain maximum performance on the CPU.


Figure 3.4.3 - Execution times of Salsa20 for all devices using different block sizes

The results we got are very interesting. We can see that the performance of both GPU devices is better than that of the single-threaded CPU implementation, and the execution times of the 32 streaming processors of the 9600m GT can compete with those of the multithreaded CPU implementation. The important point in this graph is that GPU performance is maximized for relatively small block sizes between 64 and 256 bytes, but it remains acceptable for sizes up to 2048 bytes. For very large block sizes the performance drops considerably. The main reason is that with large block sizes each thread has to write more data to global memory, and it becomes harder to hide memory latency. By using smaller block sizes we can exploit data parallelism between work items more easily and make sure that we are not losing performance to memory latency. The CPU implementations do not seem to be much affected by the block size. The best execution time of the 9600m GT is very close to the respective multithreaded CPU time. Finally, the best throughput achieved by a GPU implementation was that of the 9600m GT, 159 MB/s; the respective values were 180 MB/s for the multithreaded CPU implementation, 49 MB/s for the single-threaded CPU implementation and 93 MB/s for the 9400m. The throughput of the 9600m GT is almost double that of the 9400m, because the 9600m GT contains twice as many streaming processors.


Device              Throughput (MB/s)
GeForce 9400m       93
GeForce 9600m GT    159
CPU single thread   49
CPU OpenCL          180

Table 3.4.4 - Throughput measurements of execution of Salsa20 on different devices

In general, the results that we got show that GPUs with a small number of

streaming processors can be used effectively in order to achieve a high throughput

for the Salsa20 algorithm. The more stream processors are available, the better the throughput we can achieve; the results suggest that with more than 32 stream processors we could achieve even better times on the GPU.

Finally, the results cannot be directly compared to related work, for two reasons: first, there is no published relevant work on the Salsa20 algorithm on the GPU; second, related work in this field uses GPUs with far greater computational power and hundreds of cores, while we used GPUs with up to 32 stream processors, so we cannot make a fair comparison. For example, in [11] a throughput of 1035 MB/s is

achieved for the AES-256 using a GPU with 128 stream processors. Our best GPU

used 32 stream processors for Salsa20 and achieved 159 MB/s.


Chapter 4 Hashing on GPU

4.1 Background

Hashing algorithms have the ability to create a fixed-sized data sequence from a

variable-sized data sequence. In this section, we are going to deal with hashing

algorithms that are used to compute a message digest (fingerprint) of data

sequences. The main characteristics of hashing algorithms are that they can compute a fingerprint from a large data sequence quickly, that the reverse procedure is computationally infeasible, and that it is extremely unlikely for two different inputs to produce the same fingerprint.

These algorithms can help us to detect a transmission error or some other malfunction that resulted in the alteration of some of the original data.

For example, the digest of a file can be generated at some point; when someone else

wants to copy or download this file he can check if the downloaded file has the

same checksum as the original file. If not, then he knows that there was an error

during transmission and he can try again. The digest doesn’t have to be generated

for a whole file, but we can instead create and check the digests of different blocks of

data transmitted.

Another important point is that these algorithms are not internally parallelizable. The reason is that to compute the message digest of a file, we need to process all of its data through the hashing algorithm sequentially, so we cannot split the file into blocks and process them independently in parallel. This would only work if we kept the digest of each processed block of data, which could result in a lot of disk space occupied by


checksums of different blocks of the same file, instead of a single fixed-size digest for the whole file. Of course, this attribute of hashing algorithms is desirable, because files with the same content in a different order must generate different digests; every block of data processed must take into account the output of the previous blocks of the same data stream. In general, the high-level structure of hashing algorithms has this form:

1. Initialize the digest variables.
2. Process the next block of the data stream (fixed size, usually 512 bits).
3. Apply the hashing function to this block (which modifies the digest variables).
4. If there are more blocks to process from the same stream, go to step 2.
5. Output the digest variables (fixed size).
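The loop above can be sketched in C with a toy mixing function standing in for the real compression function; this illustrates the chaining structure only, and is not MD5 or SHA1.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 64  /* process the stream in fixed-size 64-byte blocks */

/* Toy stand-in for a real compression function (MD5/SHA1 use far more
 * elaborate mixing): folds one block into the running digest state. */
static uint32_t mix_block(uint32_t state, const unsigned char *block) {
    for (size_t i = 0; i < BLOCK; i++)
        state = (state ^ block[i]) * 16777619u;   /* FNV-1a style step */
    return state;
}

/* The sequential loop from the steps above: each block's result feeds
 * the next, which is exactly why the blocks of one stream cannot be
 * hashed in parallel. (Padding of a final partial block is omitted
 * for brevity.) */
static uint32_t toy_digest(const unsigned char *data, size_t n) {
    uint32_t state = 2166136261u;                 /* step 1: initialise */
    for (size_t off = 0; off + BLOCK <= n; off += BLOCK)
        state = mix_block(state, data + off);     /* steps 2-4 */
    return state;                                 /* step 5: output */
}
```

Because each call to mix_block consumes the previous state, block i+1 cannot start before block i finishes.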

It is easy to understand that we cannot parallelize this algorithm by processing

different blocks because each new block must use the modified variables of the

previous block.

So how can we take advantage of the parallel nature of GPUs in order to compute digests of data faster? There are two main approaches. The first is to give each GPU thread a different block of the same data stream in parallel and keep a digest for each of these blocks for later reference; as mentioned before, however, this would need a lot of extra disk space to store all the computed digests. The second is to use many independent data streams and let the GPU process one block from each data stream in parallel; at the next step, another block from each data stream is processed. In this way, all blocks that depend on each other are processed sequentially, but at the same time we can take advantage of GPU parallelism. Of course, this approach requires a large number of different data streams that can be processed in parallel and, in fact, this number must be much larger than the maximum number of concurrent threads within the GPU device to help hide memory latency.
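The second approach can be sketched on the host as follows; the inner loop over streams is the part a GPU would execute in parallel, one work item per stream, and the toy mixing function again stands in for the real hash round.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK 64
#define NSTREAMS 4

/* Toy one-block mixing step (stand-in for the real hash round). */
static uint32_t mix(uint32_t s, const unsigned char *blk) {
    for (size_t i = 0; i < BLOCK; i++)
        s = (s ^ blk[i]) * 16777619u;
    return s;
}

/* Lock-step schedule: at each step, advance every stream by one block.
 * Blocks of the same stream are still processed in order (preserving
 * the chaining), but the inner loop over streams is data-parallel:
 * on the GPU it becomes one work item per stream. */
static void hash_streams(const unsigned char *streams[NSTREAMS],
                         size_t nblocks, uint32_t out[NSTREAMS]) {
    for (int s = 0; s < NSTREAMS; s++)
        out[s] = 2166136261u;                      /* init each digest */
    for (size_t b = 0; b < nblocks; b++)           /* sequential in b  */
        for (int s = 0; s < NSTREAMS; s++)         /* parallel over s  */
            out[s] = mix(out[s], streams[s] + b * BLOCK);
}
```

The lock-step order produces exactly the same digests as hashing each stream on its own, which is what lets the GPU exploit parallelism without breaking the chaining.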


4.2 GPU advantages and disadvantages

For the purposes of this project, the MD5 and SHA1 algorithms [7][8] were chosen for testing on the GPU. The structure of these algorithms is similar to the one described in the encryption section. MD5 and SHA1 do not contain branches, and are based on arithmetic and bitwise operations such as XOR, AND, OR, NOT, left bit rotation, right shifting and addition modulo 2^32. Another very important advantage of

hashing algorithms is that the output of a large data sequence is very small (128 to

512 bits depending on the algorithm). This minimizes the time needed to transfer

the results from the GPU device back to the host. In fact, the MD5 algorithm

produces a 128-bit digest and the SHA1 algorithm a 160-bit digest. We already know

that data transfer to and from the GPU device can be a bottleneck but in the case of

hashing we do not have to worry a lot about moving data back to the host because

the output of each block has a small fixed size.

The biggest disadvantage of hashing algorithms is their sequential nature that

does not allow us to operate on different blocks of the same data stream in parallel.

We can, however, operate in parallel on different data streams. Additionally, another

disadvantage is the large number of blocks that we need to transfer to the GPU.

Although we do not need to transfer back a lot of information, the amount of data

transferred to the GPU can still be a bottleneck.

4.3 Relevant work

A lot of background work on GPU hashing in industry has focused on cracking digests. Many programs available on the Internet can use a system's available GPU devices to crack MD5 and SHA1 password digests. A digest cracker tries to find a data sequence that, when processed through a specific hashing function, results in a given digest. The way to do this is to calculate the digests of many relatively small data sequences until the result

matches the given digest. This is the approach that we discussed in the previous section, which processes many different data streams in parallel. The most well-known program available is Lightning Hash Cracker by Elcomsoft, which reaches a brute-force peak performance of 608 million passwords per second on a GeForce 9800 GX2 (2 × 128 stream processors) [19].

In the academic literature, there is a limited number of published papers on MD5 or SHA1 hashing on the GPU; most academic work so far on algorithms such as MD5 and SHA1 has followed an FPGA-based approach. In [20], there is a detailed implementation of the MD5 algorithm on the GPU which computes MD5 digests of small, equally sized blocks of data in parallel. Again, the main bottleneck of the implementation appears to be the small bandwidth of PCI Express compared to the computational power of the GPU device. Each thread is assigned a 512-bit space of shared memory that it uses to store each processed chunk of data for further processing. The main limitation of this approach is that, due to the limited shared memory (16 KB), the implementation can only be tested with workgroups of fewer than 256 work items; a bigger number of work items would require a bigger shared memory. The results of this work show a peak performance of 1400 Mbps for a large input size on an NVIDIA GeForce 9800 GTX+ (128 stream processors). Other implementations [18] use the constant memory, which can be fast because of the constant memory cache located on-chip. In that paper the SHA-1 algorithm is implemented on the GPU and achieves a rate of 2.5 GB/s on an NVIDIA GeForce 9800 GTX+ (128 stream processors).

4.4 Implementation of MD5 and SHA1 & Results

For the MD5 algorithm, the "RSA Data Security, Inc. MD5 Message Digest Algorithm" [7] was used as a starting point. Some modifications were needed for the code to compile for execution on the GPU: the removal of unsupported code, and a duplication of the “Md5Update” function so that it can support pointer parameters that refer to different address spaces (vector variables in registers and GPU global memory). For the SHA1 algorithm, a simple implementation found in [26] was used.

For both algorithms a similar approach is used. Data are passed to the GPU’s global memory in large blocks. The hardware scheduler of the GPU then creates workgroups according to the given parameters, and each work item of a workgroup can identify its position in a similar way as in the encryption implementation described in the previous chapter. A large file of 200 MB was used to run the tests and to simulate parallel operation on multiple data streams. The modified code was compiled as an OpenCL kernel. We decided to use registers for the processing of our data. We knew from the beginning that this would force us to use small block sizes, but the parallel nature of the GPU can support this decision, and by using registers we are sure that we will have very low latency when reading our data. Each thread reads its assigned block in small pieces that it processes sequentially. The size we chose for these pieces was 16 bytes, because with this size we can use the built-in OpenCL vector type “char16” and achieve aligned access to global memory. The same vector type was used when storing the calculated digest back to global memory. The digest of MD5 is exactly 16 bytes (128 bits); the digest of SHA1 is 20 bytes (160 bits).

Again for the transfer of data between the host device and GPU global memory,

pinned memory was used just like in the encryption implementation. Pinned

memory can provide higher bandwidth.

For the testing procedure, we used the same graphics cards and CPU as in the encryption tests (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz); for specifications please refer to table 3.4.2. In the next figures we present the resulting times of the MD5 and SHA1 hashing of a 200 MB file. Note that in both the GPU and CPU implementations, each block of data was treated as a separate data stream in order to simulate an environment with multiple independent data streams. The times shown include tests that used different block sizes; the block size refers to the amount of data given to each work item for the calculation of an independent MD5 hash. The execution times were averaged over 10 executions.


Figure 4.4.1 - Execution times of MD5 for all devices using different block sizes

In figure 4.4.1, the results of the MD5 algorithm are presented. We can see that for small block sizes, the single-threaded CPU implementation appears to be faster than the 9400m. As the block size grows, the 9400m takes a significant lead over the single-threaded CPU implementation, although beyond a block size of 4 KB there is no significant further improvement for the 9400m. The CPU execution times appear to be almost independent of the block size: both the single-threaded and the OpenCL CPU implementations are not much affected by it. The performance of the 9600m GT is almost twice that of the 9400m; the big difference in execution times between them comes from the difference in their number of stream processors (16 vs 32) and in their clock frequencies. Both GPU implementations are faster than the single-threaded CPU one. The multithreaded CPU implementation is the fastest, but by using a more powerful GPU with more stream processors we could get a speedup.

From figure 4.4.1 we conclude that an optimal block size for each work item in the MD5 GPU implementation is between 1024 and 4096 bytes. Very small block sizes are not good for GPU implementations, because more and more work items then require transactions with global memory in order to read their data. In this case, hiding latency is not very efficient because of the small number of computations each work item performs compared to the amount of data read from and written back to global memory. For example, a block size of


8192 bytes requires one 128-bit write transaction for every 8192 bytes processed, while a block size of 64 bytes requires 128 times as many such transactions. The difference from the encryption algorithm discussed in the previous chapter is that in this implementation each work item also needs to read its data from global memory, and this appears to be the bottleneck here.

Table 4.4.2 shows the throughput achieved, measured in MB/s. The maximum GPU throughput, 107.5 MB/s, was observed with the GeForce 9600m GT.

Device              Throughput (MB/s)
GeForce 9400m       57.2
GeForce 9600m GT    107.5
CPU single thread   48.8
CPU OpenCL          190.5

Table 4.4.2 - Throughput measurements of execution of MD5 on different devices

To conclude the MD5 section, we can say with certainty that GPU devices with a small number of stream processors, available in most desktops and laptops, can be used efficiently for MD5 computations, and can also co-operate with the CPU for maximum results. At least 32 stream processors are desirable in order to achieve good performance.

Figure 4.4.3 and table 4.4.4 below present the results of the SHA1 implementation. The results are quite similar to those of MD5, which is natural since SHA1 is based on the principles of MD5, and the analysis is also similar. The general trend is that execution times improve as the block size grows, but beyond a block size of 512 bytes there is no significant improvement. Again the multithreaded CPU implementation is the fastest, but execution times on the GPU devices improve as the number of stream processors grows (16 for the 9400m vs 32 for the 9600m GT). So a GPU device with 32 or more stream processors can genuinely assist or replace the CPU in SHA1 hashing computations; the 32 stream processors of the 9600m GT seem to be enough to replace the CPU in the calculation of SHA1 digests.

Figure 4.4.3 – Execution times of SHA1 for all devices using different block sizes

Device              Throughput (MB/s)
GeForce 9400m       51.9
GeForce 9600m GT    123.5
CPU single thread   30.4
CPU OpenCL          155

Table 4.4.4 - Throughput measurements of execution of SHA1 on different devices


Chapter 5 Compression on GPU

5.1 Background

Compression is an essential operation. A lot of data are compressed every day in order to reduce their size and make them more suitable for transfer over the Internet. There are two different types of compression: lossy and lossless. Lossy compression refers to algorithms that reduce the size of a file at a cost in quality; it is used on photos, sound, video and, more generally, on files whose main characteristics are still recognizable even at reduced quality.

On the other hand, lossless compression refers to algorithms that reduce a file’s size in such a way that, after decompression, we get back exactly the file that was originally compressed. This kind of compression is mostly used on files such as text files, executables etc. In this section, we are going to investigate further the prospects of lossless data compression on the GPU. Many compression algorithms take advantage of the fact that data sequences contain large identical sub-sequences that we can encode with smaller representations. We are going to implement the dictionary-based Lempel–Ziv 78 (LZ78) algorithm [13] for execution on the GPU, so this is a good place for a brief description of the algorithm. Dictionary-based algorithms are often used because of their simplicity, and simple algorithms operate better on the GPU.

The LZ78 algorithm uses a dictionary that it updates while traversing the input, and it also keeps track of the longest sequence found so far in the dictionary (called the prefix). Input is processed byte by byte. Each time a new character is read, a search takes place to determine whether the sequence {prefix + new character} is present in the dictionary. If it is, we extend the prefix with the new character and keep reading characters, following the same procedure, until a match in the dictionary can no longer be found. At that point, we add a new dictionary entry containing the sequence {prefix + new character}, reset the prefix, and output the pair {position of the prefix in the dictionary + new character}. This is a compressed sequence. The procedure continues, constantly updating the dictionary with new sequences and outputting references to it, until there is no more input. The opposite operation, decompression, follows the same technique by constructing an identical dictionary and following the references.
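A minimal host-side C sketch of this procedure appears below. It uses a flat (parent, character) dictionary searched linearly, in the spirit of a fixed-size per-work-item dictionary, but for brevity it simply stops adding entries when the dictionary is full; the names and capacity are illustrative, not taken from our kernel.

```c
#define MAX_DICT 256   /* fixed dictionary capacity */

typedef struct { int prefix; unsigned char ch; } Token;

/* Compress `n` bytes of `in` into LZ78 tokens; returns the token count.
 * A token (p, c) means: the dictionary phrase p (0 = empty prefix)
 * followed by the literal character c. */
static int lz78_compress(const unsigned char *in, int n, Token *out)
{
    int dict_parent[MAX_DICT];
    unsigned char dict_char[MAX_DICT];
    int dict_size = 0, ntok = 0, prefix = 0;

    for (int i = 0; i < n; i++) {
        unsigned char c = in[i];
        int match = -1;
        /* is {prefix + c} already in the dictionary? */
        for (int j = 0; j < dict_size; j++)
            if (dict_parent[j] == prefix && dict_char[j] == c) { match = j; break; }
        if (match >= 0) {
            prefix = match + 1;            /* extend the prefix */
        } else {
            out[ntok].prefix = prefix;     /* emit {prefix, c} */
            out[ntok].ch = c;
            ntok++;
            if (dict_size < MAX_DICT) {    /* remember the new phrase */
                dict_parent[dict_size] = prefix;
                dict_char[dict_size] = c;
                dict_size++;
            }
            prefix = 0;
        }
    }
    if (prefix != 0) {                     /* flush a trailing prefix */
        out[ntok].prefix = prefix;
        out[ntok].ch = 0;
        ntok++;
    }
    return ntok;
}
```

Decompression rebuilds the same dictionary from the token stream and expands each reference.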

5.2 GPU advantages and disadvantages

After getting a clearer understanding of how lossless compression algorithms work, we will present which of their characteristics prevent the full exploitation of the GPU’s computational power, and how we can deal with these problems. Note that the problems of moving data to and from the GPU, discussed in the sections on encryption and hashing, also apply here.

Synchronization. The main idea behind compression algorithms is to find repeated sequences of characters in a file and replace them with a shorter representation depending on their frequency in the file. This operation is optimized when we have a central dictionary structure that controls the execution of the algorithm and maximizes the compression ratio by keeping as much information as possible. As mentioned in previous chapters, a GPU wants to execute a lot of threads in parallel, which means that these threads must operate on independent data. It also means that a thread cannot use information gathered by other threads unless there is some kind of synchronization between them, which would slow down the whole procedure; shared data would then have to be moved back and forth between the GPU and the host device in order to feed the next blocks, making the algorithm even more complex. The only efficient way to implement a lossless compression algorithm on the GPU is to sacrifice compression ratio in order to get the desired parallelization. This can be done by compressing different blocks of data independently, treating them as different streams of data. This reduces the compression ratio a little but speeds up the whole procedure.

Complex and branched algorithms. Compression algorithms contain a lot of branches in their code, a lot of “if” and “while” statements that sometimes force different threads to follow different paths of execution. As a result, different threads execute different instructions, which leads to sequential execution of some parts of the code across threads. There is not much we can do to avoid this in a GPU implementation, so it is an important disadvantage. Also, compression algorithms contain hardly any arithmetic operations; they are all about searching for patterns, so we cannot take advantage of the arithmetic power of GPUs.

Another important issue when dealing with GPUs is the limited memory supplied, and the restrictions on memory allocation in current parallel programming frameworks for GPUs such as OpenCL and CUDA. Dynamic memory allocation is not supported inside running kernels, so we need to know the size of the current block in advance. When dealing with compression and decompression, the amount of memory needed for the compressed/decompressed data is not always known in advance. A way to overcome this problem is to adopt some conventions. For example, the compressed form of a block of data can be given a maximum size equal to the original size plus some header information about the compressed block. To decompress a block, we then need to know the size of the original block in advance, by reading the appropriate header information, so that we can easily allocate the memory required for decompression. Apart from this, compression algorithms need to allocate memory for a number of sub-operations, which requires re-implementing the compression algorithm to follow the GPU framework standards. A successful GPU implementation must supply enough pre-allocated memory to the (de)compression kernel to successfully (de)compress all blocks without running out of memory resources. The limited GPU global memory and the large number of concurrent threads that deal with different blocks is an important problem that needs to be solved.

All of the problems discussed above, plus the complex nature of compression algorithms, must be taken into account. The main structure of the algorithm needs to be optimized and modified in order to satisfy all GPU restrictions and to take advantage of all GPU benefits.

5.3 Relevant Work

There are no directly relevant academic papers on lossless data compression on the GPU. On the contrary, there are many research papers on lossy compression, and especially on lossy image compression on the GPU, because GPUs are optimized for handling image data. The absence of academic papers can be explained by the nature of lossless compression algorithms: as described in the previous section, these algorithms do not fit well on the GPU architecture.

Nevertheless, there is relevant work on parallel block compression in general, which is the method that we will use in the implementation part. In [22] a parallel block compression approach is used to speed up dictionary-based compression algorithms. Because the parallel processing of blocks with independent dictionaries may result in a reduced compression ratio, a joint dictionary construction is proposed in which different compression processes reference a shared dictionary.

A well-known block compression program is bzip2 [23], which uses a combination of famous compression algorithms including the Burrows–Wheeler transform [24] and Huffman coding [25]. It operates on blocks and compresses each block independently. The problem is that it uses large blocks, usually between 100 and 900 KB, which makes it a bad candidate for GPUs with their limited memory.

5.4 Implementation of LZ78 & Results

For the compression algorithm, LZ78 was chosen. Before this choice, several other compression libraries were examined, such as bzip, gzip and others, but these libraries were too complicated for the GPU architecture: too much code, with many branches and heavy memory operations. This is why we decided to implement an LZ78 version that fits well on the GPU and then test it in practice. Dictionary-based compression algorithms are often used because of their simplicity. We must note that this implementation was created for the GPU architecture; CPU implementations can be a lot faster because of the CPU’s large memory and freedom of memory allocation. For the purposes of this dissertation, we decided to create an implementation that fits the GPU architecture and to test it on several devices.

Our main concern was to find ways to speed up the compression process as much as possible. From the beginning, it was clear that our bottleneck would be the transfer process to and from the GPU. For this reason, we have to choose a relatively large global size of data to be compressed each time, with respect to the total available memory of the GPU device. Of course, we must keep in mind that these parameters depend on our hardware and on the PCI Express bandwidth; on different systems, we need to make sure that the full bandwidth is used.

The main idea for compression on the GPU is to split the data into blocks and compress them independently in parallel. We can follow two approaches here: either give a block of data to a workgroup, or give a block of data to each work item. The first approach can lead to a better compression ratio, but it needs some kind of synchronization between work items. The idea is to create a shared dictionary for each workgroup that all work items within it can update and reference. The problem with this approach is that the synchronization introduces delays and reduces the effective use of parallelism. This approach was not implemented, but it can be considered as potential future work of this dissertation.

The second approach, assigning an independent small block to each work item, is faster but results in a reduced compression ratio. For the implementation, we use this approach.

Another issue is the dictionary size of each work item. LZ78 uses a dynamic dictionary built during compression, but because of memory constraints on the GPU we must limit its size. The larger the dictionary, the better the compression ratio we can achieve, but due to the large number of threads the GPU needs in flight, each dictionary has to be small. When the dictionary is full and we want to add a new entry, we do so by replacing the oldest entry of the dictionary

with the new sequence. Instead of using registers to store the dictionary, we could also use the shared workgroup memory, which has a larger capacity (usually 16 KB) and can be as fast as registers when there are no memory bank conflicts between threads requesting a transaction. Shared memory, unlike global memory, can serve multiple transactions (up to 16) from different work items in parallel. For our implementation, we chose to bypass the shared memory and copy small chunks of data into registers each time for faster execution.

For the current implementation, we chose a small dictionary size of 256 entries for a number of reasons.

1. The first and most important reason is the limited GPU memory. Each work item must have a small dictionary if we want to guarantee that we will not run into memory problems.

2. The second reason is that we need only a small number of bits to represent a reference into the dictionary: a 256-entry dictionary can be referenced with 8 bits.

3. Finally, our implementation uses a sequential search to find a match in the dictionary, so a large dictionary would increase the search time.

As we said before, the OpenCL framework does not support dynamic memory allocation, which is a problem for compression/decompression functions because we cannot know the compressed and decompressed sizes in advance. To bypass memory allocation issues, we adopt some conventions.

When each work item completes the compression of a block of data, it also saves the compressed data size. In this way, at decompression time the decompression function knows that reading the next compressed-data-size bytes will yield a decompressed sequence of the fixed block size, so the memory needed can be pre-allocated.

For the output, buffers of the same size as the input data were pre-allocated to store the compressed data. We adopt the convention that if compressing a block would produce output larger than the input, the input is stored unchanged.

For the testing procedure, we used the same graphics cards and CPU as in the encryption chapter (GeForce 9400m, GeForce 9600m GT, Intel Core 2 Duo at 2.26 GHz). For specifications, please refer to table 3.4.2.

In this section we present the times for compressing a 9.3 MB file with our LZ78 implementation. The times shown include tests with different block sizes, where the block size is the amount of data given to each work item for compression. For all tests, a dictionary size of 256 entries was used.

Figure 5.4.1 - Execution times of LZ78 for all devices using different block sizes

In figure 5.4.1, we can see the results of the implemented LZ78 algorithm. On the GPUs, execution time is lowest for relatively small block sizes, between 128 and 1024 bytes. This is because small blocks do not contain enough information to take full advantage of the dictionary; as a result there are fewer dictionary replacements, which causes fewer threads to follow divergent paths. As the block size grows, more and more threads diverge. The 9400m is always slower than the sequential CPU implementation. The 9600m GT does better: its execution times are reduced by 50% compared to those of the 9400m, which again can be explained by the difference in stream processors (16 vs. 32). The performance of the 9600m GT is always better than the

sequential CPU implementation, but for large block sizes the performance drops. In general, the LZ78 algorithm performs best when running as a multithreaded CPU program (OpenCL CPU).

Before drawing any conclusions, we have to look at how these block sizes behave in terms of the compression ratio achieved. The next figure presents the compressed size achieved for the same 9.3 MB file after parallel block compression with different block sizes.

Figure 5.4.2 - Compressed size achieved with different block sizes using our specific LZ78 implementation with a small fixed-size dictionary

We can see that for very small block sizes, the compressed size remains large, nearly unaffected. This is because small blocks do not allow the algorithm to fill all available positions of the dictionary. The chosen dictionary size was 256 entries, so blocks of 64, 128, 256 or 512 bytes cannot take full advantage of it, since each entry can hold several bytes. Fewer used dictionary entries mean fewer possible compressed sequences. That is why we see an improvement after a block

size of 512 bytes. A CPU implementation with an unbounded (or very large) dictionary would give much better compressed sizes.

From figures 5.4.1 and 5.4.2, we can state that, for the specific parameters we selected, an optimal block size for each work item lies between 512 and 1024 bytes, because these sizes give good execution times and a relatively good compression ratio.

To conclude, the results show that GPU memory limitations can severely hurt the resulting compressed size. Moreover, the nature of compression algorithms does not allow full exploitation of GPU computation power. GPUs are not yet ready for this task.

Chapter 6 Putting it all together

In this chapter, we examine how some of the algorithms discussed earlier can be combined on the GPU in order to process a single stream of data more efficiently. We already know that a stream of data can be divided into small blocks for parallel encryption and compression. Hashing algorithms, on the other hand, are strictly sequential and have to operate on each block in order; combining a sequential algorithm with parallel ones is not optimal on the GPU.

So in this section we discuss how compression and encryption can be combined on the GPU for maximum performance. The idea is to move blocks to the GPU, compress them, then encrypt them and finally transfer them back to the host. By combining these two operations on the GPU, we reduce the time required to transfer data between the host and the GPU compared to executing encryption and compression independently (figure 6.1). A compressed stream would also mean less data to encrypt; unfortunately, the exact size of the compressed data cannot be known in advance, so buffers must be allocated, and data transferred back, for the worst possible scenario. In figure 6.1, the red arrows represent operations that require recurrent data transfers and make heavy use of the PCI Express bandwidth, while the green arrows indicate operations that complete without such transfers. From figures 6.1a and 6.1b it is clear that by combining encryption and compression we reduce the total time needed to move data between the two devices: heavy transfers over PCI Express are reduced from 3 to 2.

Figure 6.1 - (a) Encryption and compression executed separately,

(b) Combined execution

Our goal at this point is to decide on an efficient block size that fits both encryption and compression. According to the results presented in the encryption chapter, small block sizes up to 2048 bytes give the best performance. On the other hand, the compression results indicate that block sizes smaller than 1024 bytes suffer from a reduced compression ratio. In general, larger block sizes improve the compression ratio, but if we take into account the limited memory of GPUs we soon realize that we cannot use very large blocks: we need a very large number of threads in flight, each assigned to its own block of data. So the limited GPU memory prevents us from satisfying both conditions, and an efficient block size for both encryption and compression seems to lie between 1024 and 2048 bytes. The procedure of this combined operation appears in figure 6.2. Each work item is responsible for a block of data equal to the chosen block size: it compresses the block, then encrypts the output of the compression, and finally stores the output size and the compressed/encrypted block (C/E) in the appropriate place in global memory. The size information is needed because the host must know the output size of each block in order to recover it; it is also needed for the decryption/decompression operation.

Figure 6.2 - Each work item (Wn) compresses a block and then encrypts the

compressed output

Chapter 7 Discussion

In previous chapters we examined algorithms and developed GPU implementations of them. In this chapter we discuss the results in detail and evaluate them critically. The algorithms were of different natures, and some of them (the compression part in particular) had to be re-implemented from scratch in order to fit the GPU architecture.

We investigated three different categories of algorithms: hashing, encryption and compression. We also examined how encryption and compression can be executed on the GPU with a single call by determining a block size efficient for both. For hashing and encryption, the available algorithms are all very similar, so the results can be generalized somewhat beyond Salsa20, MD5 and SHA1. The compression implementation, on the other hand, was the trickiest: there exist many compression algorithms, each based on a different approach, which may or may not fit the GPU. We chose an algorithm that was relatively simple and could be parallelized easily, but at the cost of speed and compression ratio.

The results for hashing and encryption are very straightforward: the GPU implementations are much more effective than a single-threaded CPU version. The results also show that more powerful GPUs can easily overtake a multithreaded CPU implementation. Mid-range GPUs can also be very efficient in these tasks and assist or replace CPUs. For Salsa20, we achieved acceptable results for small block sizes between 64 and 2048 bytes; in fact, block sizes of 64 and 128 bytes seem optimal for our implementation. MD5 and SHA1 peaked at block sizes of 1024 or 2048 bytes, with acceptable performance in the range of 512 to 4096 bytes.

The results for the compression part are not very encouraging. As explained in the relevant section, some characteristics of the compression algorithm, such as the compression ratio, suffer. The block sizes that gave efficient performance, considering both speed and compression ratio, were 1024 and 2048 bytes. Bigger blocks improved the compression ratio but hurt speed.

Finally, the combined execution of the encryption and compression operations improves performance. This is a natural result, because every block of data stays on the GPU longer and is used for more computations, so the ratio of computation to data movement increases; this is the whole point of parallelism on the GPU: perform more computations per unit of transferred data. It would be good if the hashing part could also be combined with the other two operations on the GPU, but, as mentioned before, its sequential nature prevents this: a large block of data can be divided into sub-blocks for independent encryption and compression, but not for the calculation of a digest with a hashing algorithm.

At this point, we would also like to discuss the results in light of our primary motivation, which was to use GPU computation power to assist the CPU in the operations required by a backup system (hashing, encryption, compression). Hashing and encryption were very promising on the GPU, but compression had many problems. An efficient backup system could therefore use the CPU to compress files and then send them to the GPU for encryption and hashing in a pipelined way. According to the results, efficient systems need GPU devices with 32 or more stream processors. In general, taking all the results of this dissertation into account, we can state that the performance of each algorithm improves by nearly 50% when the number of available stream processors is doubled from 16 to 32.

7.1 Project difficulties

During the implementation phase of this project we ran into a number of difficulties. In this section we discuss the most important of them.

For the purposes of this project we had to implement a number of algorithms of different kinds. We found existing implementations that we tried to modify to fit the GPU architecture, but they were designed for optimal execution on the CPU, and the GPU compiler does not support the entire C language. For example, memory functions such as memcpy are not available in GPU kernels, so wherever memory had to be copied we replaced these calls with manual copy loops.

Another difficulty was that the GPU has several distinct address spaces (described in previous chapters), so for optimal execution we had to move data between them explicitly.

The debugging process also proved much more difficult than we expected. GPU devices do not currently support output functions such as printf, so checking the contents of variables at runtime was not easy. We had to create an extra buffer in GPU global memory, store any information needed for debugging there, and then inspect it by transferring it back and printing it on the host. The problem with this approach is that when a bug made the kernel crash, we could not reach the point where the data is sent back to the host; in that case, we had to execute small parts of the kernel at a time until we reached the point of the problem.

As in most parallel and distributed systems, debugging many instances executing in parallel was difficult. We had to coordinate the execution of hundreds of threads, which was hard at first, but only until we had our first algorithm running; the same method of coordination and debugging was then used for all algorithms. The compression algorithms were the most difficult to modify to fit the GPU, because of their complex memory operations and large code size. For this reason, a simple implementation of the LZ78 compression algorithm was created.

7.2 Future Work

The subject of this dissertation covered several different areas of study: hashing, encryption and compression. We did our best to create algorithms that execute efficiently on the GPU, but there is always room for improvement.

During our study of the behavior of such algorithms on the GPU, we found very limited information and academic literature on data compression on the GPU. Because of the limited time available for this project, we could not go very deep into this area, but we think this research can be used as a starting point for future implementations. The approach proposed for the LZ78 algorithm, with a shared, synchronized dictionary between work items of the same workgroup, can be examined as future work. Other, more efficient dictionary search techniques, such as hash tables, could also be tried instead of the sequential search; the limited GPU memory and the lack of dynamic memory allocation prevented us from following this approach. Research into building an efficient fixed-size hash table on the GPU would be very helpful for the LZ78 algorithm and could speed up the process by a large factor.

As future work, we could also test these algorithms on more powerful, high-end GPUs. The GPUs we used for our testing (NVIDIA GeForce 9400m, NVIDIA GeForce 9600m GT) were entry-level and mid-range devices, but they served well the purpose of this dissertation, which was to examine whether laptop and desktop GPUs can be used to speed up these operations.

Another possible extension of this dissertation is to investigate in detail the different ways in which the CPU and the GPU can cooperate to achieve maximum performance for hashing, encryption and compression in a pipelined fashion: how these operations can be synchronized, and what speedups can be achieved over a pure CPU implementation.

Chapter 8 Conclusion

The computation power of GPU devices grows year by year. As this power grows, more and more computationally intensive fields start to use it to achieve greater speedups. Encryption and hashing algorithms have already been tried on the GPU architecture and showed great speedups, though most of these were achieved with expensive high-end GPUs with very large numbers of stream processors and high clock frequencies. In this dissertation, we showed that even entry-level and mid-range GPUs can be used effectively for encryption and hashing; the results we obtained from the Salsa20 and MD5 algorithms are very encouraging. Unfortunately, there are fields, such as compression, that are not yet ready to take full advantage of GPU devices: compression algorithms must be implemented with many restrictions in mind in order to run on GPUs, and these restrictions cost speed and compression ratio. In general, we can say that GPUs with 32 or more stream processors can serve as powerful computation devices for any algorithm that involves intensive computations.

There is a great deal of unexploited computation power in most users' desktop and laptop GPUs today. As our results show, this power can be used to improve the performance of many algorithms. In previous chapters, we referred many times to the limited GPU memory; we believe that in a few years this will no longer be a problem, as GPUs acquire bigger and faster memories. As a result, we believe that in the near future GPUs will be an essential computation device in every user's computer, either assisting CPUs in computation-intensive problems or even replacing them.

Bibliography

[1] P. Anderson and L. Zhang, “Fast and Secure Laptop Backups with Encrypted

De-duplication”, under publication in 24th Large Installation System

Administration Conference (LISA 2010), San Jose, CA, November 7–12 2010.

[2] Intel, “Intel microprocessor export compliance metrics”,

http://www.intel.com/support/processors/sb/cs-023143.htm

[3] GPU Gems 2, “Chapter 32. Taking the Plunge into GPU Computing”,

NVIDIA Corporation, 2009,

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter32.html

[4] “OpenCL Programming Guide for the CUDA Architecture, Version 3.1”,

NVIDIA Corporation, 2009.

[5] D.J. Bernstein, “The Salsa20 Family of Stream Ciphers”, New Stream Cipher

Designs: The eSTREAM Finalists, Springer-Verlag, 2008, pp. 84-97.

[6] T. Xie and D. Feng, “How to Find Weak Input Differences for MD5 Collision

Attacks”, Cryptology ePrint Archive, Report 2009/223, 2009.

[7] R. Rivest, “The MD5 Message-Digest Algorithm”, RFC 1321, MIT and RSA

Data Security, Inc., 1992.

[8] D. Eastlake and P. Jones, “US Secure Hash Algorithm 1 (SHA1)”, RFC 3174,

Motorola and Cisco systems, 2001.

[9] Wikipedia, “Block cipher modes of operation”,

http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation

[10] N. Pilkington and B. Irwin “A Canonical Implementation Of The Advanced

Encryption Standard On The Graphics Processing Unit”, In the Innovative

Minds Conference, Johannesburg, South Africa, 7 - 9 July 2008.

[11] S. Manavski, “CUDA Compatible GPU as an Efficient Hardware Accelerator

for AES Cryptography”, Signal Processing and Communications, 2007.

ICSPC 2007. IEEE International Conference on, 2007, pp. 65-68.

[12] “NVIDIA OpenCL Best Practices Guide, Version 1.0”, NVIDIA Corporation,

2009.

[13] J. Ziv and A. Lempel, “Compression of individual sequences via variable-rate coding”, Information Theory, IEEE Transactions on, vol. 24, 1978, pp. 530-536.

[14] O. Harrison and J. Waldron, “AES Encryption Implementation and Analysis

on Commodity Graphics Processing Units”, Proceedings of the 9th

international workshop on Cryptographic Hardware and Embedded

Systems, Vienna, Austria: Springer-Verlag, 2007, pp. 209-226.

[15] AccelerEyes, “GPU Memory Transfer”,

http://wiki.accelereyes.com/wiki/index.php/GPU_Memory_Transfer/

[16] OpenCL - The open standard for parallel programming of heterogeneous

systems, Khronos Group, www.khronos.org/opencl/.

[17] O. Gervasi, D. Russo, and F. Vella, “The AES Implantation Based on OpenCL

for Multi/many Core Architecture”, Computational Science and Its

Applications (ICCSA), 2010 International Conference on, 2010, pp. 129-134.

[18] Lin Zhou and Wenbao Han, “A Brief Implementation Analysis of SHA-1 on

FPGAs, GPUs and Cell Processors”, Engineering Computation, 2009. ICEC

'09. International Conference on, 2009, pp. 101-104.

[19] Lightning Hash Cracker, ElcomSoft Co.Ltd.,

http://www.elcomsoft.com/lhc.html

[20] Guang Hu, Jianhua Ma, and Benxiong Huang, “High Throughput

Implementation of MD5 Algorithm on GPU”, Ubiquitous Information

Technologies & Applications, 2009. ICUT '09. Proceedings of the 4th

International Conference on, 2009, pp. 1-5.

[21] NVIDIA, “EXT_gpu_shader4 OpenGL extension”, 2007,

http://developer.download.nvidia.com/opengl/specs/GL_EXT_gpu_shader4.txt

[22] P. Franaszek, J. Robinson, and J. Thomas, “Parallel compression with

cooperative dictionary construction”, Data Compression Conference, 1996.

DCC '96. Proceedings, 1996, pp. 200-209.

[23] Bzip2 compression algorithm, Julian Seward, http://www.bzip.org/

[24] M. Burrows, D.J. Wheeler, M. Burrows, and D.J. Wheeler, “A block-sorting

lossless data compression algorithm”, Technical Report 124, Digital

Equipment Corporation, 1994.

[25] D. Huffman, “A Method for the Construction of Minimum-Redundancy

Codes”, Proceedings of the IRE, vol. 40, 1952, pp. 1098-1101.

[26] Secure Hashing Algorithm (SHA-1) C implementation, Packetizer Inc.,

http://www.packetizer.com/security/sha1/

[27] D. J. Bernstein, “Why switch from AES to a new stream cipher?”,

http://cr.yp.to/streamciphers/why.html