Parallel Hashing
John Erol Evangelista

TRANSCRIPT

Page 1

Parallel Hashing John Erol Evangelista

Page 2

Definition of Terms

• GPU. Graphics Processing Unit.
• Parallel Architecture. Architecture where calculations are done simultaneously.
• Serial Architecture. Architecture where calculations are done serially.
• Voxel. 3D analog of the pixel.
• Kernels. Programs that run on the GPU.

Page 3

Definition of Terms

• Threads. Smallest unit of processing.
• Latency. Time delay.
• Cache. Fast storage that holds copies of data to reduce access latency.
• Race condition. The output depends on the timing of events.

Page 4

GPU

• Graphics Processing Unit

• Its highly parallel architecture was recognized for its fast number-crunching abilities, giving rise to techniques for applying GPUs to non-graphical purposes.

Page 5

Data Structures

• Applications rely on data structures that can be both built and used efficiently in a parallel environment.
• Defining parallel-friendly data structures that can be efficiently created, updated, and accessed is a significant research challenge.

Page 6

Voxel

• 3D analog of the pixel.
• Number of expected occupied voxels: O(N²), since a surface cuts through only a thin shell of cells.
• Storing the full N³ grid is extremely wasteful, since most of the grid is empty; for example, a 128³ grid has roughly 2.1 million cells, while a surface occupies only on the order of 128² ≈ 16,000 of them.

Page 7

Hash Table

• Popular for these types of data (voxels), since they can be constructed to answer queries in O(1) memory accesses.

Page 8

Application

Figure 1.2. GPU hash tables are being constructed and queried every frame to perform Boolean intersections for these two animated models. Blue parts of one model represent voxels inside the other model, while green parts mark surface intersections. These images were produced using a 128³ voxel grid for point clouds of approximately 160k points. We achieve frame rates between 25–29 fps on a GTX 280, with the actual computation of the intersection and flood-fill requiring between 15–19 ms. Most of the time per frame is devoted to actual rendering of the meshes.

Page 9

Hash Tables

Figure 1.3. While allocating storage for the value of every possible key in an array allows directly indexing into the structure, it is wasteful when the array is mostly unused (top). A hash table can be used instead, which allocates far less space than the array (bottom). In this example, each slot holds both a key and its value. The table is indexed into using a hash function h(k). Because multiple keys may map to the same location, the key contained in the slot and the query key are compared on a retrieval to ensure the right value is returned.
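To make the retrieval in Figure 1.3 concrete, here is a minimal serial sketch. The packed 64-bit Entry layout and the names hash_function, SLOT_EMPTY, and NOT_FOUND are assumptions for illustration, chosen to mirror the listing that appears later in this deck.

typedef unsigned long long Entry;  // key in the high 32 bits, value in the low 32 bits

#define SLOT_EMPTY 0xffffffffffffffffULL
#define NOT_FOUND  0xffffffffu

unsigned hash_function(unsigned key);  // assumed: maps a key to a slot index

// Look up a key: index with h(k), then compare the stored key against the query key.
unsigned retrieve(unsigned key, const Entry *table, unsigned table_size) {
    unsigned index = hash_function(key) % table_size;
    Entry slot = table[index];
    if (slot == SLOT_EMPTY) return NOT_FOUND;
    // Multiple keys may map to this slot, so verify before returning the value.
    if ((unsigned)(slot >> 32) == key) return (unsigned)slot;
    return NOT_FOUND;  // a real table would continue along its probe sequence
}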

Page 10

Hash Tables

• Need to be adapted to a parallel environment.
• Serialization.
• Memory accesses are slow.
• Many probes may be required.

Page 11

CUDA

• Stands for “Compute Unified Device Architecture.”
• Provides essential functionality for parallel applications, such as scattered writes to memory and atomic operations.

Page 12

CUDA C

• A high-level GPU programming language that extends C with extra constructs for dealing with the hardware.

Page 13

How it works

• Programs that run on the GPU are called kernels and typically consist of just a few small functions.
• Kernels are executed in parallel by threads, each performing the same instructions on different data.
• E.g., a program computing the hash function value of every input key, as sketched below.
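A minimal sketch of such a kernel, assuming one key per thread and a placeholder device-side hash_function (the deck's randomized g(k) appears later). The __global__ qualifier, the threadIdx/blockIdx built-ins, and the <<<blocks, threads>>> launch syntax are the CUDA C constructs mentioned above.

#include <cuda_runtime.h>

// Placeholder mixing function; stands in for the randomized g(k) shown later.
__device__ unsigned hash_function(unsigned key) {
    return key * 2654435761u;
}

// Each thread computes the hash function value of one input key.
__global__ void hash_keys(const unsigned *keys, unsigned *hashes, unsigned n) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hashes[i] = hash_function(keys[i]);
}

// Host-side launch: enough 256-thread blocks to cover all n keys.
// hash_keys<<<(n + 255) / 256, 256>>>(d_keys, d_hashes, n);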

Page 14

Limitation

• Copying data to and from the GPU is very expensive.
• Kernels do not have access to the host system’s memory.
• Solution: use data structures that can be built and used entirely in parallel, allowing data to stay on the GPU while it is being processed (see the sketch below).
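A sketch of that flow, reusing the hypothetical hash_keys kernel from the previous slide: the input is copied to the device once, every kernel then works on device-resident memory, and only the final results are copied back.

// Hypothetical host-side flow: one copy in, all work on the GPU, one copy out.
void hash_on_device(const unsigned *h_keys, unsigned *h_hashes, unsigned n) {
    unsigned *d_keys, *d_hashes;
    cudaMalloc(&d_keys, n * sizeof(unsigned));
    cudaMalloc(&d_hashes, n * sizeof(unsigned));

    // Single expensive host-to-device transfer.
    cudaMemcpy(d_keys, h_keys, n * sizeof(unsigned), cudaMemcpyHostToDevice);

    // Build and query entirely on the GPU; intermediate data never leaves it.
    hash_keys<<<(n + 255) / 256, 256>>>(d_keys, d_hashes, n);

    // Single device-to-host transfer of the results.
    cudaMemcpy(h_hashes, d_hashes, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    cudaFree(d_hashes);
}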

Page 15

How it works

• Threads are grouped into thread blocks of up to 512 threads, which are assigned to different streaming multiprocessors (SMs) for execution.
• Thread blocks are queued up for the SMs and fed in as earlier thread blocks finish.

Page 16

How it works

• Thread blocks can complete execution before others are even started, so there is no way to globally synchronize all the threads without finishing the kernel.
• Threads in the same block can locally synchronize using execution barriers, guaranteeing that they have all reached the same point before continuing.

Page 17

How it works

• Multiple thread blocks can be handled by an SM simultaneously, but there is a hard limit on the number of threads the SM can handle.

Page 18

How it works

• Each SM breaks its thread blocks into groups of 32 consecutive threads called warps.
• SMs manage when each of their warps will be executed on their SIMD cores, with each step running the same instruction in lockstep, even when a branch occurs.

Page 19

Types of memory

• low-latency shared memory

• high-latency global memory

Page 20

Low latency memory

• Used as a cache for global memory.
• Scratchpad for threads working in the same thread block (see the sketch below).
• Fast but small.
• Partitioned; does not persist between kernel launches.
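A minimal sketch of the scratchpad pattern, assuming 256-thread blocks and an input size that is a multiple of the block size: each thread stages one element into __shared__ memory, and the execution barrier __syncthreads() from the earlier slide guarantees the whole tile is written before any thread reads another thread's element.

// Each block reverses its 256-element tile in fast shared memory (illustrative).
__global__ void reverse_tiles(const unsigned *in, unsigned *out, unsigned n) {
    __shared__ unsigned tile[256];          // low-latency, per-block scratchpad

    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];   // stage one element per thread

    __syncthreads();                        // barrier: the whole tile is now written

    // Safe only because of the barrier above; assumes n is a multiple of 256.
    if (i < n) out[i] = tile[blockDim.x - 1 - threadIdx.x];
}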

Page 21

Global Memory

• Abundant and shared, but slow.
• To hide latency, SMs automatically context-switch to other warps while memory transactions are being performed.
• Reads up to 128-byte segments of memory with a single transaction.
• Memory requests of threads in a warp are coalesced together into fewer transactions.

Page 22

Atomic Operations

• Performed when race conditions are difficult or impossible to avoid.
• Perform a series of actions that cannot be interrupted.
• E.g., incrementing a counter, as sketched below.
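A minimal sketch of the counter example. Without the atomic, two threads could read the same old value and one increment would be lost; atomicAdd performs the read-modify-write as one uninterruptible step. The kernel and predicate are illustrative.

// Count how many inputs satisfy a predicate, all threads sharing one counter.
__global__ void count_matches(const unsigned *values, unsigned n,
                              unsigned *counter) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] % 2 == 0) {
        // A plain (*counter)++ here would race between threads.
        atomicAdd(counter, 1u);
    }
}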

Page 23

Fermi architecture

• Higher compute capability, more functionality.
• Efficient atomic operations and a cached memory hierarchy to further reduce latency when accessing global memory.

Page 24

Hashing on a GPU

• Open Addressing
• While open-addressing tables can be very fast for both construction and retrieval on a GPU, problems arise when trying to make a compact table: in the worst case, the whole table would have to be traversed to terminate a query.

Page 25

Hashing on a GPU

• Chaining
• The number of probes increases greatly as the number of slots shrinks.
• Linked lists are horribly inefficient on a GPU.

Page 26

Hashing on a GPU

• Collision-free hashing
• A larger space gives a constant probability of no collision.
• Increased construction time, and inherently sequential in some implementations.

Page 27

Hashing on a GPU

• Multiple-choice Hashing
• Each item hashes to several candidate locations; choose the one that has the lowest occupancy.
• Cuckoo Hashing
• A variation of open addressing that limits the slots an item can fall into (see the sketch below).
• Uses multiple hash functions.
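A minimal serial sketch of cuckoo insertion with two hypothetical hash functions h1 and h2: a key may only sit in one of its two candidate slots, and inserting it evicts any current occupant, which is then pushed to its own alternate slot.

#include <stdbool.h>

#define EMPTY 0xffffffffu
#define MAX_EVICTIONS 100   // illustrative bound before giving up

unsigned h1(unsigned key, unsigned size);   // two independent hash functions,
unsigned h2(unsigned key, unsigned size);   // assumed for illustration

// Returns true on success; false means the table should be rebuilt
// with freshly generated hash functions.
bool cuckoo_insert(unsigned key, unsigned *table, unsigned size) {
    unsigned index = h1(key, size);
    for (int attempt = 0; attempt < MAX_EVICTIONS; ++attempt) {
        unsigned evicted = table[index];    // claim the slot,
        table[index] = key;                 // evicting the occupant (if any)
        if (evicted == EMPTY) return true;

        // Reinsert the evicted key into its other candidate slot.
        key = evicted;
        index = (h1(key, size) == index) ? h2(key, size) : h1(key, size);
    }
    return false;
}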

Page 28

Performance Metrics

• Construction time

• Retrieval efficiency

• Memory usage

Page 29

Open Addressing

• A race condition may occur when multiple threads attempt to insert an item into the same location simultaneously.

Page 30

Open Addressing

Figure 3.1. Examples of linear probing (left) and quadratic probing (right).

Page 31

Open Addressing

• The parallel construction assigns each input item to a thread, then has each thread simultaneously probe the hash table for empty slots.
• Atomic check-and-set operations force serialization of access to the table.

Page 32

Parameters

• Number of slots: ST ≥ N, where ST is the number of slots and N is the number of items in the input. Typically ST ≈ 1.25N.
• Probe sequence:

Probing scheme      Hash function
Linear probing      h(k) = g(k) + iteration
Quadratic probing   h(k) = g(k) + c₀ · iteration + c₁ · iteration²
Double hashing      h(k) = g(k) + jump(k) · iteration

Table 3.1. Open addressing hashing schemes.

Page 33

Parameters

• Maximum allowed length of the probe sequence: used to terminate a probe sequence that is taking too much time. Set to min(N, 10000).

Page 34

Hash Function

• Perfect hash function: benefits are minimal, since the hash tables can be constructed in a way that effectively limits the number of probes required to find an item to just one or two.
• Simple randomized hash functions work well in practice.

Page 35

Hash Function

• g(k) = ((f(a, k) + b) mod p) mod ST
• where a and b are randomly generated constants, p is a prime number, and ST is the number of slots available in the hash table (sketched below).
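A device-side sketch of this family, taking f(a, k) = a · k as a simple concrete choice; the prime 4294967291 (the largest 32-bit prime) and the 64-bit intermediate are assumptions for illustration. Regenerating a and b on a failed construction attempt gives the “new hash function” of Algorithm 3.1 below.

// Illustrative randomized hash: g(k) = ((a*k + b) mod p) mod ST.
__device__ unsigned g(unsigned key, unsigned a, unsigned b,
                      unsigned table_size) {
    const unsigned long long p = 4294967291ull;          // large prime (assumed)
    unsigned long long f = (unsigned long long)a * key;  // f(a, k) = a * k
    return (unsigned)(((f + b) % p) % table_size);
}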

Page 36

Implementation

Algorithm 3.1 Process for creating an open addressing hash table.

1: allocate enough memory for table[], which will contain ST 64-bit slots

2: repeat

3: fill each slot with ∅

4: generate a new hash function for the current attempt

5: for all key-value pairs (k, v) in the input do

6: repeat

7: atomically check-and-set table [location]

8: advance location to next location in probe sequence

9: until ∅ is found or max probes hit

10: end for

11: until hash table is built

Listing 3.1. Parallel insertion of items into an open addressing table.

Page 37

__device__ bool insert_entry(const unsigned key,
                             const unsigned value,
                             const unsigned table_size,
                             Entry *table) {
    // Manage the key and its value as a single 64-bit entry.
    Entry entry = ((Entry)key << 32) + value;

    // Figure out where the item needs to be hashed into.
    unsigned index = hash_function(key);
    unsigned double_hash_jump = jump_function(key) + 1;

    // Keep trying to insert the entry into the hash table
    // until an empty slot is found.
    Entry old_entry;

    for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
        // Move the index so that it points somewhere within the table.
        index %= table_size;

        // Atomically check the slot and insert the key if empty.
        old_entry = atomicCAS(table + index, SLOT_EMPTY, entry);

        // If the slot was empty, the item was inserted safely.
        if (old_entry == SLOT_EMPTY) return true;

        // Move the insertion index.
        if (method == LINEAR) index += 1;
        else if (method == QUADRATIC) index += attempt * attempt;
        else index += attempt * double_hash_jump;
    }

    return false;
}

Page 38

Parallel Retrieval

• Follows the same search pattern as construction: each query key is assigned to a thread, which probes until it finds the key, hits an empty slot, or reaches the probe limit (see the sketch below).
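A sketch of the matching retrieval routine, mirroring Listing 3.1 under the same assumptions (packed 64-bit entries, the same probing constants, and the NOT_FOUND sentinel from the earlier sketch); hitting an empty slot proves the key was never inserted, so the probe can stop early.

// Hypothetical companion to insert_entry: same probe sequence, but each
// slot's key is compared against the query instead of being claimed.
__device__ unsigned retrieve_entry(const unsigned key,
                                   const unsigned table_size,
                                   const Entry *table) {
    unsigned index = hash_function(key);
    unsigned double_hash_jump = jump_function(key) + 1;

    for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
        index %= table_size;
        Entry entry = table[index];

        // An empty slot means the key was never inserted.
        if (entry == SLOT_EMPTY) return NOT_FOUND;

        // Match: the value lives in the low 32 bits of the entry.
        if ((unsigned)(entry >> 32) == key) return (unsigned)entry;

        // Advance exactly as the construction did.
        if (method == LINEAR) index += 1;
        else if (method == QUADRATIC) index += attempt * attempt;
        else index += attempt * double_hash_jump;
    }
    return NOT_FOUND;   // probe limit hit
}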

Page 39

Construction Rates

Figure 3.2. Effect of input size on construction and retrieval rates for tables containing 1.25N slots on both the GTX 280 (top) and 470 (bottom).

Page 40

Memory Usage

Figure 3.3. Effect of the table size on construction and retrieval rates for tables containing 10 million items.

Page 41

Limitations

• Performance drops significantly for compact tables.
• High variability in probe sequence length.
• Removing items from the table is difficult.

Page 42

Sources

• Alcantara, D. Efficient Hash Tables on the GPU.