Generalized Hashing with Variable-Length Bit Strings
Michael Klipper
with Dan Blandford and Guy Blelloch
Original source:
D. Blandford and G. E. Blelloch. Storing Variable-Length Keys in Arrays, Sets, and Dictionaries, with Applications. In Symposium on Discrete Algorithms (SODA), 2005 (hopefully)
Hashing Techniques Currently Available
Many hashing algorithms are out there:
• Separate chaining
• Cuckoo hashing
• FKS perfect hashing
Many hash functions have also been designed, including several universal families.
These schemes give O(1) expected amortized time for updates, and many have O(1) worst-case time for searches.
They use Θ(n lg n) bits for n entries, since at least lg n bits are needed per entry just to distinguish between keys.
What Kind of Bounds Do We Achieve?
Let's say we store n entries of the form (s_i, t_i), for i = 0, 1, 2, ..., n-1, in our hashtable. Each s_i and t_i is a bit string of variable length. For our purposes, many of the t_i's might be only a few bits long.
Time for all operations (later slide): O(1) expected amortized.
Total space used: O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)) bits.
The Improvement We Attain
Let's say we store n entries taking up m total bits. In terms of the s_i and t_i values on the previous slide,
m = Σ_i (|s_i| + |t_i|).
Note that m = Ω(n lg n), since distinguishing n keys already requires about lg n bits per key.
Thus, our space usage is O(m - n lg n) bits, as opposed to the Θ(m) bits that standard hashtable structures use.
In particular, our structure is much more efficient than standard structures when m is close to n lg n (for example, when most entries are only a few bits long).
Goal: Generalized Dynamic Hashtables
We want to support the following operations:
• query(key, keyLength) - looks up the key in the hashtable and returns the associated data and its length
• insert(key, keyLength, data, dataLength) - adds (key, data) as an entry in the hashtable
• remove(key, keyLength) - removes the key and its associated data
NOTE: Each key has only one entry associated with it. Another name for this kind of structure is a variable-length dictionary structure.
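As a rough sketch of this interface, here is a hypothetical Python model backed by an ordinary dict. The class and method names are mine; it illustrates only the API semantics (a key is a bit string, i.e. an integer plus a length), not the compact space bounds of the real structure.

```python
class VarLenDict:
    """Toy model of a variable-length dictionary; NOT space-efficient."""

    def __init__(self):
        # (key bits as int, key length) -> (data bits as int, data length)
        self._table = {}

    def insert(self, key, keyLength, data, dataLength):
        self._table[(key, keyLength)] = (data, dataLength)

    def query(self, key, keyLength):
        # Returns (data, dataLength), or None if the key is absent.
        return self._table.get((key, keyLength))

    def remove(self, key, keyLength):
        self._table.pop((key, keyLength), None)
```

Note that the length is part of the key: the 3-bit string 101 and the 4-bit string 0101 are different keys even though both equal 5 as integers.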
Other Structures: Variable-Length Sets
• Also support query, insert, and remove, though there is no extra data associated with keys
• Can easily be implemented as a generalized hashtable that stores no extra data
• O(1) expected amortized time for all operations
• If the n keys are s_0, s_1, ..., s_{n-1}, then the total space used is O(Σ_i max(|s_i| - lg n, 1)) bits
Other Structures (cont.): Variable-Length Arrays
For n entries, the keys are 0, 1, ..., n-1. These arrays cannot resize their number of entries.
Operations:
• get(i) - returns the data stored at index i and its length
• set(i, val, len) - updates the data at index i to val of length len
Once again, O(1) expected amortized time for operations. Total space usage is O(Σ_i |t_i|) bits.
Implementation Note
Assume for now that we have the variable-length array structure described on the previous slide. We will use it to build generalized dynamic hashtables, which are more interesting than the arrays.
At the end of this presentation, I can talk about the implementation of variable-length arrays if time permits.
The Main Idea Behind How Our Hashtables Work
Our generalized hashtable structure contains a variable-length array with 2^q entries (which serve as the buckets for the hashtable). We keep 2^q approximately equal to n by occasionally rehashing the bucket contents.
The item (s_i, t_i), where s_i is the key and t_i is the data, is placed in a bucket as follows: we first hash s_i to some index (more on this later), and we write (s_i, t_i) into the bucket specified by that index. Note that when we hash s_i, we implicitly treat it as an integer.
Hashtables (cont.)
If several entries collide in a bucket, we throw them all into the bucket together as one giant concatenated bit string. Thus, we essentially use a separate-chaining algorithm.
To tell where one entry ends and another begins, we encode the entries with a prefix-free code (such as Huffman codes or gamma codes).
Sample bucket: s_1' t_1' s_2' t_2' s_3' t_3' (where s_i' is s_i encoded, etc.)
Time and Space Bounds
Note that we use prefix-free codes that cost only a constant factor more space (i.e. they encode m bits in O(m) space) and can be encoded/decoded in O(1) time.
Time: If we use a universal hash function to determine the bucket index, then each bucket receives only a constant expected number of elements, so it takes O(1) expected amortized time to find an element in a bucket. The prefix-free codes we use allow O(1) decoding of any element.
Space: The prefix-free codes increase the number of bits stored by at most a constant factor. If we have m bits total to store, our space bound for variable-length arrays says the buckets take up O(m) bits.
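Since gamma codes were mentioned as one option, here is a sketch of the Elias gamma code: a positive integer x is written as |bin(x)| - 1 zeros followed by the binary representation of x. The encoding is prefix-free by construction and uses at most about twice the bits of the raw binary form. (The decoder below scans bit by bit for clarity; the O(1) decoding claimed on the slide would use table lookup on machine words.)

```python
def gamma_encode(x: int) -> str:
    # Elias gamma code: (len(bin(x)) - 1) zeros, then bin(x).
    assert x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str, pos: int = 0):
    # Decode one gamma-coded integer starting at offset pos;
    # return (value, offset just past the codeword).
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + zeros + zeros + 1
    return int(bits[pos + zeros:end], 2), end
```

Because the code is prefix-free, a bucket's entries can be concatenated and then decoded one after another with no extra delimiters.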
There's a Bit More Than That...
Recall that the space bound for the hash table is O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)). Where does the lg n savings per entry come from?
We perform a technique called quotienting.
We actually use two hash functions, h' and h''. h'(s_i) is the bucket index, and h''(s_i) has length max(|s_i| - q, 1). (Recall that 2^q is approximately n.)
Instead of writing (s_i, t_i) in the bucket, we actually write (h''(s_i), t_i). This way, each entry needs |h''(s_i)| + |t_i| bits to write, which fulfills our space bound above.
A Quotienting Scheme
Let h_0 be a hash function from a universal family whose range is q bits. We describe a way to make a family of hash functions from the family from which h_0 is drawn.
Let s_i^t be the q most significant bits of s_i, and let s_i^b be the remaining bits.
We define our hash functions as follows:
h''(s_i) = s_i^b
h'(s_i) = h_0(s_i^b) xor s_i^t
Example (from the figure, with q = 6): s_i = 101101 001010100100101, so s_i^t = 101101 and s_i^b = 001010100100101 = h''(s_i). With h_0(s_i^b) = 010011, we get h'(s_i) = 010011 xor 101101 = 111110.
Undoing the Quotienting
In the previous example, we saw that h'(s_i) evaluated to 111110, or 62. This means we store h''(s_i) in bucket number 62!
Note that given h'(s_i) and h''(s_i) we can retrieve s_i, because
s_i^b = h''(s_i)
and s_i^t = h_0(h''(s_i)) xor h'(s_i).
The family of h’ functions we make is another universal family, so our time bound explained earlier still holds.
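The quotienting scheme above can be sketched as follows. Here Q = 6 and the particular h0 are toy choices of mine, standing in for a member drawn from a universal family; the sketch also assumes |s| > Q, skipping the max(|s| - q, 1) special case for short keys.

```python
Q = 6  # so there are 2^Q = 64 buckets; a toy value for illustration

def h0(bits: str) -> int:
    # Stand-in for a universal-family member with a Q-bit range.
    return (int(bits, 2) * 2654435761) % (1 << Q)

def quotient(s: str):
    # Split s into its Q most significant bits s_t and the rest s_b.
    # h''(s) = s_b is what gets stored; h'(s) = h0(s_b) xor s_t is the bucket.
    assert len(s) > Q  # short keys need the max(|s| - q, 1) special case
    s_t, s_b = s[:Q], s[Q:]
    return h0(s_b) ^ int(s_t, 2), s_b

def unquotient(h1: int, h2: str) -> str:
    # Recover s: s_b = h''(s) and s_t = h0(h''(s)) xor h'(s).
    s_t = h0(h2) ^ h1
    return format(s_t, "0{}b".format(Q)) + h2
```

The key point is that the bucket index plus the stored remainder together determine s exactly, which is why dropping roughly q = lg n bits per key loses no information.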
An Application of Hashtables: Graph Structures
One area where we will be able to use the hashtable structure is in storing graphs. Here, we describe a semidynamic directed-graph implementation. This means that the number of vertices is fixed, but edges can be added or deleted at runtime.
Let u and v be vertices of a graph. We want the following operations, compactly and in O(1) expected amortized time:
• deg(v) - get the degree of vertex v
• adjacent(u, v) - returns true iff u and v are adjacent
• firstEdge(v) - returns the first neighbor of v in G
• nextEdge(u, v) - returns the next neighbor of u after v (assumes u and v are adjacent)
• addEdge(u, v) - adds an edge from u to v in G
• deleteEdge(u, v) - deletes the edge (u, v) from G
Hashing Integers
Up to now, we have used bit strings as the main objects in the hashtable. It will also be useful to hash on integer values. Hence, we have created some utilities to convert between bit strings and integers using as few bits as possible, so an integer x takes basically lg |x| bits to write as a bit string.
A Graph Layout Where We Store Edges in a Hashtable
Let's say u is a vertex of degree d and v_1, ..., v_d are its neighbors, with v_0 = v_{d+1} = u by convention. Then the entry representing the edge (u, v_i) has key (u, v_i) and data (v_{i-1}, v_{i+1}).
Example: u has degree 4, with neighbors v_1, v_2, v_3, v_4. The hash table holds:
key (u, u) -> data (v_4, v_1, 4) (this extra entry "starts" the list; the 4 is the degree of vertex u)
key (u, v_1) -> data (u, v_2)
key (u, v_2) -> data (v_1, v_3)
key (u, v_3) -> data (v_2, v_4)
(The figure omits the entry for (u, v_4).)
Implementations of a Couple of Operations
For simplicity, I'm leaving off the length arguments in query() and insert().
adjacent(u, v):
• return (query((u, v)) != -1);
firstEdge(u):
• let (vp, vn, d) = query((u, u));
• return vn;
addEdge(u, v):
• let (vp, vn, d) = query((u, u));
• remove((u, u));
• insert((u, u), (vp, v, d + 1));
• insert((u, v), (u, vn));
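To make the layout concrete, here is a hypothetical Python model of these operations using a plain dict as the hashtable (all names are mine). Like the slide's addEdge, it leaves the prev pointer of the old first neighbor stale; a full implementation with deleteEdge would need to repair prev pointers.

```python
# Entry (u, u) holds (last, first, degree); entry (u, v) holds
# (prev, next) in u's neighbor list, with u itself as the sentinel.
def add_edge(T, u, v):
    if (u, u) not in T:
        T[(u, u)] = (u, u, 0)      # empty list: v_0 = v_{d+1} = u
    vp, vn, d = T[(u, u)]
    T[(u, u)] = (vp, v, d + 1)     # v becomes the first neighbor
    T[(u, v)] = (u, vn)            # prev = sentinel u, next = old first

def degree(T, u):
    return T[(u, u)][2] if (u, u) in T else 0

def adjacent(T, u, v):
    return (u, v) in T

def neighbors(T, u):
    # firstEdge, then nextEdge repeatedly until we come back to u.
    out, v = [], T[(u, u)][1]
    while v != u:
        out.append(v)
        v = T[(u, v)][1]
    return out
```

Newly added edges appear at the front of the list, so neighbors come out in reverse insertion order.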
Compression and Space Usage
Instead of ((u, v_i), (v_{i-1}, v_{i+1})) in the table, we will store ((u, v_i - u), (v_{i-1} - u, v_{i+1} - u)).
With this representation, we need O(Σ_{(u,v) in E} lg |u - v|) space.
A good labeling of the vertices will make many of these differences small. For instance, for many classes of graphs, such as planar graphs, the total space used is O(n) bits! The following paper has details:
D. Blandford, G. E. Blelloch, and I. Kash. Compact Representations of Separable Graphs. In SODA, 2003, pages 342-351.
More Details about Implementing Arrays
We’ll use the following data for our example in these slides:
t_0 = 10110   t_1 = 0110   t_2 = 11111   t_3 = 0101
t_4 = 1100   t_5 = 010   t_6 = 11011   t_7 = 00001111
We’ll assume that the word size is 2 bytes.
Key Idea: BLOCKS
Multiple data items can be crammed into a word, so let's take advantage of that. There are many possible ways to store data in blocks. The way I'll discuss here uses two words per block: one stores data and one marks the separation of entries.
Example: the block b_0 containing strings t_0 through t_2:
1st word (data):       1011001101111100 (t_0 t_1 t_2 concatenated; the last 2 bits are unused)
2nd word (separators): 1000010001000010 (a 1 at the start of each entry, plus one just past the last entry)
Blocks (continued)
We'll name a block b_i if i is the first entry number stored in that block. The size of a block is the sum of the sizes of the entries inside it.
We'll maintain a size invariant: for any adjacent blocks b_i and b_j, |b_i| + |b_j| is at least a full word. Note: splitting and merging blocks is easy.
We assume these things for now:
• Entries fit into a word... we can handle longer entries by storing a pointer to separate memory in their place
• Entries are nonempty
Organization of Blocks
We have a bit array A of length n (this is a regular old C array). A[i] = 1 if and only if string #i starts a block. This is our indexing structure.
We also have a standard hashtable H. If string #i starts a block, H(i) = the address of b_i. We assume H is computed in O(1) expected amortized time.
Blocks are large enough that storing them in H only increases the space usage by a constant factor.
Example:
A = 1 0 0 1 0 0 0 1 (1's at entries 0, 3, and 7)
H(0) -> b_0, holding t_0 t_1 t_2
H(3) -> b_3, holding t_3 t_4 t_5 t_6
H(7) -> b_7, holding t_7
In this example, b_0 and b_3 are adjacent blocks, as are b_3 and b_7.
A Note about Space Usage
Any two 1's in the indexing structure A are separated by at most one word's worth of positions. This is because entries are nonempty and a block holds only one word of entries, so a block contains at most w entries.
The get() Operation
Since the bits that are turned on in A are close together, we can find the block to which an entry belongs in O(1) time. One way to do this is table lookup.
If the ith entry is in block b_k, then the ith entry of the array is the (i - k + 1)st entry in that block.
By using table lookup, we can find the correct 1's in the second word, which tell us where the entry starts and ends.
A picture of the get() operation, illustrated with get(2):
A = 1 0 0 1 0 0 0 1. A[2] = 0, and the nearest 1 at or before it is A[0], so to find entry #2 we look in block b_0 via H(0).
b_0's data word is 1011001101111100 and its separator word is 1000010001000010. Entry #2 is the 3rd entry of b_0, so its start and end are marked by the 3rd and 4th 1's of the separator word.
Conclusion: Entry 2 is 5 bits long. It is 11111.
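The lookup inside a block can be sketched as follows; the function name is mine, and a real implementation would replace the list scan with table lookup on the separator word to get O(1) time.

```python
def get_entry(data: str, seps: str, j: int) -> str:
    # data: the block's data word (entries concatenated left to right).
    # seps: the separator word, with a 1 at each entry's start position
    #       plus one just past the last entry.
    # Returns the bits of entry j within this block.
    starts = [p for p, bit in enumerate(seps) if bit == "1"]
    return data[starts[j]:starts[j + 1]]
```

Running it on the words from the get(2) example picks out t_0, t_1, and t_2 in turn.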
How set() Works in a Nutshell
1) Find the block with the entry.
2) Rewrite it.
3) If the block is too large, split it in two.
4) Merge adjacent blocks together to preserve the size invariant.
Now, to Prove the Theorem about Space Usage for Arrays
Let m = Σ_i |t_i| and let w be the machine word size. I claim the total number of bits used is O(m).
Our size invariant guarantees that, on average, blocks are at least half full. Thus, there are O(m / w) blocks, since there are m bits of data in total and each block stores Θ(w) bits on average.
Our indexing structure A and hashtable H use O(w) bits per block (O(1) words). Total bits:
O(m / w) blocks * O(w) bits per block = O(m) bits.
A Note about Entries Longer Than w Bits
What our code really does with entries longer than w bits is not just allocating separate memory and putting a pointer in the array, though it's close.
We do essentially what standard structures do: we chain the words making up the entry into a linked list. We have a clever way to do this that doesn't need full w-bit pointers; instead we only need 7 or 8 bits per pointer.