Generalized Hashing with Variable-Length Bit Strings
Michael Klipper
with Dan Blandford and Guy Blelloch
Original source:
D. Blandford and G. E. Blelloch. Storing Variable-Length Keys in Arrays, Sets, and Dictionaries, with Applications. In Symposium on Discrete Algorithms (SODA), 2005 (hopefully)
Hashing Techniques Currently Available
Many hashing algorithms are out there:
• Separate chaining
• Cuckoo hashing
• FKS perfect hashing
Many hash functions have also been designed, including several universal families.
These schemes give O(1) expected amortized time for updates, and many have O(1) worst-case time for searches.
They use Θ(n lg n) bits for n entries, since at least lg n bits are needed per entry just to distinguish between keys.
What Kind of Bounds Do We Achieve?
Let's say we store n entries of the form (s_i, t_i), for i = 0, 1, 2, ..., n-1, in our hashtable. Each s_i and t_i is a bit string of variable length. For our purposes, many of the t_i's might be only a few bits long.
Time for all operations (later slide): O(1) expected amortized.
Total space used: O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)) bits.
The Improvement We Attain
Let's say we store n entries taking up m total bits. In terms of the s_i and t_i values on the previous slide,
m = Σ_i (|s_i| + |t_i|).
Note that m = Ω(n lg n), since distinguishing n keys already requires about lg n bits per key.
Thus, our space usage is O(m - n lg n) bits, as opposed to the Θ(m) bits that standard hashtable structures use.
In particular, our structure is much more efficient than standard structures when m is close to n lg n (for example, when most entries are only a few bits long).
Goal: Generalized Dynamic Hashtables
We want to support the following operations:
• query(key, keyLength) - looks up the key in the hashtable and returns the associated data and its length
• insert(key, keyLength, data, dataLength) - adds (key, data) as an entry in the hashtable
• remove(key, keyLength) - removes the key and its associated data
NOTE: Each key has only one entry associated with it. Another name for this kind of structure is a variable-length dictionary structure.
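As a rough sketch of this interface, here is a hypothetical Python model backed by an ordinary dict. The class and method names are mine; it illustrates only the API semantics (a key is a bit string, i.e. an integer plus a length), not the compact space bounds of the real structure.

```python
class VarLenDict:
    """Toy model of a variable-length dictionary; NOT space-efficient."""

    def __init__(self):
        # (key bits as int, key length) -> (data bits as int, data length)
        self._table = {}

    def insert(self, key, keyLength, data, dataLength):
        self._table[(key, keyLength)] = (data, dataLength)

    def query(self, key, keyLength):
        # Returns (data, dataLength), or None if the key is absent.
        return self._table.get((key, keyLength))

    def remove(self, key, keyLength):
        self._table.pop((key, keyLength), None)
```

Note that the length is part of the key: the 3-bit string 101 and the 4-bit string 0101 are different keys even though both equal 5 as integers.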
Other Structures: Variable-Length Sets
• Also support query, insert, and remove, though there is no extra data associated with keys
• Can easily be implemented as a generalized hashtable that stores no extra data
• O(1) expected amortized time for all operations
• If the n keys are s_0, s_1, ..., s_{n-1}, then the total space used is O(Σ_i max(|s_i| - lg n, 1)) bits
Other Structures (cont.): Variable-Length Arrays
For n entries, the keys are 0, 1, ..., n-1. These arrays cannot resize their number of entries.
Operations:
• get(i) - returns the data stored at index i and its length
• set(i, val, len) - updates the data at index i to val of length len
Once again, O(1) expected amortized time for operations. Total space usage is O(Σ_i |t_i|) bits.
Implementation Note
Assume for now that we have the variable-length array structure described on the previous slide. We will use it to build generalized dynamic hashtables, which are more interesting than the arrays.
At the end of this presentation, I can talk about the implementation of variable-length arrays if time permits.
The Main Idea Behind How Our Hashtables Work
Our generalized hashtable structure contains a variable-length array with 2^q entries (which serve as the buckets for the hashtable). We keep 2^q approximately equal to n by occasionally rehashing the bucket contents.
The item (s_i, t_i), where s_i is the key and t_i is the data, is placed in a bucket as follows: we first hash s_i to some index (more on this later), and we write (s_i, t_i) into the bucket specified by that index. Note that when we hash s_i, we implicitly treat it as an integer.
Hashtables (cont.)
If several entries collide in a bucket, we throw them all into the bucket together as one giant concatenated bit string. Thus, we essentially use a separate-chaining algorithm.
To tell where one entry ends and another begins, we encode the entries with a prefix-free code (such as Huffman codes or gamma codes).
Sample bucket: s_1' t_1' s_2' t_2' s_3' t_3' (where s_i' is s_i encoded, etc.)
Time and Space Bounds
Note that we use prefix-free codes that cost only a constant factor more space (i.e. they encode m bits in O(m) space) and can be encoded/decoded in O(1) time.
Time: If we use a universal hash function to determine the bucket index, then each bucket receives only a constant expected number of elements, so it takes O(1) expected amortized time to find an element in a bucket. The prefix-free codes we use allow O(1) decoding of any element.
Space: The prefix-free codes increase the number of bits stored by at most a constant factor. If we have m bits total to store, our space bound for variable-length arrays says the buckets take up O(m) bits.
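Since gamma codes were mentioned as one option, here is a sketch of the Elias gamma code: a positive integer x is written as |bin(x)| - 1 zeros followed by the binary representation of x. The encoding is prefix-free by construction and uses at most about twice the bits of the raw binary form. (The decoder below scans bit by bit for clarity; the O(1) decoding claimed on the slide would use table lookup on machine words.)

```python
def gamma_encode(x: int) -> str:
    # Elias gamma code: (len(bin(x)) - 1) zeros, then bin(x).
    assert x >= 1
    b = bin(x)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode(bits: str, pos: int = 0):
    # Decode one gamma-coded integer starting at offset pos;
    # return (value, offset just past the codeword).
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + zeros + zeros + 1
    return int(bits[pos + zeros:end], 2), end
```

Because the code is prefix-free, a bucket's entries can be concatenated and then decoded one after another with no extra delimiters.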
There's a Bit More Than That...
Recall that the space bound for the hash table is O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)). Where does the lg n savings per entry come from?
We perform a technique called quotienting.
We actually use two hash functions, h' and h''. h'(s_i) is the bucket index, and h''(s_i) has length max(|s_i| - q, 1). (Recall that 2^q is approximately n.)
Instead of writing (s_i, t_i) in the bucket, we actually write (h''(s_i), t_i). This way, each entry needs |h''(s_i)| + |t_i| bits to write, which fulfills our space bound above.
A Quotienting Scheme
Let h_0 be a hash function from a universal family whose range is q bits. We describe a way to make a family of hash functions from the family from which h_0 is drawn.
Let s_i^t be the q most significant bits of s_i, and let s_i^b be the remaining bits.
We define our hash functions as follows:
h''(s_i) = s_i^b
h'(s_i) = h_0(s_i^b) xor s_i^t
Example (from the figure, with q = 6): s_i = 101101 001010100100101, so s_i^t = 101101 and s_i^b = 001010100100101 = h''(s_i). With h_0(s_i^b) = 010011, we get h'(s_i) = 010011 xor 101101 = 111110.
Undoing the Quotienting
In the previous example, we saw that h'(s_i) evaluated to 111110, or 62. This means we store h''(s_i) in bucket number 62!
Note that given h'(s_i) and h''(s_i) we can retrieve s_i, because
s_i^b = h''(s_i)
and s_i^t = h_0(h''(s_i)) xor h'(s_i).
The family of h’ functions we make is another universal family, so our time bound explained earlier still holds.
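The quotienting scheme above can be sketched as follows. Here Q = 6 and the particular h0 are toy choices of mine, standing in for a member drawn from a universal family; the sketch also assumes |s| > Q, skipping the max(|s| - q, 1) special case for short keys.

```python
Q = 6  # so there are 2^Q = 64 buckets; a toy value for illustration

def h0(bits: str) -> int:
    # Stand-in for a universal-family member with a Q-bit range.
    return (int(bits, 2) * 2654435761) % (1 << Q)

def quotient(s: str):
    # Split s into its Q most significant bits s_t and the rest s_b.
    # h''(s) = s_b is what gets stored; h'(s) = h0(s_b) xor s_t is the bucket.
    assert len(s) > Q  # short keys need the max(|s| - q, 1) special case
    s_t, s_b = s[:Q], s[Q:]
    return h0(s_b) ^ int(s_t, 2), s_b

def unquotient(h1: int, h2: str) -> str:
    # Recover s: s_b = h''(s) and s_t = h0(h''(s)) xor h'(s).
    s_t = h0(h2) ^ h1
    return format(s_t, "0{}b".format(Q)) + h2
```

The key point is that the bucket index plus the stored remainder together determine s exactly, which is why dropping roughly q = lg n bits per key loses no information.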
An Application of Hashtables: Graph Structures
One area where we will be able to use the hashtable structure is in storing graphs. Here, we describe a semidynamic directed-graph implementation. This means that the number of vertices is fixed, but edges can be added or deleted at runtime.
Let u and v be vertices of a graph. We want the following operations, compactly and in O(1) expected amortized time:
• deg(v) - get the degree of vertex v
• adjacent(u, v) - returns true iff u and v are adjacent
• firstEdge(v) - returns the first neighbor of v in G
• nextEdge(u, v) - returns the next neighbor of u after v (assumes u and v are adjacent)
• addEdge(u, v) - adds an edge from u to v in G
• deleteEdge(u, v) - deletes the edge (u, v) from G
Hashing Integers
Up to now, we have used bit strings as the main objects in the hashtable. It will also be useful to hash on integer values. Hence, we have created some utilities to convert between bit strings and integers using as few bits as possible, so an integer x takes basically lg |x| bits to write as a bit string.
A Graph Layout Where We Store Edges in a Hashtable
Let's say u is a vertex of degree d and v_1, ..., v_d are its neighbors, with v_0 = v_{d+1} = u by convention. Then the entry representing the edge (u, v_i) has key (u, v_i) and data (v_{i-1}, v_{i+1}).
Example: u has degree 4, with neighbors v_1, v_2, v_3, v_4. The hash table holds:
key (u, u) -> data (v_4, v_1, 4) (this extra entry "starts" the list; the 4 is the degree of vertex u)
key (u, v_1) -> data (u, v_2)
key (u, v_2) -> data (v_1, v_3)
key (u, v_3) -> data (v_2, v_4)
(The figure omits the entry for (u, v_4).)
Implementations of a Couple of Operations
For simplicity, I'm leaving off the length arguments in query() and insert().
adjacent(u, v):
• return (query((u, v)) != -1);
firstEdge(u):
• let (vp, vn, d) = query((u, u));
• return vn;
addEdge(u, v):
• let (vp, vn, d) = query((u, u));
• remove((u, u));
• insert((u, u), (vp, v, d + 1));
• insert((u, v), (u, vn));
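To make the layout concrete, here is a hypothetical Python model of these operations using a plain dict as the hashtable (all names are mine). Like the slide's addEdge, it leaves the prev pointer of the old first neighbor stale; a full implementation with deleteEdge would need to repair prev pointers.

```python
# Entry (u, u) holds (last, first, degree); entry (u, v) holds
# (prev, next) in u's neighbor list, with u itself as the sentinel.
def add_edge(T, u, v):
    if (u, u) not in T:
        T[(u, u)] = (u, u, 0)      # empty list: v_0 = v_{d+1} = u
    vp, vn, d = T[(u, u)]
    T[(u, u)] = (vp, v, d + 1)     # v becomes the first neighbor
    T[(u, v)] = (u, vn)            # prev = sentinel u, next = old first

def degree(T, u):
    return T[(u, u)][2] if (u, u) in T else 0

def adjacent(T, u, v):
    return (u, v) in T

def neighbors(T, u):
    # firstEdge, then nextEdge repeatedly until we come back to u.
    out, v = [], T[(u, u)][1]
    while v != u:
        out.append(v)
        v = T[(u, v)][1]
    return out
```

Newly added edges appear at the front of the list, so neighbors come out in reverse insertion order.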
Compression and Space Usage
Instead of ((u, v_i), (v_{i-1}, v_{i+1})) in the table, we will store ((u, v_i - u), (v_{i-1} - u, v_{i+1} - u)).
With this representation, we need O(Σ_{(u,v) in E} lg |u - v|) space.
A good labeling of the vertices will make many of these differences small. For instance, for many classes of graphs, such as planar graphs, the total space used is O(n) bits! The following paper has details:
D. Blandford, G. E. Blelloch, and I. Kash. Compact Representations of Separable Graphs. In SODA, 2003, pages 342-351.
More Details about Implementing Arrays
We’ll use the following data for our example in these slides:
t_0 = 10110   t_1 = 0110   t_2 = 11111   t_3 = 0101
t_4 = 1100   t_5 = 010   t_6 = 11011   t_7 = 00001111
We’ll assume that the word size is 2 bytes.
Key Idea: BLOCKS
Multiple data items can be crammed into a word, so let's take advantage of that. There are many possible ways to store data in blocks. The way I'll discuss here uses two words per block: one stores data and one marks the separation of entries.
Example: the block b_0 containing strings t_0 through t_2:
1st word (data):       1011001101111100 (t_0 t_1 t_2 concatenated; the last 2 bits are unused)
2nd word (separators): 1000010001000010 (a 1 at the start of each entry, plus one just past the last entry)
Blocks (continued)
We'll name a block b_i if i is the first entry number stored in that block. The size of a block is the sum of the sizes of the entries inside it.
We'll maintain a size invariant: for any adjacent blocks b_i and b_j, |b_i| + |b_j| is at least a full word. Note: splitting and merging blocks is easy.
We assume these things for now:
• Entries fit into a word... we can handle longer entries by storing a pointer to separate memory in their place
• Entries are nonempty
Organization of Blocks
We have a bit array A of length n (this is a regular old C array). A[i] = 1 if and only if string #i starts a block. This is our indexing structure.
We also have a standard hashtable H. If string #i starts a block, H(i) = the address of b_i. We assume H is computed in O(1) expected amortized time.
Blocks are large enough that storing them in H only increases the space usage by a constant factor.
Example:
A = 1 0 0 1 0 0 0 1 (1's at entries 0, 3, and 7)
H(0) -> b_0, holding t_0 t_1 t_2
H(3) -> b_3, holding t_3 t_4 t_5 t_6
H(7) -> b_7, holding t_7
In this example, b_0 and b_3 are adjacent blocks, as are b_3 and b_7.
A Note about Space Usage
Any two 1's in the indexing structure A are separated by at most one word's worth of positions. This is because entries are nonempty and a block holds only one word of entries, so a block contains at most w entries.
The get() Operation
Since the bits that are turned on in A are close together, we can find the block to which an entry belongs in O(1) time. One way to do this is table lookup.
If the ith entry is in block b_k, then the ith entry of the array is the (i - k + 1)st entry in that block.
By using table lookup, we can find the correct 1's in the second word, which tell us where the entry starts and ends.
A picture of the get() operation, illustrated with get(2):
A = 1 0 0 1 0 0 0 1. A[2] = 0, and the nearest 1 at or before it is A[0], so to find entry #2 we look in block b_0 via H(0).
b_0's data word is 1011001101111100 and its separator word is 1000010001000010. Entry #2 is the 3rd entry of b_0, so its start and end are marked by the 3rd and 4th 1's of the separator word.
Conclusion: Entry 2 is 5 bits long. It is 11111.
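The lookup inside a block can be sketched as follows; the function name is mine, and a real implementation would replace the list scan with table lookup on the separator word to get O(1) time.

```python
def get_entry(data: str, seps: str, j: int) -> str:
    # data: the block's data word (entries concatenated left to right).
    # seps: the separator word, with a 1 at each entry's start position
    #       plus one just past the last entry.
    # Returns the bits of entry j within this block.
    starts = [p for p, bit in enumerate(seps) if bit == "1"]
    return data[starts[j]:starts[j + 1]]
```

Running it on the words from the get(2) example picks out t_0, t_1, and t_2 in turn.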
How set() Works in a Nutshell
1) Find the block with the entry.
2) Rewrite it.
3) If the block is too large, split it in two.
4) Merge adjacent blocks together to preserve the size invariant.
Now, to Prove the Theorem about Space Usage for Arrays
Let m = Σ_i |t_i| and let w be the machine word size. I claim the total number of bits used is O(m).
Our size invariant guarantees that, on average, blocks are at least half full. Thus, there are O(m / w) blocks, since there are m bits of data in total and each block stores Θ(w) bits on average.
Our indexing structure A and hashtable H use O(w) bits per block (O(1) words). Total bits:
O(m / w) blocks * O(w) bits per block = O(m) bits.
A Note about Entries Longer Than w Bits
What our code really does with entries longer than w bits is not just allocating separate memory and putting a pointer in the array, though it's close.
We do essentially what standard structures do: we chain the words making up the entry into a linked list. We have a clever way to do this that doesn't need full w-bit pointers; instead we only need 7 or 8 bits per pointer.