Faster persistent data structures through hashing

Johan Tibell, [email protected], 2011-02-15


DESCRIPTION

This talk was given at Galois.

TRANSCRIPT

Page 1: Faster persistent data structures through hashing

Faster persistent data structures through hashing

Johan Tibell, [email protected]

2011-02-15

Page 2: Faster persistent data structures through hashing

Motivating problem: Twitter data analysis

“I’m computing a communication graph from Twitter data and then scan it daily to allocate social capital to nodes behaving in a good karmic manner. The graph is culled from 100 million tweets and has about 3 million nodes.”

We need a data structure that is

- fast when used with string keys, and

- doesn't use too much memory.

Page 3: Faster persistent data structures through hashing

Persistent maps in Haskell

- Data.Map is the most commonly used map type (a small usage example follows this list).

- It's implemented using size balanced trees.

- Keys can be of any type, as long as values of the type can be ordered.
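
For readers who haven't used it, a tiny Data.Map example with String keys (standard containers API; the names here are made up for illustration):

import qualified Data.Map as M

counts :: M.Map String Int
counts = M.insertWith (+) "tweet" 1 (M.fromList [("user", 2), ("tweet", 3)])

-- M.lookup "tweet" counts == Just 4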

Page 4: Faster persistent data structures through hashing

Real world performance of Data.Map

- Good in theory: no more than O(log n) comparisons.

- Not great in practice: up to O(log n) comparisons!

- Many common types are expensive to compare, e.g. String, ByteString, and Text.

- Given a string of length k, we need O(k ∗ log n) comparisons to look up an entry.

Page 5: Faster persistent data structures through hashing

Hash tables

- Hash tables perform well with string keys: O(k) amortized lookup time for strings of length k.

- However, we want persistent maps, not mutable hash tables.

Page 6: Faster persistent data structures through hashing

Milan Straka’s idea: Patricia trees as sparse arrays

- We can use hashing without using hash tables!

- A Patricia tree implements a persistent, sparse array.

- Patricia trees are twice as fast as size balanced trees, but only work with Int keys.

- Use hashing to derive an Int from an arbitrary key.

Page 7: Faster persistent data structures through hashing

Implementation tricks

- Patricia trees implement a sparse, persistent array of size 2^32 (or 2^64).

- Hashing using this many buckets makes collisions rare: for 2^24 entries we expect about 32,000 single collisions (see the quick estimate below).

- Linked lists are a perfectly adequate collision resolution strategy.
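
For reference, that figure matches the standard birthday-style estimate (my arithmetic, not from the slides): hashing n = 2^24 keys into m = 2^32 buckets gives an expected number of colliding pairs of roughly

n ∗ (n − 1) / (2 ∗ m) ≈ (2^24)^2 / (2 ∗ 2^32) = 2^48 / 2^33 = 2^15 = 32768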

Page 8: Faster persistent data structures through hashing

First attempt at an implementation

-- Defined in the containers package.
data IntMap a
  = Nil
  | Tip {-# UNPACK #-} !Key a
  | Bin {-# UNPACK #-} !Prefix
        {-# UNPACK #-} !Mask
        !(IntMap a) !(IntMap a)

type Prefix = Int
type Mask   = Int
type Key    = Int

newtype HashMap k v = HashMap (IntMap [(k, v)])
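
A minimal sketch (not the talk's code) of how lookup and insert work against this representation, assuming a hash function such as the one in the hashable package's Data.Hashable: hash the key to pick the IntMap bucket, then search or rebuild the small collision list using the real key.

import qualified Data.IntMap as IM
import Data.Hashable (Hashable, hash)   -- assumption: hashable package

lookup' :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
lookup' k (HashMap m) = IM.lookup (hash k) m >>= lookup k

insert' :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
insert' k v (HashMap m) =
  HashMap (IM.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                         (hash k) [(k, v)] m)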

Page 9: Faster persistent data structures through hashing

A more memory efficient implementation

data HashMap k v
  = Nil
  | Tip {-# UNPACK #-} !Hash
        {-# UNPACK #-} !(FL.FullList k v)
  | Bin {-# UNPACK #-} !Prefix
        {-# UNPACK #-} !Mask
        !(HashMap k v) !(HashMap k v)

type Prefix = Int
type Mask   = Int
type Hash   = Int

data FullList k v = FL !k !v !(List k v)
data List k v = Nil | Cons !k !v !(List k v)
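
A small sketch (assuming the FullList and List types above; the real code keeps them in their own module, hence the FL. prefix) of searching the collision list stored in a Tip once the hash has matched:

lookupFL :: Eq k => k -> FullList k v -> Maybe v
lookupFL k (FL k0 v0 rest)
  | k == k0   = Just v0
  | otherwise = go rest
  where
    go Nil = Nothing
    go (Cons k' v' r)
      | k == k'   = Just v'
      | otherwise = go r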

Page 10: Faster persistent data structures through hashing

Reducing the memory footprint

- List k v uses 2 fewer words per key/value pair than [(k, v)]: a strict Cons cell is one header plus three fields (4 words), while a (:) cell plus a (k, v) tuple costs two 3-word heap objects.

- FullList can be unpacked into the Tip constructor as it's a product type, saving 2 more words.

- Always unpack word sized types, like Int, unless you really need them to be lazy.

Page 11: Faster persistent data structures through hashing

Benchmarks

Keys: 2^12 random 8-byte ByteStrings

         Map   HashMap
insert   1.00  0.43
lookup   1.00  0.28

delete performs like insert.

Page 12: Faster persistent data structures through hashing

Can we do better?

- We still need to perform O(min(W, log n)) Int comparisons, where W is the number of bits in a word.

- The memory overhead per key/value pair is still quite high.

Page 13: Faster persistent data structures through hashing

Borrowing from our neighbours

- Clojure uses a hash-array mapped trie (HAMT) data structure to implement persistent maps.

- Described in the paper “Ideal Hash Trees” by Bagwell (2001).

- Originally a mutable data structure implemented in C++.

- Clojure's persistent version was created by Rich Hickey.

Page 14: Faster persistent data structures through hashing

Hash-array mapped tries in Clojure

- Shallow tree with high branching factor.

- Each node, except the leaf nodes, contains an array of up to 32 elements.

- 5 bits of the hash are used to index the array at each level.

- A clever trick, using bit population count, is used to represent sparse arrays (see the sketch below).
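
A minimal sketch of the bitmap/popcount indexing trick (my own illustration using Data.Bits; the talk's code differs): a node stores a 32-bit bitmap plus a dense array holding only the children that are present, and the population count of the bits below the target position gives the child's index into that dense array.

import Data.Bits (popCount, shiftR, (.&.), bit, testBit)
import Data.Word (Word32)

bitsPerLevel :: Int
bitsPerLevel = 5

-- Extract the 5-bit chunk of the hash used at a given tree level.
index :: Int -> Int -> Int
index h level = (h `shiftR` (level * bitsPerLevel)) .&. 0x1f

-- Position of the child for this chunk within the node's sparse array.
sparseIndex :: Word32 -> Int -> Maybe Int
sparseIndex bitmap pos
  | testBit bitmap pos = Just (popCount (bitmap .&. (bit pos - 1)))
  | otherwise          = Nothing   -- no child stored for this hash chunk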

Page 15: Faster persistent data structures through hashing

The Haskell definition of a HAMT

data HashMap k v
  = Empty
  | BitmapIndexed {-# UNPACK #-} !Bitmap
                  {-# UNPACK #-} !(Array (HashMap k v))
  | Leaf {-# UNPACK #-} !(Leaf k v)
  | Full {-# UNPACK #-} !(Array (HashMap k v))
  | Collision {-# UNPACK #-} !Hash
              {-# UNPACK #-} !(Array (Leaf k v))

type Bitmap = Word

data Array a = Array !(Array# a)
                     {-# UNPACK #-} !Int

Page 16: Faster persistent data structures through hashing

Making it fast

- The initial implementation by Edward Z. Yang: correct but didn't perform well.

- Improved performance by
  - replacing use of Data.Vector by a specialized array type,
  - paying careful attention to strictness, and
  - using GHC's new INLINABLE pragma (a small illustration follows below).
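
A small illustration of the INLINABLE idea (hypothetical helper, not the library's code): the pragma keeps the original definition in the module's interface file, so importing modules can specialise the function for a concrete key type (say, ByteString) and avoid passing Eq/Hashable dictionaries at run time.

import Data.Hashable (Hashable, hash)   -- assumption: hashable package

{-# INLINABLE lookupAssoc #-}
lookupAssoc :: (Eq k, Hashable k) => k -> [(Int, k, v)] -> Maybe v
lookupAssoc k = go
  where
    h = hash k
    go [] = Nothing
    go ((h', k', v) : rest)
      | h == h' && k == k' = Just v
      | otherwise          = go rest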

Page 17: Faster persistent data structures through hashing

Benchmarks

Keys: 2^12 random 8-byte ByteStrings

         Map   HashMap  HashMap (HAMT)
insert   1.00  0.43     1.21
lookup   1.00  0.28     0.21

Page 18: Faster persistent data structures through hashing

Where is all the time spent?

- Most time in insert is spent copying small arrays.

- Array copying is implemented using indexArray# and writeArray#, which results in poor performance.

- When cloning an array, we are forced to first fill the new array with dummy elements, then copy over the elements from the old array (sketched below).
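
A sketch of that clone-by-copying pattern (written with the primitive package's safe MutableArray API rather than raw primops, and not the library's own Array type): allocate a new array pre-filled with a dummy element, then copy the old elements across one by one.

import Control.Monad.ST (runST)
import Data.Primitive.Array   -- assumption: primitive package
  (Array, newArray, writeArray, indexArray, unsafeFreezeArray)

-- insertAt n i x arr: copy an n-element array into a new (n+1)-element
-- array with x inserted at index i.  The new array is first filled with a
-- dummy value (x is reused here), then the old elements are copied over.
insertAt :: Int -> Int -> a -> Array a -> Array a
insertAt n i x arr = runST $ do
  marr <- newArray (n + 1) x
  let copy j
        | j < i     = writeArray marr j       (indexArray arr j)
        | otherwise = writeArray marr (j + 1) (indexArray arr j)
  mapM_ copy [0 .. n - 1]
  unsafeFreezeArray marr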

Page 19: Faster persistent data structures through hashing

A better array copy

- Daniel Peebles has implemented a set of new primops for copying arrays in GHC.

- The first implementation showed a 20% performance improvement for insert.

- Copying arrays is still slow, so there might still be room for big improvements.

Page 20: Faster persistent data structures through hashing

Other possible performance improvements

- Even faster array copying using SSE instructions, inline memory allocation, and CMM inlining.

- Use the dedicated bit population count instruction on architectures where it's available.

- Clojure uses a clever trick to unpack keys and values directly into the arrays; keys are stored at even positions and values at odd positions.

- GHC 7.2 will use one less word for Array.

Page 21: Faster persistent data structures through hashing

Optimize common cases

- In many cases maps are created in one go from a sequence of key/value pairs.

- We can optimize for this case by repeatedly mutating the HAMT and freezing it when we're done (a sketch of the pattern follows below).

Keys: 2^12 random 8-byte ByteStrings

fromList/pure      1.00
fromList/mutating  0.50
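
The "build mutably, then freeze" idea, sketched on a plain array for brevity (not the HAMT code): perform in-place writes inside runST and freeze the result, so callers still only ever see an immutable value.

import Control.Monad.ST (runST)
import Data.Primitive.Array   -- assumption: primitive package
  (Array, newArray, writeArray, unsafeFreezeArray)

-- Build an n-element array from index/value pairs by mutation, then freeze.
fromAssocs :: Int -> a -> [(Int, a)] -> Array a
fromAssocs n def kvs = runST $ do
  marr <- newArray n def
  mapM_ (\(i, v) -> writeArray marr i v) kvs
  unsafeFreezeArray marr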

Page 22: Faster persistent data structures through hashing

Abstracting over collection types

- We will soon have two map types worth using (one ordered and one unordered).

- We want to write functions that work with both types, without an O(n) conversion cost.

- Use type families to abstract over the different concrete implementations (see the sketch below).
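
A sketch of what that could look like (a hypothetical class, not an existing library API): an associated type family names each implementation's key type, so the same code runs against the ordered or the unordered map without conversion.

{-# LANGUAGE TypeFamilies #-}

import qualified Data.Map as M
import Prelude hiding (lookup)

class MapLike m where
  type MapKey m
  lookup :: MapKey m -> m v -> Maybe v
  insert :: MapKey m -> v -> m v -> m v

instance Ord k => MapLike (M.Map k) where
  type MapKey (M.Map k) = k
  lookup = M.lookup
  insert = M.insert

-- Written once against the class, usable with any MapLike instance.
insertAll :: MapLike m => [(MapKey m, v)] -> m v -> m v
insertAll kvs m0 = foldr (\(k, v) acc -> insert k v acc) m0 kvs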

Page 23: Faster persistent data structures through hashing

Summary

- Hashing allows us to create more efficient data structures.

- There are interesting new data structures out there that have, or could have, an efficient persistent implementation.