Faster persistent data structures through hashing

Johan Tibell, [email protected], 2011-02-15


DESCRIPTION

This talk was given at Galois.

TRANSCRIPT

Page 1: Faster persistent data structures through hashing

Faster persistent data structures through hashing

Johan Tibell, [email protected]

2011-02-15

Page 2: Faster persistent data structures through hashing

Motivating problem: Twitter data analysis

“I’m computing a communication graph from Twitter data and then scan it daily to allocate social capital to nodes behaving in a good karmic manner. The graph is culled from 100 million tweets and has about 3 million nodes.”

We need a data structure that is

- fast when used with string keys, and

- doesn't use too much memory.

Page 3: Faster persistent data structures through hashing

Persistent maps in Haskell

- Data.Map is the most commonly used map type (a small usage example follows this list).

- It's implemented using size balanced trees.

- Keys can be of any type, as long as values of the type can be ordered.
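
For readers who haven't used it, a tiny Data.Map example with String keys (standard containers API; the names here are made up for illustration):

import qualified Data.Map as M

counts :: M.Map String Int
counts = M.insertWith (+) "tweet" 1 (M.fromList [("user", 2), ("tweet", 3)])

-- M.lookup "tweet" counts == Just 4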

Page 4: Faster persistent data structures through hashing

Real world performance of Data.Map

- Good in theory: no more than O(log n) comparisons.

- Not great in practice: up to O(log n) comparisons!

- Many common types are expensive to compare, e.g. String, ByteString, and Text.

- Given a string of length k, we need O(k ∗ log n) comparisons to look up an entry.

Page 5: Faster persistent data structures through hashing

Hash tables

- Hash tables perform well with string keys: O(k) amortized lookup time for strings of length k.

- However, we want persistent maps, not mutable hash tables.

Page 6: Faster persistent data structures through hashing

Milan Straka’s idea: Patricia trees as sparse arrays

- We can use hashing without using hash tables!

- A Patricia tree implements a persistent, sparse array.

- Patricia trees are twice as fast as size balanced trees, but only work with Int keys.

- Use hashing to derive an Int from an arbitrary key.

Page 7: Faster persistent data structures through hashing

Implementation tricks

- Patricia trees implement a sparse, persistent array of size 2^32 (or 2^64).

- Hashing using this many buckets makes collisions rare: for 2^24 entries we expect about 32,000 single collisions (see the quick estimate below).

- Linked lists are a perfectly adequate collision resolution strategy.
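
For reference, that figure matches the standard birthday-style estimate (my arithmetic, not from the slides): hashing n = 2^24 keys into m = 2^32 buckets gives an expected number of colliding pairs of roughly

n ∗ (n − 1) / (2 ∗ m) ≈ (2^24)^2 / (2 ∗ 2^32) = 2^48 / 2^33 = 2^15 = 32768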

Page 8: Faster persistent data structures through hashing

First attempt at an implementation

-- Defined in the containers package.
data IntMap a
  = Nil
  | Tip {-# UNPACK #-} !Key a
  | Bin {-# UNPACK #-} !Prefix
        {-# UNPACK #-} !Mask
        !(IntMap a) !(IntMap a)

type Prefix = Int
type Mask   = Int
type Key    = Int

newtype HashMap k v = HashMap (IntMap [(k, v)])
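
A minimal sketch (not the talk's code) of how lookup and insert work against this representation, assuming a hash function such as the one in the hashable package's Data.Hashable: hash the key to pick the IntMap bucket, then search or rebuild the small collision list using the real key.

import qualified Data.IntMap as IM
import Data.Hashable (Hashable, hash)   -- assumption: hashable package

lookup' :: (Eq k, Hashable k) => k -> HashMap k v -> Maybe v
lookup' k (HashMap m) = IM.lookup (hash k) m >>= lookup k

insert' :: (Eq k, Hashable k) => k -> v -> HashMap k v -> HashMap k v
insert' k v (HashMap m) =
  HashMap (IM.insertWith (\new old -> new ++ filter ((/= k) . fst) old)
                         (hash k) [(k, v)] m)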

Page 9: Faster persistent data structures through hashing

A more memory efficient implementation

data HashMap k v
  = Nil
  | Tip {-# UNPACK #-} !Hash
        {-# UNPACK #-} !(FL.FullList k v)
  | Bin {-# UNPACK #-} !Prefix
        {-# UNPACK #-} !Mask
        !(HashMap k v) !(HashMap k v)

type Prefix = Int
type Mask   = Int
type Hash   = Int

data FullList k v = FL !k !v !(List k v)
data List k v = Nil | Cons !k !v !(List k v)
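
A small sketch (assuming the FullList and List types above; the real code keeps them in their own module, hence the FL. prefix) of searching the collision list stored in a Tip once the hash has matched:

lookupFL :: Eq k => k -> FullList k v -> Maybe v
lookupFL k (FL k0 v0 rest)
  | k == k0   = Just v0
  | otherwise = go rest
  where
    go Nil = Nothing
    go (Cons k' v' r)
      | k == k'   = Just v'
      | otherwise = go r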

Page 10: Faster persistent data structures through hashing

Reducing the memory footprint

- List k v uses 2 fewer words per key/value pair than [(k, v)]: a strict Cons cell is one header plus three fields (4 words), while a (:) cell plus a (k, v) tuple costs two 3-word heap objects.

- FullList can be unpacked into the Tip constructor as it's a product type, saving 2 more words.

- Always unpack word sized types, like Int, unless you really need them to be lazy.

Page 11: Faster persistent data structures through hashing

Benchmarks

Keys: 2^12 random 8-byte ByteStrings

         Map   HashMap
insert   1.00  0.43
lookup   1.00  0.28

delete performs like insert.

Page 12: Faster persistent data structures through hashing

Can we do better?

- We still need to perform O(min(W, log n)) Int comparisons, where W is the number of bits in a word.

- The memory overhead per key/value pair is still quite high.

Page 13: Faster persistent data structures through hashing

Borrowing from our neighbours

- Clojure uses a hash-array mapped trie (HAMT) data structure to implement persistent maps.

- Described in the paper “Ideal Hash Trees” by Bagwell (2001).

- Originally a mutable data structure implemented in C++.

- Clojure's persistent version was created by Rich Hickey.

Page 14: Faster persistent data structures through hashing

Hash-array mapped tries in Clojure

- Shallow tree with high branching factor.

- Each node, except the leaf nodes, contains an array of up to 32 elements.

- 5 bits of the hash are used to index the array at each level.

- A clever trick, using bit population count, is used to represent sparse arrays (see the sketch below).
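
A minimal sketch of the bitmap/popcount indexing trick (my own illustration using Data.Bits; the talk's code differs): a node stores a 32-bit bitmap plus a dense array holding only the children that are present, and the population count of the bits below the target position gives the child's index into that dense array.

import Data.Bits (popCount, shiftR, (.&.), bit, testBit)
import Data.Word (Word32)

bitsPerLevel :: Int
bitsPerLevel = 5

-- Extract the 5-bit chunk of the hash used at a given tree level.
index :: Int -> Int -> Int
index h level = (h `shiftR` (level * bitsPerLevel)) .&. 0x1f

-- Position of the child for this chunk within the node's sparse array.
sparseIndex :: Word32 -> Int -> Maybe Int
sparseIndex bitmap pos
  | testBit bitmap pos = Just (popCount (bitmap .&. (bit pos - 1)))
  | otherwise          = Nothing   -- no child stored for this hash chunk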

Page 15: Faster persistent data structures through hashing

The Haskell definition of a HAMT

data HashMap k v
  = Empty
  | BitmapIndexed {-# UNPACK #-} !Bitmap
                  {-# UNPACK #-} !(Array (HashMap k v))
  | Leaf {-# UNPACK #-} !(Leaf k v)
  | Full {-# UNPACK #-} !(Array (HashMap k v))
  | Collision {-# UNPACK #-} !Hash
              {-# UNPACK #-} !(Array (Leaf k v))

type Bitmap = Word

data Array a = Array !(Array# a)
                     {-# UNPACK #-} !Int

Page 16: Faster persistent data structures through hashing

Making it fast

- The initial implementation by Edward Z. Yang: correct but didn't perform well.

- Improved performance by
  - replacing use of Data.Vector by a specialized array type,
  - paying careful attention to strictness, and
  - using GHC's new INLINABLE pragma (a small illustration follows below).
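
A small illustration of the INLINABLE idea (hypothetical helper, not the library's code): the pragma keeps the original definition in the module's interface file, so importing modules can specialise the function for a concrete key type (say, ByteString) and avoid passing Eq/Hashable dictionaries at run time.

import Data.Hashable (Hashable, hash)   -- assumption: hashable package

{-# INLINABLE lookupAssoc #-}
lookupAssoc :: (Eq k, Hashable k) => k -> [(Int, k, v)] -> Maybe v
lookupAssoc k = go
  where
    h = hash k
    go [] = Nothing
    go ((h', k', v) : rest)
      | h == h' && k == k' = Just v
      | otherwise          = go rest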

Page 17: Faster persistent data structures through hashing

Benchmarks

Keys: 2^12 random 8-byte ByteStrings

         Map   HashMap  HashMap (HAMT)
insert   1.00  0.43     1.21
lookup   1.00  0.28     0.21

Page 18: Faster persistent data structures through hashing

Where is all the time spent?

- Most time in insert is spent copying small arrays.

- Array copying is implemented using indexArray# and writeArray#, which results in poor performance.

- When cloning an array, we are forced to first fill the new array with dummy elements, then copy over the elements from the old array (sketched below).
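
A sketch of that clone-by-copying pattern (written with the primitive package's safe MutableArray API rather than raw primops, and not the library's own Array type): allocate a new array pre-filled with a dummy element, then copy the old elements across one by one.

import Control.Monad.ST (runST)
import Data.Primitive.Array   -- assumption: primitive package
  (Array, newArray, writeArray, indexArray, unsafeFreezeArray)

-- insertAt n i x arr: copy an n-element array into a new (n+1)-element
-- array with x inserted at index i.  The new array is first filled with a
-- dummy value (x is reused here), then the old elements are copied over.
insertAt :: Int -> Int -> a -> Array a -> Array a
insertAt n i x arr = runST $ do
  marr <- newArray (n + 1) x
  let copy j
        | j < i     = writeArray marr j       (indexArray arr j)
        | otherwise = writeArray marr (j + 1) (indexArray arr j)
  mapM_ copy [0 .. n - 1]
  unsafeFreezeArray marr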

Page 19: Faster persistent data structures through hashing

A better array copy

- Daniel Peebles has implemented a set of new primops for copying arrays in GHC.

- The first implementation showed a 20% performance improvement for insert.

- Copying arrays is still slow, so there might still be room for big improvements.

Page 20: Faster persistent data structures through hashing

Other possible performance improvements

- Even faster array copying using SSE instructions, inline memory allocation, and CMM inlining.

- Use the dedicated bit population count instruction on architectures where it's available.

- Clojure uses a clever trick to unpack keys and values directly into the arrays; keys are stored at even positions and values at odd positions.

- GHC 7.2 will use one less word for Array.

Page 21: Faster persistent data structures through hashing

Optimize common cases

- In many cases maps are created in one go from a sequence of key/value pairs.

- We can optimize for this case by repeatedly mutating the HAMT and freezing it when we're done (a sketch of the pattern follows below).

Keys: 2^12 random 8-byte ByteStrings

fromList/pure      1.00
fromList/mutating  0.50
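
The "build mutably, then freeze" idea, sketched on a plain array for brevity (not the HAMT code): perform in-place writes inside runST and freeze the result, so callers still only ever see an immutable value.

import Control.Monad.ST (runST)
import Data.Primitive.Array   -- assumption: primitive package
  (Array, newArray, writeArray, unsafeFreezeArray)

-- Build an n-element array from index/value pairs by mutation, then freeze.
fromAssocs :: Int -> a -> [(Int, a)] -> Array a
fromAssocs n def kvs = runST $ do
  marr <- newArray n def
  mapM_ (\(i, v) -> writeArray marr i v) kvs
  unsafeFreezeArray marr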

Page 22: Faster persistent data structures through hashing

Abstracting over collection types

- We will soon have two map types worth using (one ordered and one unordered).

- We want to write functions that work with both types, without an O(n) conversion cost.

- Use type families to abstract over the different concrete implementations (see the sketch below).
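
A sketch of what that could look like (a hypothetical class, not an existing library API): an associated type family names each implementation's key type, so the same code runs against the ordered or the unordered map without conversion.

{-# LANGUAGE TypeFamilies #-}

import qualified Data.Map as M
import Prelude hiding (lookup)

class MapLike m where
  type MapKey m
  lookup :: MapKey m -> m v -> Maybe v
  insert :: MapKey m -> v -> m v -> m v

instance Ord k => MapLike (M.Map k) where
  type MapKey (M.Map k) = k
  lookup = M.lookup
  insert = M.insert

-- Written once against the class, usable with any MapLike instance.
insertAll :: MapLike m => [(MapKey m, v)] -> m v -> m v
insertAll kvs m0 = foldr (\(k, v) acc -> insert k v acc) m0 kvs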

Page 23: Faster persistent data structures through hashing

Summary

- Hashing allows us to create more efficient data structures.

- There are interesting new data structures out there that have, or could have, an efficient persistent implementation.