

    Reconfigurable Accelerator for the

    Word-Matching Stage of BLASTN

    Abstract

    BLAST is one of the most popular sequence analysis tools used by molecular

    biologists. It is designed to efficiently find similar regions between two sequences that

    have biological significance. However, because the size of genomic databases is growing

    rapidly, the computation time of BLAST, when performing a complete genomic database

    search, is continuously increasing. Thus, there is a clear need to accelerate this process.

    In this paper, we present a new approach for genomic sequence database scanning

    utilizing reconfigurable field programmable gate array (FPGA)-based hardware. In order

    to derive an efficient structure for BLASTN, we propose a reconfigurable architecture to

    accelerate the computation of the word-matching stage. The experimental results show

that the FPGA implementation achieves a speedup of around one order of magnitude compared to the NCBI BLASTN software running on a general-purpose computer.

    INTRODUCTION

    Scanning genomic sequence databases is a common and often repeated task in molecular

    biology. The need for speeding up these searches comes from the rapid growth of these

    gene banks: every year their size is scaled by a factor of 1.5 to 2. The aim of a scan

    operation is to find similarities between the query sequence and a particular genome

    sequence, which might indicate similar functionality from a biological point of view.

    Dynamic programming-based alignment algorithms can guarantee to find all important

    similarities. However, as the search space is the product of the two sequences, which

    could be several billion bases in size, it is generally not feasible to use a direct

implementation. One frequently used approach to speed up this time-consuming operation is to use heuristics in the search algorithm. One of the most widely used

    sequence analysis tools to use heuristics is the basic local alignment search tool (BLAST)

[2]. Although BLAST's algorithms are highly optimized for similarity search, the ever

    growing databases outpace the speed improvements that BLAST can provide on a general


    purpose PC. BLASTN, a version of BLAST specifically designed for DNA sequence

    searches, consists of a three-stage pipeline.

Stage 1: Word-Matching detects seeds (short exact matches of a certain length between the query sequence and the subject sequence). The inputs to this stage are strings of DNA bases, which typically use the alphabet {A, C, G, T}.

    Stage 2: Ungapped Extension extends each seed in both directions allowing substitutions

    only and outputs the resulting high-scoring segment pairs (HSPs). An HSP [3] indicates

two sequence fragments of equal length whose alignment score meets or exceeds an empirically set threshold (or cutoff score).

    Stage 3: Gapped Extension uses the Smith-Waterman dynamic programming algorithm

    to extend the HSPs allowing insertions and deletions.
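To make the word-matching stage concrete, the sketch below (plain Python with invented names such as find_seeds; it is illustrative only, not the NCBI implementation) indexes every length-w word of the query in a dictionary and then streams the subject sequence, reporting each exact w-mer match as a seed. This dictionary membership test is the operation that the Bloom filter architecture discussed later is meant to accelerate.

# Minimal sketch of BLASTN stage 1 (word matching); illustrative only, not the NCBI code.
# Every length-w word of the query is indexed, then the subject sequence is streamed and
# each exact w-mer match is reported as a seed (query offset, subject offset).
def find_seeds(query, subject, w=11):
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

print(find_seeds("ACGTACGTACGTA", "TTACGTACGTACGG", w=8))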

    The basic idea underlying a BLASTN search is filtration. Although each stage in

    the BLASTN pipeline is becoming more sophisticated, the exponential increase in the

volume of data makes it important that measures are taken to reduce the amount of data

    that needs to be processed. Filtration discards irrelevant fractions as early as possible,

    thus reducing the overall computation time. Analysis of the various stages of the

BLASTN pipeline (see Table I) reveals that the word-matching stage is the most time-

    consuming part. Therefore, accelerating the computation of this stage will have the

greatest effect on the overall performance.

    EXISTING SYSTEM

    BASIC LOCAL ALIGNMENT SEARCH TOOL

    A new approach to rapid sequence comparison, basic local alignment search tool

    (BLAST), directly approximates alignments that optimize a measure of local similarity,

the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as

    the statistical significance of alignments it generates. The basic algorithm is simple and

    robust; it can be implemented in a number of ways and applied in a variety of contexts

    including straight-forward DNA and protein sequence database searches, motif searches,

    gene identification searches, and in the analysis of multiple regions of similarity in long


    DNA sequences. In addition to its flexibility and tractability to mathematical analysis,

    BLAST is an order of magnitude faster than existing sequence comparison tools of

    comparable sensitivity.

    A RECONFIGURABLE BLOOM FILTER ARCHITECTURE FOR BLASTN

    Efficient seed-based filtration methods exist for scanning genomic sequence

    databases. However, current solutions require a significant scan time on traditional

    computer architectures. These scan time requirements are likely to become even more

    severe due to the rapid growth in the size of databases. In this paper, we present a new

    approach to genomic sequence database scanning using reconfigurable field-

    programmable gate array (FPGA)-based hardware. To derive an efficient mapping onto

    this type of architecture, we propose a reconfigurable Bloom filter architecture. Our

    experimental results show that the FPGA implementation achieves an order of magnitude

    speedup compared to the NCBI BLASTN software running on a general purpose

    computer.

    EFFICIENT HARDWARE HASHING FUNCTIONS FOR HIGH

    PERFORMANCE COMPUTERS

    Hashing is critical for high performance computer architecture. Hashing is used

    extensively in hardware applications, such as page tables, for address translation. Bit

    extraction and exclusive ORing hashing methods are two commonly used hashing

    functions for hardware applications. There is no study of the performance of these

    functions and no mention anywhere of the practical performance of the hashing functions

    in comparison with the theoretical performance prediction of hashing schemes. In this

    paper, we show that, by choosing hashing functions at random from a particular class,

    called H3, of hashing functions, the analytical performance of hashing can be achieved in

    practice on real-life data. Our results about the expected worst case performance of

    hashing are of special significance, as they provide evidence for earlier theoretical

    predictions.

    AN APPROACH FOR MINIMAL PERFECT HASH


    FUNCTIONS FOR VERY LARGE DATABASES

    We propose a novel external memory based algorithm for constructing minimal

    perfect hash functions h for huge sets of keys. For a set of n keys, our algorithm outputs h

    in time O(n). The algorithm needs a small vector of one byte entries in main memory to

    construct h. The evaluation of h(x) requires three memory accesses for any key x. The

    description of h takes a constant number of up to 9 bits for each key, which is optimal

    and close to the theoretical lower bound, i.e., around 2 bits per key. In our experiments,

    we used a collection of 1 billion URLs collected from the web, each URL 64 characters

    long on average. For this collection, our algorithm (i) finds a minimal perfect hash

    function in approximately 3 hours using a commodity PC, (ii) needs just 5.45 megabytes

    of internal memory to generate h and (iii) takes 8.1 bits per key for the description of h.

    MERCURY BLAST DICTIONARIES: ANALYSIS AND PERFORMANCE

    MEASUREMENT

    This report describes a hashing scheme for a dictionary of short bit strings. The

    scheme, which we call near-perfect hashing, was designed as part of the construction of

    Mercury BLAST, an FPGA-based accelerator for the BLAST family of biosequence

    comparison algorithms.

    Near-perfect hashing is a heuristic variant of the well-known displacement

    hashing approach to building perfect hash functions. It uses a family of hash functions

    composed from linear transformations on bit vectors and lookups in small precomputed

    tables, both of which are especially appropriate for implementation in hardware logic. We

    show empirically that for inputs derived from genomic DNA sequences, our scheme

obtains a good tradeoff between the size of the hash table and the time required to compute

    it from a set of input strings, while generating few or no collisions between keys in the

    table.

    One of the building blocks of our scheme is the H_3 family of hash functions,

    which are linear transformations on bit vectors. We show that the uniformity of hashing

    performed with randomly chosen linear transformations depends critically on their rank,

    and that randomly chosen transformations have a high probability of having the

    maximum possible uniformity. A simple test is sufficient to ensure that a randomly


    chosen H_3 hash function will not cause an unexpectedly large number of collisions.

    Moreover, if two such functions are chosen independently at random, the second function

    is unlikely to hash together two keys that were hashed together by the first.

    Hashing schemes based on H_3 hash functions therefore tend to distribute their

    inputs more uniformly than would be expected under a simple uniform hashing model,

    and schemes using pairs of these functions are more uniform than would be assumed for

    a pair of independent hash functions.
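As a rough illustration of the H_3 idea (a sketch under the usual definition of H_3 as random linear transformations over GF(2); it is not the exact construction used in Mercury BLAST, and the helper name make_h3 is made up), the following Python code hashes a key by XOR-ing together one random row per set input bit:

import random

# Sketch of an H_3-style hash: a random linear transformation over GF(2) from
# key_bits-bit keys to index_bits-bit table indices.  Each set input bit selects
# one random row; the hash value is the XOR of the selected rows.
def make_h3(key_bits, index_bits, seed=0):
    rng = random.Random(seed)
    rows = [rng.getrandbits(index_bits) for _ in range(key_bits)]
    def h3(key):
        out = 0
        for b in range(key_bits):
            if (key >> b) & 1:
                out ^= rows[b]
        return out
    return h3

h = make_h3(key_bits=22, index_bits=16)   # e.g. an 11-base DNA word at 2 bits per base
print(h(0b1010011010101100110101))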

    PROPOSED SYSTEM

    In this paper, we propose a computationally efficient architecture to accelerate the

    data processing of the word-matching stage based on field programmable gate arrays

    (FPGA). FPGAs are suitable candidate platforms for high-performance computation due

    to their fine-grained parallelism and pipelining capabilities.

    BLOOM FILTERS

    Introduction

    Bloom filters [2] are compact data structures for probabilistic representation of a set in

order to support membership queries (i.e. queries that ask: Is element X in set Y?). This compact representation is the payoff for allowing a small rate of false positives in membership queries; that is, queries might incorrectly recognize an element as a member of the set.

We succinctly present Bloom filters' use to date in the next section. In Section 3 we describe Bloom filters in detail, and in Section 4 we give a hopefully precise picture of space/computing time/error rate tradeoffs.

    Usage

    Since their introduction in [2], Bloom filters have seen various uses:


Web cache sharing ([3]). Collaborating Web caches use Bloom filters (dubbed cache summaries) as compact representations for the local set of cached files. Each cache

    periodically broadcasts its summary to all other members of the distributed cache.

    Using all summaries received, a cache node has a (partially outdated, partially wrong)

    global image about the set of files stored in the aggregated cache. The Squid Web

    Proxy Cache [1] uses Cache Digests based on a similar idea.

Query filtering and routing ([4, 6, 7]). The Secure wide-area Discovery Service [6], a subsystem of the Ninja project [5], organizes service providers in a hierarchy. Bloom

    filters are used as summaries for the set of services offered by a node. Summaries are

    sent upwards in the hierarchy and aggregated. A query is a description for a specific

    service, also represented as a Bloom filter. Thus, when a member node of the hierarchy

    generates/receives a query, it has enough information at hand to decide where to forward

    the query: downward, to one of its descendants (if a solution to the query is present in the

    filter for the corresponding node), or upward, toward its parent (otherwise).

    The OceanStore [7] replica location service uses a two-tiered approach: first it initiates an

    inexpensive, probabilistic search (based on Bloom filters, similar to Ninja) to try and find

a replica. If this fails, the search falls back on an (expensive) deterministic algorithm (based on the Plaxton replica location algorithm). Alas, their description of the probabilistic search

    algorithm is laconic. (An unpublished text [11] from members of the same group gives

    some more details. But this does not seem to work well when resources are dynamic.)

Compact representation of a differential file ([9]). A differential file contains a batch of database records to be updated. For performance reasons the database is updated only periodically (e.g., at midnight) or when the differential file grows above a

    certain threshold. However, in order to preserve integrity, each reference/query to the

    database has to access the differential file to see if a particular record is scheduled to be

    updated. To speed-up this process, with little memory and computational overhead, the

    differential file is represented as a Bloom filter.

Free text searching ([10]). Basically, the set of words that appear in a text is succinctly represented using a Bloom filter.


    Constructing Bloom Filters

Consider a set A = {a1, a2, ..., an} of n elements. Bloom filters describe membership information of A using a bit vector V of length m. For this, k hash functions, h1, h2, ..., hk with hi: X -> {1..m}, are used as described below.

The following procedure builds an m-bit Bloom filter, corresponding to a set A and using the k hash functions h1, h2, ..., hk:

Procedure BloomFilter(set A, hash_functions, integer m)
returns filter
    filter = allocate m bits initialized to 0
    foreach ai in A:
        foreach hash function hj:
            filter[hj(ai)] = 1
        end foreach
    end foreach
    return filter

Therefore, if ai is a member of a set A, in the resulting Bloom filter V all bits corresponding to the hashed values of ai are set to 1. Testing for membership of an element elm is equivalent to testing that all corresponding bits of V are set:

Procedure MembershipTest(elm, filter, hash_functions)
returns yes/no
    foreach hash function hj:
        if filter[hj(elm)] != 1 return No
    end foreach
    return Yes

    Nice features: filters can be built incrementally: as new elements are added to a set the

    corresponding positions are computed through the hash functions and bits are set in the

filter. Moreover, the filter expressing the union of two sets is simply computed as the

    bit-wise OR applied over the two corresponding Bloom filters.
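The two procedures above, together with the union-by-OR property, translate almost directly into executable code. The following Python sketch is illustrative only; the helper make_hashes, which derives k salted hashes from SHA-256, is an assumption rather than part of the original description:

import hashlib

# Direct, illustrative translation of the BloomFilter / MembershipTest pseudocode.
def make_hashes(k, m):
    # k "different" hash functions derived by salting one SHA-256 hash (an assumption).
    def h(j, elem):
        return int(hashlib.sha256(f"{j}:{elem}".encode()).hexdigest(), 16) % m
    return [lambda e, j=j: h(j, e) for j in range(k)]

def bloom_build(elements, hashes, m):
    filt = [0] * m                       # allocate m bits initialized to 0
    for a in elements:
        for hj in hashes:
            filt[hj(a)] = 1
    return filt

def bloom_member(elem, filt, hashes):
    return all(filt[hj(elem)] == 1 for hj in hashes)

m, k = 1024, 4
hashes = make_hashes(k, m)
fa = bloom_build(["ACGT", "TTGA"], hashes, m)
fb = bloom_build(["GGCC"], hashes, m)
union = [x | y for x, y in zip(fa, fb)]  # union of the two sets = bit-wise OR
print(bloom_member("ACGT", union), bloom_member("AAAA", union))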

Bloom Filters: the Math (this follows the description in [3])

One prominent feature of Bloom filters is that there is a clear tradeoff between the size of the filter and the rate of false positives. Observe that after inserting n keys into a filter of size m using k hash functions, the probability that a particular bit is still 0 is:

p_0 = (1 - 1/m)^{kn} ≈ e^{-kn/m}.    (1)

    (Note that we assume perfect hash functions that spread the elements of A evenly

    throughout the space {1..m}. In practice, good results have been achieved using MD5

    and other hash functions [10].)

    Hence, the probability of a false positive (the probability that all k bits have been

    previously set) is:

p_err = (1 - p_0)^k = (1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k.    (2)

In (2), p_err is minimized for k = (m/n) ln 2 hash functions. In practice, however, only a small number of hash functions is used. The reason is that the computational overhead of each additional hash function is constant, while the incremental benefit of adding a new hash function decreases after a certain threshold (see Figure 1).


Figure 1: False positive rate as a function of the number of hash functions used. The size of the Bloom filter is 32 bits per entry (m/n = 32). In this case using 22 hash functions minimizes the false positive rate. Note however that adding a hash function does not significantly decrease the error rate when more than 10 hashes are already used.

Figure 2: Size of the Bloom filter (bits/entry) as a function of the desired error rate. Different lines represent different numbers of hash keys used. Note that, for the error rates considered, using 32 keys does not bring significant benefits over using only 8 keys.

[Plots for Figures 1 and 2 omitted: Figure 1 plots the false positive rate (log scale) against the number of hash functions (1 to 31); Figure 2 plots bits per entry against the error rate (log scale) for k = 2, 4, 8, 16, 32.]


Formula (2) is the base formula for engineering Bloom filters. It allows, for example, computing the minimal memory requirements (filter size) and number of hash functions given the maximum acceptable false positive rate and the number of elements in the set (as we detail in Figure 2).

m/n = -k / ln(1 - e^{ln(p_err)/k})    (bits per entry)    (3)

    To summarize: Bloom filters are compact data structures for probabilistic representation of a set

    in order to support membership queries. The main design tradeoffs are the number of hash

    functions used (driving the computational overhead), the size of the filter and the error (collision)

    rate. Formula (2) is the main formula to tune parameters according to application requirements.
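For concreteness, the following short Python sketch evaluates formulas (1)-(3) for an illustrative configuration; the function names and the example numbers are made up:

import math

# Evaluate formulas (1)-(3) for an illustrative configuration.
def p_zero(m, n, k):                     # (1): probability that a given bit is still 0
    return (1 - 1 / m) ** (k * n)        # approximately exp(-k * n / m)

def p_err(m, n, k):                      # (2): false positive probability
    return (1 - p_zero(m, n, k)) ** k

def bits_per_entry(p, k):                # (3): m/n needed for error rate p with k hashes
    return -k / math.log(1 - math.exp(math.log(p) / k))

m, n, k = 32 * 1000, 1000, 8             # 32 bits per entry, 8 hash functions
print(p_err(m, n, k))                    # on the order of 1e-5 or smaller
print(bits_per_entry(1e-4, k=8))         # roughly 21 bits per entry for a 0.01% error rate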

    Compressed Bloom filters

    Some applications that use Bloom filters need to communicate these filters across the network.

    In this case, besides the three performance metrics we have seen so far: (1) the computational

    overhead to lookup a value (related to the number of hash functions used), (2) the size of the

    filter in memory, and (3) the error rate, a fourth metric can be used: the size of the filter

    transmitted across the network. M. Mitzenmacher shows in [8] that compressing Bloom filters

    might lead to significant bandwidth savings at the cost of higher memory requirements (larger

    uncompressed filters) and some additional computation time to compress the filter that is sent

    across the network. We do not detail here all theoretical and practical issues analyzed in [8].

A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of

    a set. False positive matches are possible, but false negatives are not; i.e. a query returns either

    "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not

    removed (though this can be addressed with a "counting" filter). The more elements that are

    added to the set, the larger the probability of false positives.

    Bloom proposed the technique for applications where the amount of source data would

    require an impracticably large hash area in memory if "conventional" error-free hashing

    techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of


    500,000 words, of which 90% could be hyphenated by following simple rules but all the

    remaining 50,000 words required expensive disk access to retrieve their specific patterns. With

    unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk

    access. But if core memory was insufficient, a smaller hash area could be used to eliminate most

    of the unnecessary access. For example, a hash area only 15% of the error-free size would still

    eliminate 85% of the disk accesses (Bloom (1970)).

    More generally, fewer than 10 bits per element are required for a 1% false positive probability,

    independent of the size or number of elements in the set (Bonomi et al. (2006)).

    Algorithm description

[Figure: An example of a Bloom filter representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m = 18 and k = 3.]

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set; if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of


other elements, resulting in a false positive. In a simple Bloom filter, there is no way to

    distinguish between the two cases, but more advanced techniques can address this problem.

The requirement of designing k different independent hash functions can be prohibitive for large k. For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple "different" hash functions by slicing its output into multiple bit fields. Alternatively, one can pass k different initial values (such as 0, 1, ..., k - 1) to a hash function that takes an initial value, or add (or append) these values to the key. For larger m and/or k, independence among the hash functions can be relaxed with negligible increase in false positive rate (Dillinger & Manolios (2004a), Kirsch & Mitzenmacher (2006)). Specifically, Dillinger & Manolios (2004b) show the effectiveness of deriving the k indices using enhanced double hashing or triple hashing, variants of double hashing that are effectively simple random number generators seeded with the two or three hash values.
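The double-hashing idea can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the Kirsch & Mitzenmacher scheme, not the enhanced or triple variants from the cited papers; the helper name k_indices and the use of SHA-256 as the base hash are assumptions:

import hashlib

# Derive k array positions from two base hashes (simple double hashing):
# g_i(x) = (h1(x) + i * h2(x)) mod m.
def k_indices(key, k, m):
    digest = hashlib.sha256(key.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1   # keep the stride odd
    return [(h1 + i * h2) % m for i in range(k)]

print(k_indices("ACGTACGTACG", k=4, m=1024))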

    Removing an element from this simple Bloom filter is impossible because false negatives are not

permitted. An element maps to k bits, and although setting any one of those k bits to zero suffices

    to remove the element, it also results in removing any other elements that happen to map onto

    that bit. Since there is no way of determining whether any other elements have been added that

    affect the bits for an element to be removed, clearing any of the bits would introduce the

    possibility for false negatives.

    One-time removal of an element from a Bloom filter can be simulated by having a second Bloom

    filter that contains items that have been removed. However, false positives in the second filter

    become false negatives in the composite filter, which may be undesirable. In this approach re-

    adding a previously removed item is not possible, as one would have to remove it from the

    "removed" filter.

    It is often the case that all the keys are available but are expensive to enumerate (for example,

    requiring many disk reads). When the false positive rate gets too high, the filter can be

    regenerated; this should be a relatively rare event.

    Space and time advantages


[Figure: A Bloom filter used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times. Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without the Bloom filter. Use of a Bloom filter for this purpose, however, does increase memory usage.]
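A minimal sketch of that access pattern, with an in-memory dictionary standing in for the slow disk and all names invented for illustration:

import hashlib

# Sketch of a Bloom filter guarding a slow key-value store: the filter is consulted
# first, and the expensive lookup runs only when the filter reports a (possible) hit.
M, K = 1 << 16, 4
bits = bytearray(M)
disk = {}                                # stands in for slow on-disk storage

def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def put(key, value):
    disk[key] = value
    for p in positions(key):
        bits[p] = 1

def get(key):
    if any(bits[p] == 0 for p in positions(key)):
        return None                      # definite miss: the disk is never touched
    return disk.get(key)                 # may still miss (a false positive)

put("chr1:1042", "ACGT")
print(get("chr1:1042"), get("chr9:7"))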

    While risking false positives, Bloom filters have a strong space advantage over other data

    structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or

simple arrays or linked lists of the entries. Most of these require storing at least the data items

    themselves, which can require anywhere from a small number of bits, for small integers, to an

    arbitrary number of bits, such as for strings (tries are an exception, since they can share storage

    between elements with equal prefixes). Linked structures incur an additional linear space

overhead for pointers. A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage

    comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature.

    The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per

    element.

    However, if the number of potential values is small and many of them can be in the set, the

    Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for


    each potential element. Note also that hash tables gain a space and time advantage if they begin

    ignoring collisions and store only whether each bucket contains an entry; in this case, they have

effectively become Bloom filters with k = 1.[1]

Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of

    items already in the set. No other constant-space set data structure has this property, but the

    average access time of sparse hash tables can make them faster in practice than some Bloom

filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are

    independent and can be parallelized.

    To understand its space efficiency, it is instructive to compare the general Bloom filter with its

special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a

    small fraction of bits should be set, which means the array must be very large and contain long

    runs of zeros. The information content of the array relative to its size is low. The generalized

Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set, and

    these will be apparently random, minimizing redundancy and maximizing information content.

    Probability of false positives


[Figure: The false positive probability p as a function of the number of elements n in the filter and the filter size m. An optimal number of hash functions has been assumed.]

Assume that a hash function selects each array position with equal probability. If m is the number of bits in the array, and k is the number of hash functions, then the probability that a certain bit is not set to 1 by a certain hash function during the insertion of an element is

1 - 1/m.

The probability that it is not set to 1 by any of the hash functions is

(1 - 1/m)^k.

If we have inserted n elements, the probability that a certain bit is still 0 is

(1 - 1/m)^{kn};

the probability that it is 1 is therefore

1 - (1 - 1/m)^{kn}.

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is often given as

(1 - (1 - 1/m)^{kn})^k ≈ (1 - e^{-kn/m})^k.

This is not strictly correct, as it assumes independence for the probabilities of each bit being set. However, assuming it is a close approximation, we have that the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

k = (m/n) ln 2,

which gives

2^{-k} ≈ 0.6185^{m/n}.

The required number of bits m, given n (the number of inserted elements) and a desired false positive probability p (and assuming the optimal value of k is used) can be computed by substituting the optimal value of k in the probability expression above:

p = (1 - e^{-((m/n) ln 2)(n/m)})^{(m/n) ln 2},

which can be simplified to:

ln p = -(m/n) (ln 2)^2.

This results in:

m = -(n ln p) / (ln 2)^2.

This means that for a given false positive probability p, the length of a Bloom filter m is proportional to the number of elements being filtered n.[2] While the above formula is asymptotic (i.e. applicable as m, n -> infinity), the agreement with finite values of m, n is also quite good; the false positive probability for a finite Bloom filter with m bits, n elements, and k hash functions is at most

(1 - e^{-k(n + 0.5)/(m - 1)})^k.

So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit.[3]
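Plugging numbers into these sizing formulas is straightforward; the following sketch (illustrative names and values) computes the required filter size m and the corresponding optimal k for a target false positive probability p:

import math

# Size a Bloom filter for n elements and a target false positive probability p.
def size_filter(n, p):
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))   # bits required
    k = max(1, round((m / n) * math.log(2)))               # optimal number of hashes
    return m, k

m, k = size_filter(n=1_000_000, p=0.01)
print(m, k, m / 1_000_000)               # about 9.6 bits per element and k around 7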

    Approximating the number of items in a Bloom filter

Swamidass & Baldi (2007) showed that the number of items in a Bloom filter can be approximated with the following formula:

n* = -(N/k) ln(1 - X/N),


where n* is an estimate of the number of items in the filter, N is the length (in bits) of the filter, k is the number of hash functions per item, and X is the number of bits set to one.

    The union and intersection of sets

    Bloom filters are a way of compactly representing a set of items. It is common to try and

    compute the size of the intersection or union between two sets. Bloom filters can be used to

    approximate the size of the intersection and union of two sets. Swamidass & Baldi (2007)

showed that for two Bloom filters of length N, their counts can be estimated, respectively, as

n(A*) = -(N/k) ln(1 - |A|/N)   and   n(B*) = -(N/k) ln(1 - |B|/N),

where |A| and |B| are the numbers of bits set to one in each filter. The size of their union can be estimated as

n(A* ∪ B*) = -(N/k) ln(1 - |A ∪ B|/N),

where |A ∪ B| is the number of bits set to one in either of the two Bloom filters. The intersection can then be estimated as

n(A* ∩ B*) = n(A*) + n(B*) - n(A* ∪ B*),

using the three formulas together.
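These estimates can be coded directly. The sketch below assumes two equal-length filters represented as Python lists of 0/1 bits; the function names and the toy data are made up for illustration:

import math

# Cardinality, union and intersection estimates from the numbers of set bits.
def estimate_count(bits_set, N, k):
    return -(N / k) * math.log(1 - bits_set / N)

def estimates(a, b, k):
    N = len(a)                                   # both filters must have length N
    n_a = estimate_count(sum(a), N, k)
    n_b = estimate_count(sum(b), N, k)
    union_bits = sum(x | y for x, y in zip(a, b))
    n_union = estimate_count(union_bits, N, k)
    n_inter = n_a + n_b - n_union                # inclusion-exclusion
    return n_a, n_b, n_union, n_inter

a, b = [0] * 64, [0] * 64
for i in (3, 9, 17, 40):
    a[i] = 1
for i in (9, 17, 55):
    b[i] = 1
print(estimates(a, b, k=2))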

    Interesting properties

Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily

    large number of elements; adding an element never fails due to the data structure "filling up."

    However, the false positive rate increases steadily as elements are added until all bits in the filter

are set to 1, at which point all queries yield a positive result.

    Union and intersection of Bloom filters with the same size and set of hash functions can be

    implemented with bitwise OR and AND operations, respectively. The union operation on Bloom

    filters is lossless in the sense that the resulting Bloom filter is the same as the Bloom filter

    created from scratch using the union of the two sets. The intersect operation satisfies a weaker

    property: the false positive probability in the resulting Bloom filter is at most the false-positive


    probability in one of the constituent Bloom filters, but may be larger than the false positive

    probability in the Bloom filter created from scratch using the intersection of the two sets. There

are also more accurate estimates of intersection and union that are not biased in this way.

Some kinds of superimposed code can be seen as a Bloom filter implemented with physical edge-notched cards.

    Examples

    Google BigTable and Apache Cassandra use Bloom filters to reduce the disk lookups for non-

    existent rows or columns. Avoiding costly disk lookups considerably increases the performance

    of a database query operation.[4]

    The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is

    first checked against a local Bloom filter and only upon a hit a full check of the URL is

    performed.[5]

The Squid Web Proxy Cache uses Bloom filters for cache digests.[6]

    Bitcoin uses Bloom filters to verify payments without running a full network node.[7][8]

    The Venti archival storage system uses Bloom filters to detect previously stored data.[9]

The SPIN model checker uses Bloom filters to track the reachable state space for large

    verification problems.[10]

The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called a Bloom join[11] in the database literature).[12]

    Alternatives

Classic Bloom filters use 1.44 log2(1/ε) bits of space per inserted key, where ε is the false positive rate of the Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom filter is only log2(1/ε) per key (Pagh, Pagh & Rao 2005). Hence Bloom filters use 44% more space than a hypothetical equivalent optimal data structure. The number of hash functions used to achieve a given false positive rate ε is


proportional to log(1/ε), which is not optimal, as it has been proved that an optimal data structure would need only a constant number of hash functions independent of the false positive rate.

    Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction,

    which Dillinger & Manolios (2004b) identify as significantly more accurate than a Bloom filter

    when each is configured optimally. Dillinger and Manolios, however, point out that the

    reasonable accuracy of any given Bloom filter over a wide range of numbers of additions makes

    it attractive for probabilistic enumeration of state spaces of unknown size. Hash compaction is,

    therefore, attractive when the number of additions can be predicted accurately; however, despite

    being very fast in software, hash compaction is poorly suited for hardware because of worst-case

    linear access time.

Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will presumably improve performance by reducing the number of potential memory cache misses. The proposed variants have, however, the drawback of using about 32% more space than classic Bloom filters.

The space-efficient variant relies on using a single hash function that generates, for each key, a value in the range [0, n/ε], where ε is the requested false positive rate. The sequence of values is then sorted and compressed using Golomb coding (or some other compression technique) to occupy a space close to n log2(1/ε) bits. To query the Bloom filter for a given key, it suffices to check whether its corresponding value is stored in the Bloom filter. Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this problem, the sequence of values is divided into small blocks of equal size that are compressed separately. At query time only half a block will need to be decompressed on average. Because of the decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.

Another alternative to classic Bloom filters is one based on space-efficient variants of cuckoo hashing. In this case, once the hash table is constructed, the keys stored in the hash table are


    replaced with short signatures of the keys. Those signatures are strings of bits computed using a

    hash function applied on the keys.

    Extensions and applications

    Counting filters

    Counting filters provide a way to implement a delete operation on a Bloom filter without

    recreating the filter afresh. In a counting filter the array positions (buckets) are extended from

    being a single bit to being an n-bit counter. In fact, regular Bloom filters can be considered as

    counting filters with a bucket size of one bit. Counting filters were introduced by Fan et al.

    (1998).

The insert operation is extended to increment the value of the buckets and the lookup operation

    checks that each of the required buckets is non-zero. The delete operation, obviously, then

    consists of decrementing the value of each of the respective buckets.

    Arithmetic overflow of the buckets is a problem and the buckets should be sufficiently large to

    make this case rare. If it does occur then the increment and decrement operations must leave the

    bucket set to the maximum possible value in order to retain the properties of a Bloom filter.
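A counting filter with the saturating behavior described above can be sketched as follows (illustrative Python; the counter width, table size, and helper names are arbitrary choices):

import hashlib

# Sketch of a counting Bloom filter with saturating 4-bit counters.
M, K, MAX = 4096, 3, 15
counters = [0] * M

def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def insert(key):
    for p in positions(key):
        if counters[p] < MAX:
            counters[p] += 1             # saturate instead of overflowing

def lookup(key):
    return all(counters[p] > 0 for p in positions(key))

def delete(key):
    if lookup(key):
        for p in positions(key):
            if counters[p] < MAX:        # a saturated bucket must stay at the maximum
                counters[p] -= 1

insert("ACGT"); print(lookup("ACGT")); delete("ACGT"); print(lookup("ACGT"))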

    The size of counters is usually 3 or 4 bits. Hence counting Bloom filters use 3 to 4 times more

    space than static Bloom filters. In theory, an optimal data structure equivalent to a counting

    Bloom filter should not use more space than a static Bloom filter.

    Another issue with counting filters is limited scalability. Because the counting Bloom filter table

    cannot be expanded, the maximal number of keys to be stored simultaneously in the filter must

    be known in advance. Once the designed capacity of the table is exceeded, the false positive rate

    will grow rapidly as more keys are inserted.

    Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally

equivalent but uses approximately half as much space as counting Bloom filters. The scalability issue does not occur in this data structure. Once the designed capacity is exceeded, the keys

    could be reinserted in a new hash table of double size.

    The space efficient variant by Putze, Sanders & Singler (2007) could also be used to implement

    counting filters by supporting insertions and deletions.


    Data synchronization

    Bloom filters can be used for approximate data synchronization as in Byers et al. (2004).

    Counting Bloom filters can be used to approximate the number of differences between two sets

    and this approach is described in Agarwal & Trachtenberg (2006).

    Bloomier filters

    Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with

    each element that had been inserted, implementing an associative array. Like Bloom filters, these

    structures achieve a small space overhead by accepting a small probability of false positives. In

the case of "Bloomier filters", a false positive is defined as returning a result when the key is not

    in the map. The map will never return the wrong value for a key that is in the map.

    Compact approximators

    Boldi & Vigna (2005) proposed a lattice-based generalization of Bloom filters. A compact

    approximator associates to each key an element of a lattice (the standard Bloom filters being

    the case of the Boolean two-element lattice). Instead of a bit array, they have an array of lattice

    elements. When adding a new association between a key and an element of the lattice, they

compute the maximum of the current contents of the k array locations associated to the key with

    the lattice element. When reading the value associated to a key, they compute the minimum of

the values found in the k locations associated to the key. The resulting value approximates from

    above the original value.
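A toy sketch of a compact approximator, using the non-negative integers ordered by <= as the lattice (so the "maximum" and "minimum" above are ordinary max and min); all names and sizes here are illustrative:

import hashlib

# Toy compact approximator: writes take the maximum at each of the k locations,
# reads take the minimum over those locations.
M, K = 1024, 3
table = [0] * M

def positions(key):
    d = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(d[4 * i:4 * i + 4], "big") % M for i in range(K)]

def associate(key, value):
    for p in positions(key):
        table[p] = max(table[p], value)

def read(key):
    return min(table[p] for p in positions(key))   # approximates the value from above

associate("geneA", 7)
associate("geneB", 3)
print(read("geneA"), read("geneB"), read("unknown"))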

    Stable Bloom filters

    Deng & Rafiei (2006) proposed Stable Bloom filters as a variant of Bloom filters for streaming

    data. The idea is that since there is no way to store the entire history of a stream (which can be

    infinite), Stable Bloom filters continuously evict stale information to make room for more recent

    elements. Since stale information is evicted, the Stable Bloom filter introduces false negatives,

    which do not appear in traditional bloom filters. The authors show that a tight upper bound of

    false positive rates is guaranteed, and the method is superior to standard bloom filters in terms of

    false positive rates and time efficiency when a small space and an acceptable false positive rate

    are given.


    Scalable Bloom filters

    Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the

    number of elements stored, while assuring a minimum false positive probability. The technique

    is based on sequences of standard bloom filters with increasing capacity and tighter false positive

    probabilities, so as to ensure that a maximum false positive probability can be set beforehand,

    regardless of the number of elements to be inserted.

    Attenuated Bloom filters

    An attenuated bloom filter of depth D can be viewed as an array of D normal bloom filters. In the

    context of service discovery in a network, each node stores regular and attenuated bloom filters

    locally. The regular or local bloom filter indicates which services are offered by the node itself.

The attenuated filter of level i indicates which services can be found on nodes that are i hops away from the current node. The i-th value is constructed by taking a union of the local Bloom filters of nodes i hops away from the node.

Let's take a small network shown on the graph below as an example. Say we are searching for a service A whose id hashes to bits 0, 1, and 3 (pattern 11010). Let node n1 be the starting point.

    First, we check whether service A is offered by n1 by checking its local filter. Since the patterns

    don't match, we check the attenuated bloom filter in order to determine which node should be the

    next hop. We see that n2 doesn't offer service A but lies on the path to nodes that do. Hence, we

    move to n2 and repeat the same procedure. We quickly find that n3 offers the service, and hence

    the destination is located.

    By using attenuated Bloom filters consisting of multiple layers, services at more than one hop

    distance can be discovered while avoiding saturation of the Bloom filter by attenuating (shifting

    out) bits set by sources further away.
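A toy sketch of that routing decision, with each filter modeled as a Python set of bit positions and a hypothetical two-neighbour topology (the node names and bit patterns below are invented, not the n1/n2/n3 example above):

# Toy sketch of attenuated-Bloom-filter routing.  Each filter is modelled as a set of
# bit positions; a query "matches" a filter when all of the query's bits are set in it.
def matches(query_bits, filter_bits):
    return query_bits <= filter_bits

# Per neighbour: [the neighbour's local filter, filter for nodes one hop behind it, ...]
neighbours = {
    "n2": [{2, 4}, {0, 1, 3}],
    "n4": [{5}, {2, 6}],
}
service_a = {0, 1, 3}                    # the service id hashes to bits 0, 1 and 3

def next_hop(query_bits):
    for name, levels in neighbours.items():
        if any(matches(query_bits, level) for level in levels):
            return name                  # forward toward the first promising branch
    return None

print(next_hop(service_a))               # "n2": the service lies behind n2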


    HASH TABLE

[Figure: A small phone book as a hash table.]

    In computing, a hash table (also hash map) is a data structure used to implement an associative

    array, a structure that can map keys to values. A hash table uses a hash function to compute

an index into an array of buckets or slots, from which the correct value can be found.

    Ideally, the hash function should assign each possible key to a unique bucket, but this ideal

    situation is rarely achievable in practice (unless the hash keys are fixed; i.e. new entries are never

added to the table after it is created). Instead, most hash table designs assume that hash collisions (different keys that are assigned by the hash function to the same bucket) will occur and must be accommodated in some way.

In a well-dimensioned hash table, the average cost (number of instructions) for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key-value pairs, at (amortized[2]) constant average cost per operation.[3][4]

In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.

    Hashing



    The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given

    a key, the algorithm computes an index that suggests where the entry can be found:

    index = f(key, array_size)

    Often this is done in two steps:

    hash = hashfunc(key)

    index = hash % array_size

    In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size - 1) using the modulus operator (%).

    In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.
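
    A minimal sketch of the two reduction strategies, using Python's built-in hash() purely as a stand-in for a real table hash function:

        def bucket_index(key, array_size):
            h = hash(key)                          # hash is independent of the table size
            return h % array_size                  # reduce to 0 .. array_size-1

        def bucket_index_pow2(key, array_size):
            assert array_size & (array_size - 1) == 0, "array_size must be a power of two"
            return hash(key) & (array_size - 1)    # masking replaces the modulo operation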

    Choosing a good hash function

    A good hash function and implementation algorithm are essential for good hash table

    performance, but may be difficult to achieve.

    A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g. a Pearson's chi-squared test for discrete uniform distributions.[5][6]
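
    For instance, a quick empirical check along these lines might hash a sample of keys into b buckets and compare the bucket counts against the uniform expectation with Pearson's statistic. This is illustrative only; a real test would also compare the result against the chi-squared critical value for b - 1 degrees of freedom.

        from collections import Counter

        def chi_squared_uniformity(keys, hashfunc, num_buckets):
            # Pearson's statistic for observed bucket counts vs. a uniform expectation.
            counts = Counter(hashfunc(k) % num_buckets for k in keys)
            expected = len(keys) / num_buckets
            return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(num_buckets))

        # Example: 10,000 synthetic keys into 64 buckets with Python's built-in hash().
        stat = chi_squared_uniformity([f"key{i}" for i in range(10_000)], hash, 64)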

    The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size s, the hash function needs to be uniform only when s is a power of two. On the other hand, some hashing algorithms provide uniform hashes only when s is a prime number.[7]

    For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even


    if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[7]

    Cryptographic hash functions are believed to provide good hash functions for any table sizes, either by modulo reduction or by bit masking. They may also be appropriate if there is a risk of malicious users trying to sabotage a network service by submitting requests designed to generate a large number of collisions in the server's hash tables. However, the risk of sabotage can also be avoided by cheaper methods (such as applying a secret salt to the data, or using a universal hash function).

    Some authors claim that good hash functions should have the avalanche effect; that is, a single-bit change in the input key should affect, on average, half the bits in the output. Some popular hash functions do not have this property.

    Perfect hash function

    If all keys are known ahead of time, a perfect hash function can be used to create a perfect hash table that has no collisions. If minimal perfect hashing is used, every location in the hash table can be used as well.

    Perfect hashing allows for constant time lookups in the worst case. This is in contrast to most chaining and open addressing methods, where the time for lookup is low on average, but may be very large (proportional to the number of entries) for some sets of keys.

    Key statistics

    A critical statistic for a hash table is called the load factor. This is simply the number of entries divided by the number of buckets, that is, n/k, where n is the number of entries and k is the number of buckets.

    If the load factor is kept reasonable, the hash table should perform well, provided the hashing is good. If the load factor grows too large, the hash table will become slow, or it may fail to work (depending on the method used). The expected constant time property of a hash table assumes that the load factor is kept below some bound. For a fixed number of buckets, the time for a lookup grows with the number of entries and so does not achieve the desired constant time.


    Second to that, one can examine the variance of number of entries per bucket. For example, two

    tables both have 1000 entries and 1000 buckets; one has exactly one entry in each bucket, the

    other has all entries in the same bucket. Clearly the hashing is not working in the second one.

    A low load factor is not especially beneficial. As the load factor approaches 0, the proportion of unused areas in the hash table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.

    Collision resolution

    Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution, according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot.
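
    The quoted figure can be checked with the standard birthday-problem approximation P ≈ 1 - exp(-n(n-1)/(2m)):

        import math

        n, m = 2500, 1_000_000
        p_collision = 1 - math.exp(-n * (n - 1) / (2 * m))
        print(round(p_collision, 3))   # about 0.956, i.e. the ~95% quoted above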

    Therefore, most hash table implementations have some collision resolution strategy to handle

    such events. Some common strategies are described below. All these methods require that the

    keys (or pointers to them) be stored in the table, together with the associated values.

    Separate chaining


    [Figure: Hash collision resolved by separate chaining.]

    In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for the list operation. (The technique is also called open hashing or closed addressing.)

    In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely

    more than that. Therefore, structures that are efficient in time and space for these cases are

    preferred. Structures that are efficient for a fairly large number of entries are not needed or

    desirable. If these cases happen often, the hashing is not working well, and this needs to be fixed.

    Separate chaining with linked lists

    Chained hash tables with linked lists are popular because they require only basic data structures with simple algorithms, and can use simple hash functions that are unsuitable for other methods. The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket, that is, on the load factor.
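
    A minimal separate-chaining table might look like the following sketch (illustrative only: Python lists stand in for the linked lists, and resizing and deletion are omitted):

        class ChainedHashTable:
            def __init__(self, num_buckets=8):
                self.buckets = [[] for _ in range(num_buckets)]

            def _bucket(self, key):
                return self.buckets[hash(key) % len(self.buckets)]

            def put(self, key, value):
                bucket = self._bucket(key)
                for i, (k, _) in enumerate(bucket):
                    if k == key:                     # key already present: overwrite
                        bucket[i] = (key, value)
                        return
                bucket.append((key, value))          # otherwise chain a new entry

            def get(self, key):
                for k, v in self._bucket(key):       # scan only the selected bucket
                    if k == key:
                        return v
                raise KeyError(key)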

    Chained hash tables remain effective even when the number of table entries n is much higher

    than the number of slots. Their performance degrades more gracefully (linearly) with the load

    factor. For example, a chained hash table with 1000 slots and 10,000 stored keys (load factor 10)

    is five to ten times slower than a 10,000-slot table (load factor 1); but still 1000 times faster than

    a plain sequential list, and possibly even faster than a balanced search tree.

    For separate-chaining, the worst-case scenario is when all entries are inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries, so the worst-case cost is proportional to the number n of entries in the table.

    The bucket chains are often implemented as ordered lists, sorted by the key field; this choice approximately halves the average cost of unsuccessful lookups, compared to an unordered list. However, if some keys are much more likely to come up than others, an unordered list with a move-to-front heuristic may be more effective. More sophisticated data structures, such as balanced search trees, are worth considering only if the load factor is large

    (about 10 or more), or if the hash distribution is likely to be very non-uniform, or if one must

    guarantee good performance even in a worst-case scenario. However, using a larger table and/or

    a better hash function may be even more effective in those cases.

    Chained hash tables also inherit the disadvantages of linked lists. When storing small keys and values, the space overhead of the next pointer in each entry record can be significant. An additional disadvantage is that traversing a linked list has poor cache performance, making the processor cache ineffective.

    Separate chaining with list heads

    [Figure: Hash collision resolved by separate chaining with head records in the bucket array.]

    Some chaining implementations store the first record of each chain in the slot array itself.[4] The number of pointer traversals is decreased by one for most cases. The purpose is to increase cache efficiency of hash table access.

    The disadvantage is that an empty bucket takes the same space as a bucket with one entry. To

    save memory space, such hash tables often have about as many slots as stored entries, meaning

    that many slots have two or more entries.

    Separate chaining with other structures

    Instead of a list, one can use any other data structure that supports the required operations. For

    example, by using a self-balancing tree, the theoretical worst-case time of common hash table


    operations (insertion, deletion, lookup) can be brought down to O(log n) rather than O(n).

    However, this approach is only worth the trouble and extra memory cost if long delays must be

    avoided at all costs (e.g. in a real-time application), or if one must guard against many entries

    hashed to the same slot (e.g. if one expects extremely non-uniform distributions, or in the case of

    web sites or other publicly accessible services, which are vulnerable to malicious key

    distributions in requests).

    The variant called array hash table uses a dynamic array to store all the entries that hash to the same slot. Each newly inserted entry gets appended to the end of the dynamic array that is assigned to the slot. The dynamic array is resized in an exact-fit manner, meaning it is grown only by as many bytes as needed. Alternative techniques such as growing the array by block sizes or pages were found to improve insertion performance, but at a cost in space. This variation makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions. It also dispenses with the next pointers that are required by linked lists, which saves space. Despite frequent array resizing, space overheads incurred by the operating system, such as memory fragmentation, were found to be small.

    An elaboration on this approach is the so-called dynamic perfect hashing,[11] where a bucket that contains k entries is organized as a perfect hash table with k² slots. While it uses more memory (n² slots for n entries in the worst case, and n*k slots in the average case), this variant has guaranteed constant worst-case lookup time, and low amortized time for insertion.


    Open addressing

    [Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".]

    In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.[12] The name "open addressing" refers to the fact that the location ("address") of the item is not determined by its hash value. (This method is also called closed hashing; it should not be confused with "open hashing" or "closed addressing", which usually mean separate chaining.)

    Well-known probe sequences include:

    Linear probing, in which the interval between probes is fixed (usually 1)


    Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the starting value given by the original hash computation

    Double hashing, in which the interval between probes is computed by another hash function

    A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed. Separate chaining works correctly with any load factor, although performance is likely to be reasonable only if it is kept below 2 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.
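
    For concreteness, a minimal open-addressing table with linear probing might look like the sketch below (illustrative only: no deletion/tombstones, and the caller is expected to resize before the table fills up):

        class LinearProbingTable:
            EMPTY = object()                               # sentinel for a never-used slot

            def __init__(self, capacity=16):
                self.slots = [self.EMPTY] * capacity

            def _probe(self, key):
                start = hash(key) % len(self.slots)
                for step in range(len(self.slots)):        # at most one full pass, interval 1
                    yield (start + step) % len(self.slots)

            def put(self, key, value):
                for i in self._probe(key):
                    if self.slots[i] is self.EMPTY or self.slots[i][0] == key:
                        self.slots[i] = (key, value)
                        return
                raise RuntimeError("table full; resize before inserting more entries")

            def get(self, key):
                for i in self._probe(key):
                    if self.slots[i] is self.EMPTY:
                        raise KeyError(key)                # an unused slot ends the search
                    if self.slots[i][0] == key:
                        return self.slots[i][1]
                raise KeyError(key)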

    Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value; whether they are adjacent or nearby is completely irrelevant.

    Open addressing only saves memory if the entries are small (less than four times the size of a

    pointer) and the load factor is not too small. If the load factor is close to zero (that is, there are

    far more buckets than stored entries), open addressing is wasteful even if each entry is just two

    words.


    [Figure: This graph compares the average number of cache misses required to look up elements in tables with chaining and linear probing. As the table passes the 80%-full mark, linear probing's performance drastically degrades.]

    Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket (that is, usually the only one). It also has better locality of reference, particularly with linear probing. With small record sizes, these factors can yield better performance than chaining, particularly for lookups.

    Hash tables with open addressing are also easier to serialize, because they do not use pointers. On the other hand, normal open addressing is a poor choice for large elements, because these elements fill entire CPU cache lines (negating the cache advantage), and a large amount of space is wasted on large empty table slots. If the open addressing table only stores references to elements (external storage), it uses space comparable to chaining even for large records but loses its speed advantage.

    Generally speaking, open addressing is better used for hash tables with small records that can be

    stored within the table (internal storage) and fit in a cache line. They are particularly suitable for

    elements of one word or less. If the table is expected to have a high load factor, the records are

    large, or the data is variable-sized, chained hash tables often perform as well or better.

    Ultimately, used sensibly, any kind of hash table algorithm is usually fast enough; and the percentage of a calculation spent in hash table code is low. Memory usage is rarely considered excessive. Therefore, in most cases the differences between these algorithms are marginal, and other considerations typically come into play.

    Coalesced hashing

    A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.[12] Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements than table slots.


    Cuckoo hashing

    Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup

    time in the worst case, and constant amortized time for insertions and deletions. It uses two or

    more hash functions, which means any key/value pair could be in two or more locations. For

    lookup, the first hash function is used; if the key/value is not found, then the second hash

    function is used, and so on. If a collision happens during insertion, then the key is re-hashed with

    the second hash function to map it to another bucket. If all hash functions are used and there is

    still a collision, then the key it collided with is removed to make space for the new key, and the

    old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that

    location also results in a collision, then the process repeats until there is no collision or the

    process traverses all the buckets, at which point the table is resized. By combining multiple hash

    functions with multiple cells per bucket, very high space utilisation can be achieved.
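
    A sketch of cuckoo insertion with two hash functions and one cell per bucket follows (illustrative assumptions: SHA-256-based hash functions, a fixed displacement limit, and no resizing/rehashing logic):

        import hashlib

        def h(seed, key, size):
            d = hashlib.sha256(f"{seed}|{key}".encode()).digest()
            return int.from_bytes(d[:8], "big") % size

        class CuckooTable:
            def __init__(self, size=16, max_kicks=32):
                self.size, self.max_kicks = size, max_kicks
                self.t1 = [None] * size
                self.t2 = [None] * size

            def get(self, key):
                # Look in the first table, then the second.
                for table, seed in ((self.t1, 1), (self.t2, 2)):
                    slot = table[h(seed, key, self.size)]
                    if slot is not None and slot[0] == key:
                        return slot[1]
                raise KeyError(key)

            def put(self, key, value):
                # Update in place if the key is already stored in either table.
                for table, seed in ((self.t1, 1), (self.t2, 2)):
                    i = h(seed, key, self.size)
                    if table[i] is not None and table[i][0] == key:
                        table[i] = (key, value)
                        return
                entry = (key, value)
                for _ in range(self.max_kicks):
                    i1 = h(1, entry[0], self.size)
                    if self.t1[i1] is None:
                        self.t1[i1] = entry
                        return
                    entry, self.t1[i1] = self.t1[i1], entry    # displace the resident of t1
                    i2 = h(2, entry[0], self.size)
                    if self.t2[i2] is None:
                        self.t2[i2] = entry
                        return
                    entry, self.t2[i2] = self.t2[i2], entry    # displace the resident of t2
                raise RuntimeError("too many displacements; rehash with new functions or resize")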

    Robin Hood hashing

    One interesting variation on double-hashing collision resolution is Robin Hood hashing.[13] The idea is that a new key may displace a key already inserted, if its probe count is larger than that of the key at the current position. The net effect of this is that it reduces worst case search times in the table. This is similar to Knuth's ordered hash tables except that the criterion for bumping a key does not depend on a direct relationship between the keys. Since both the worst case and the variation in the number of probes are reduced dramatically, an interesting variation is to probe the table starting at the expected successful probe value and then expand from that position in both directions.[14] External Robin Hood hashing is an extension of this algorithm where the table is stored in an external file and each table position corresponds to a fixed-sized page or bucket with B records.[15]
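
    The displacement rule can be sketched on top of linear probing as follows (illustrative only: each slot stores its entry's probe distance, and deletion, lookup, and resizing are omitted):

        class RobinHoodTable:
            def __init__(self, capacity=16):
                self.slots = [None] * capacity             # each slot: (key, value, probe_distance)

            def put(self, key, value):
                i = hash(key) % len(self.slots)
                entry = (key, value, 0)
                for _ in range(len(self.slots)):
                    if self.slots[i] is None or self.slots[i][0] == entry[0]:
                        self.slots[i] = entry
                        return
                    if self.slots[i][2] < entry[2]:        # resident is "richer": swap and keep probing
                        entry, self.slots[i] = self.slots[i], entry
                    i = (i + 1) % len(self.slots)
                    entry = (entry[0], entry[1], entry[2] + 1)
                raise RuntimeError("table full; resize needed")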

    2-choice hashing

    2-choice hashing employs two different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used to compute two table locations. When an object is inserted in the table, then it is placed in the table location that contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice hashing employs the principle of the power of two choices.
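
    A sketch of the insertion rule (illustrative: chained buckets, SHA-256-based hash functions, and no deduplication of existing keys):

        import hashlib

        def h(seed, key, size):
            d = hashlib.sha256(f"{seed}|{key}".encode()).digest()
            return int.from_bytes(d[:8], "big") % size

        class TwoChoiceTable:
            def __init__(self, num_buckets=16):
                self.buckets = [[] for _ in range(num_buckets)]

            def put(self, key, value):
                i1 = h(1, key, len(self.buckets))
                i2 = h(2, key, len(self.buckets))
                # Place the entry in whichever candidate bucket is shorter (ties go to h1).
                target = i1 if len(self.buckets[i1]) <= len(self.buckets[i2]) else i2
                self.buckets[target].append((key, value))

            def get(self, key):
                for i in (h(1, key, len(self.buckets)), h(2, key, len(self.buckets))):
                    for k, v in self.buckets[i]:
                        if k == key:
                            return v
                raise KeyError(key)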


    Hopscotch hashing

    Another alternative open-addressing solution is hopscotch hashing,[16] which combines the approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.

    The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original

    hashed bucket, where a given entry is always found. Thus, search is limited to the number of

    entries in this neighborhood, which is logarithmic in the worst case, constant on average, and

    with proper alignment of the neighborhood typically requires one cache miss. When inserting an

    entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this

    neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an

    unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is

    outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar

    to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the

    neighborhood, instead of items being moved out with the hope of eventually finding an empty

    slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the