hash-based indexes

CPSC 461 Instructor: Marina Gavrilova

GoalGoal of today’s lecture is to introduce

concept of “hash-index”. Hashing is an alternative to tree structure that allows near constant access time to ANY record in a very large database.

Presentation OutlineIntroductionDid you know that?Static hashing definition and methodsGood hash functionCollision resolution (open addressing, chaining)Extendible hashingLinear hashing Useful links and current market trendsSummary

4

Approaches to Search

1.Sequential and list methods (lists, tables, arrays).

2. Direct access by key value (hashing)

3. Tree indexing methods.

Introduction to Hashing

5

Definition

Hashing is the process of mapping a key value to a position in a table.

A hash function maps key values to positions.

A hash table is an array that holds the records.

Searching in a hash table can be done in O(1) regardless of the hash table size.


6


7

Example of Usefullness

10 stock details, 10 table positions

Stock numbers are between 0 and 1

1000. Using the whole stock

numbers may require 1000 storage

locations and this is an obvious waste of

memory.


8

Applications of Hashing

Compilers use hash tables to keep track of declared variables

A hash table can be used for on-line spelling checkers — if

misspelling detection (rather than correction) is important, an entire

dictionary can be hashed and words checked in constant time

Game playing programs use hash tables to store seen positions,

thereby saving computation time if the position is encountered

again

Hash functions can be used to quickly check for inequality — if

two elements hash to different values they must be different

Storing sparse data

Introduction of Hashing

9

Did you know that?Cryptography was once known only to the key people in

the the National Security Agency and a few academics. Until 1996, it was illegal to export strong cryptography

from the United States.Fast forward to 2006, and the

Payment Card Industry Data Security Standard (PCI DSS) requires merchants to encrypt cardholder information. Visa and MasterCard can levy fines of up to $500,000 for not complying!

Among methods recommended are:Strong one-way hash functions (hashed indexes) Truncation Index tokens and pads (pads must be securely

stored) Strong cryptography [Hashing for fun and profit: Demystifying encryption for PCI

DSSRoger Nebel]

http://www.nsa.gov/

http://searchsecurity.techtarget.com/topics/0,295493,sid14_tax303586,00.html

10

Did you know that?Transport Layer Security protocol on networks (TLS)

uses the Rivest, Shamir, and Adleman (RSA) public key algorithm for the TLS key exchange and authentication, and the Secure Hashing Algorithm 1 (SHA-1) for the key exchange and hashing.

[System cryptography: Use FIPS compliant algorithms for encryption, hashing, and signing, Microsoft TechNews, 2005]

11

Did you know that? Spatial hashing studies performed at Microsoft

Research, Redmond combine hashing with computer graphics to create a new set of tools for rendering, mesh reconstruction, and collision optimization (see public poster by Hugues Hoppe on the next slide)

Perfect Spatial Hashing Sylvain Lefebvre Hugues Hoppe(Microsoft Research)

Hash table Offset table Hash table Offset table

Vector images Sprite maps

Alpha compression

3D textures 3D painting

Simulation Collision detection

2D 3D

1282 382 182 1283 353 193

ApplicationsHash function

p

p S

( )s h p

modq p r

[ ]q

Domain Hashtable H

Offsettable

( )h p p p

• Perfect hash on multidimensional data• No collisions ideal for GPU• Single lookup into a small offset table

• Offsets only ~4 bits per defined data• Access only ~4 instructions on GPU• Optimized spatial coherence

10243, 46MB, 530fps 20483, 56MB, 200fps

10243, 12MB, 140fps 2563, 100fps

10242, 500KB, 700fps +900KB, 200fps

(modulo table sizes)

0.9bits/pixel, 800fps

1.8%

24372

83333

• We design a perfect hash function to losslessly pack sparse data while retaining efficient random access:

• Simply:

453

nearest: 7.5MB, 370fps

11632

13

Did you know that?Combining hashing and encryption provides a

much stronger tool for database and password protection.

http://msdn.microsoft.com/msdnmag/issues/03/08/SecurityBriefs/

[Security Briefs, SMDN Magazine]



14

Hash Functions

Hashing is the process of chopping up the key and

mixing it up in various ways in order to obtain an index

which will be uniformly distributed over the range of

indices - hence the ‘hashing’.

There are several common ways of doing this:

Truncation Folding Modular Arithmetic

Hash Functions

15

Hash Functions – Truncation

Truncation is a method in which parts of the key are ignored and

the remaining portion becomes the index. - For this, we take the given key and produce a hash location

by taking portions of the key (truncating the key).

Example – If a hash table can hold 1000 entries and an 8-digit number is used as key, the 3rd, 5th and 7th digits

starting from the left of the key could be used to produce the

index. - e.g. .. Key is 62538194 and the hash location is 589. - Advantage: Simple and easy to implement.

Problems: Clustering and repetition.

Hash Functions

16

Hash Functions – Folding

Folding breaks the key into several parts and combines the parts to form an index.

- The parts may be recombined by addition, subtraction, multiplications and may have to be truncated as well.

- Such a process is usually better than truncation by itself since it produces a better distribution: all of the digits in the key are considered.

- Using a key 62538194 and breaking it into 3 numbers using the first 3 and the last 2 digits produced 625, 381 and 94. These could be added to get 1100 which could be truncated to 100.

They could be also be multiplied together and then three digits chosen

from the middle of the number produced.

Hash Functions

17

Hash Functions – (Modular Arithmetic)

Modular Arithmetic process essentially assures that the index produced is within a specified range. For this, the key is converted to an integer which is divided by the range of the index with the resulting

function being the value of the remainder.

Uses: biometrics, encryption, compression - If the value of the modulus is a prime number, the

distribution of indices obtained is quite uniform. - A table whose size is some number which has many

factors provides the possibility of many indices which are the same, so the size should be a prime number.

Hash Functions

18

Good Hash Functions

Hash functions which use all of the key are almost always better than those which use only some of the key.

- When only portions are used, information is lost and therefore the

number of possibilities for the final key are reduced.

- If we deal with the integer its binary form, then

the number of pieces that can be manipulated by the hash

function is greatly increased.

Hash Functions

19

Collision

It is obvious that no matter what function is used, the possibility

exists that the use of the function will produce an index which is a

duplicate of an index which already exists. This is a Collision.

Collision resolution strategy:

- Open addressing: store the key/entry in a different position

- Chaining: chain together several keys/entries in each position

Collision Resolution

20

Collision - Example

- - Hash table size 11 - - Hash function: key mod hash size

So, the new positions in the hash table are:

Some collisions occur with this hash function.


21

Collision Resolution – Open Addressing

Resolving collisions by open addressing is resolving the problem by

taking the next open space as determined by rehashing the key

according to some algorithm.

Two main open addressing collision resolution techniques:

- - Linear probing: increase by 1 each time [mod table size!]

- - Quadratic probing: to the original position, add 1, 4, 9, 16,…

also in some cases key-dependent increment technique is used.


ProbingIf the table position given by the hashed key is already occupied, increase the position by some amount, until an empty position is found

22


Linear Probingnew position = (current position + 1) MOD

hash sizeExample – Before linear probing:

After linear probing:

Problem – Clustering occurs, that is, the used spaces tend to appear in groups which tends to grow and thus increase the search time to reach an open space.


23


In order to try to avoid clustering, a method which does not look for

the first open space must be used.

Two common methods are used –

- - Quadratic Probing - - Key-dependent Increments


24


Quadratic Probingnew position = (collision position + j2) MOD

hash size { j =

1, 2, 3, 4, ……}Example – Before quadratic probing:

After quadratic probing:

Problem – Overflow may occurs when there is still space in the hash table.


25


Key-dependent Increments

This technique is used to solve the overflow problem of the quadratic probing method.

These increments vary according to the key used for the hash

function. If the original hash function results in a good

distribution, then key- dependent functions work quite well for rehashing and

all locations in the table will eventually be probed for a free position.

Key dependent increments are determined by using the key to

calculate a new value and then using this as an increment to determine

successive probes.


26


Key-dependent IncrementsFor example, since the original hash function was key Mod 11, we

might choose a function of key MOD 7 to find the increment. Thus the hash function becomes - -

new position = current position + ( key DIV 11) MOD 11

Example – Before key-dependent increments:

After key-dependent increments:


27


Key-dependent Increments In all of the closed hash functions it is important to

ensure that an increment of 0 does not arise.

If the increment is equal to hash size the same position will be probed all the time, so this value cannot be used.

If we ensure that the hash size is prime and the divisors for the open and closed hash are prime, the rehash function does not produce a 0 increment, then this method will usually access all positions as does the linear

probe. - Using a key-dependent method usually result reduces

clustering and therefore searches for an empty position should not be as long as for the linear method.


28

Collision Resolution – Chaining

Each table position is a linked list Add the keys and entries

anywhere in the list (front easiest)Advantages over open addressing:

- Simpler insertion and removal - Array size is not a limitation

(but should still minimize collisions: make table size roughly equal to expected number of keys and entries)

Disadvantage - Memory overhead is large if

entries are small


29

Collision Resolution – Chaining

Example:Before chaining:

After chaining:


30

In analyzing search efficiency, the average is usually used. Searching with

hash tables is highly dependent on how full the table is since as the table

approaches a full state, more rehashes are necessary. The proportion of the

table which is full is called the Load Factor.

- When collisions are resolved using open addressing, the maximum load

factor is 1. - Using chaining, however, the load factor can be greater

than 1 when the table is full and the linked list attached to each hash

address has more than one element. - Chaining consistently requires fewer probes than open

addressing. - Traversal of the linked list is slow and if the records are

small, it may be just as well to use open addressing. - Chaining is the best under two conditions --- when the

number of unsuccessful searches is large or when the records are

large. - Open addressing would likely be a reasonable choice

when most searches are likely to be successful, the load factor is moderate and

the records are relatively small.

Analysis of Searching using Hash Tables

31

Average number of probes for different collision resolution methods:

[ The values are for large hash tables, in this case larger than 500]


32

When are other representations more suitable than hashing:

Hash tables are very good if there is a need for many searches in a

reasonably stable database

Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed — in this case, trees are better for indexing

Also, hashing is very slow for any operations which require the entries to be sorted (e.g. query to Find the minimum key)


33

A perfect hashing function maps a key into a unique address. If the range of potential addresses is the same as the number of keys, the function is a minimal (in space) perfect hashing function.

What makes perfect hashing distinctive is that it is a process for mapping a key space to a unique address in a smaller address space, that is

hash (key) unique address

Not only does a perfect hashing function improve retrieval performance, but a minimal perfect hashing function would provide 100 percent storage utilization.

Perfect Hashing

34

Process of creating a perfect hash function

A general form of a perfect hashing function is:

p.hash (key) =(h0(key) + g[h1(key)] + g[h2(key)] mod N

Perfect Hashing

35

In Cichelli’s algorithm, the component functions are:

h0 = length (key)

h1 = first_character (key)

h2 = second_character (key)

and g = T (x)

where T is the table of values associated with individual characters x which may apply in a key.

The time consuming part of Cichelli’s algorithm is determining T.

Cichelli’s Algorithm

36

When we apply the Cichelli’s perfect hashing function to the keyword begin using table 1, we can get –

The keyword begin would be stored in location 33. Since the hash values run from 2 through 37 for this set of data, the hash function is a minimal perfect hashing function.

Cichelli’s AlgorithmTable 1: Values associated with the characters of the Pascal reserved words

37

Links for interactive hashing examples: http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm

http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html

http://www.cse.yorku.ca/~aaw/Hang/hash/Hash.html

http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html

Some Links to Hashing Animation

Hashing as Database IndexThe basic idea is to use hashing function, which

maps a search key value(of a field) into a record or bucket of records.

As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k>

Hash-based indexes are best for equality selections. Cannot support range searches.

Static and dynamic hashing techniques exist; trade-offs similar to ISAM vs. B+ trees.

Static Hashing# primary pages fixed, allocated

sequentially, never de-allocated; overflow pages allowed if needed.

h(k) mod M = bucket to which data entry with key k belongs. (M = # of buckets)

h(key) mod N

hkey

Primary bucket pages Overflow pages

20

N-1

Static Hashing (Contd.)Buckets contain data entries.Hash function depends on search key field of record

r. Must distribute values over range 0 ... M-1 (table size).h(key) = (a * key + b) mod M usually works well.a and b are constants; lots known about how to tune h.

Long overflow chains can develop and degrade performance. Extendible and Linear Hashing: Dynamic techniques to

fix this problem.

Rule of thumbTry to keep space utilization between 50%

and 80%If <50%, wasting spaceIf >80%, overflow significant

Depends on how good hashing function is&On # keys/bucket

42

Extendible Hashing (Fagin et. al. 1979) Expandable hashing (Knott 1971) Dynamic Hashing (Larson 1978)

Hash Functions for Extendible Hashing

43

Assume that a hashing technique is applied to a dynamically changing file composed of buckets, and each bucket can hold only a fixed number of items.

Extendible hashing accesses the data stored in buckets indirectly through an index that is dynamically adjusted to reflect changes in the file.

The characteristic feature of extendible hashing is the organization of the index, which is an expandable table.

Extendible Hashing

44

A hash function applied to a certain key indicates a position in the index and not in the file (or table or keys). Values returned by such a hash function are called pseudokeys.

The database/file requires no reorganization when data are added to or deleted from it, since these changes are indicated in the index.

Only one hash function h can be used, but depending on the size of the index, only a portion of the added h(K) is utilized.

A simple way to achieve this effect is by looking at the address into the string of bits from which only the i leftmost bits can be used.

The number i is the depth of the directory. In figure 1(a) (in the next slide), the depth is equal to two.

Extendible Hashing

45

Figure 1. An example of extendible hashing (Drozdek Textbook)

Example

Extendible Hashing as IndexSituation: Bucket (primary page) becomes full.

Why not re-organize file by doubling # of buckets?Reading and writing all pages is expensive!Idea: Use directory of pointers to buckets,

double # of buckets by doubling the directory, splitting just the bucket that overflowed!

Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page!

Trick lies in how hash function is adjusted!

ExampleDirectory is array of size 4.To find bucket for r, take

last `global depth’ # bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101,

it is in bucket pointed to by 01.

Insert: If bucket is full, split it (allocate new page, re-distribute). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

13*00

01

10

11

2

2

2

2

2

LOCAL DEPTH

GLOBAL DEPTH

DIRECTORY

Bucket A

Bucket B

Bucket C

Bucket D

DATA PAGES

10*

1* 21*

4* 12* 32* 16*

15* 7* 19*

5*

Insert h(r)=20 (Causes Doubling)

20*

00011011

2 2

2

2

LOCAL DEPTH 2

2

DIRECTORY

GLOBAL DEPTHBucket A

Bucket B

Bucket C

Bucket D

Bucket A2(`split image'of Bucket A)

1* 5* 21*13*

32*16*

10*

15* 7* 19*

4* 12*

19*

2

2

2

000001010

011100101

110111

3

3

3DIRECTORY

Bucket A

Bucket B

Bucket C

Bucket D

Bucket A2(`split image'of Bucket A)

32*

1* 5* 21*13*

16*

10*

15* 7*

4* 20*12*

LOCAL DEPTH

GLOBAL DEPTH

Points to Note20 = binary 10100. Last 2 bits (00) tell us r belongs

in A or A2. Last 3 bits needed to tell which.Global depth of directory: Max # of bits needed to

tell which bucket an entry belongs to.Local depth of a bucket: # of bits used to determine if

an entry belongs to this bucket.When does bucket split cause directory doubling?

Before insert, local depth of bucket = global depth. Insert causes local depth to become > global depth; directory is doubled by copying it over and `fixing’ pointer to split image page. (Use of least significant bits enables efficient doubling via copying of directory!)

00

01

10

11

2

Why use least significant bits in directory? Allows for doubling via copying!

000

001

010

011

3

100

101

110

111

vs.

0

1

1

6*6*

6*

6 = 110

00

10

01

11

2

3

0

1

1

6*6* 6*

6 = 110000

100

010

110

001

101

011

111

Least Significant Most Significant

Comments on Extendible HashingIf directory fits in memory, equality search answered

with one disk access; else two.100MB file, 100 bytes/rec, 4K pages contains

1,000,000 records (as data entries) and 25,000 directory elements; chances are high that directory will fit in memory.

Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.

Multiple entries with same hash value cause problems!Delete: If removal of data entry makes bucket

empty, can be merged with `split image’. If each directory element points to same bucket as its split image, can halve directory.

52

Expandable Hashing Similar idea to an extendible hashing. But binary tree is used to store an index on the buckets.

Dynamic Hashing

multiple binary trees are used. Outcome: - To shorten the search. - Based on the key --- select what tree to search.

Hybrid methods

Linear HashingThis is another dynamic hashing scheme, an

alternative to Extendible Hashing.LH handles the problem of long overflow chains

without using a directory, and handles duplicates. Idea: Use a family of hash functions h0, h1, h2, ...

hi(key) = h(key) mod(2iN); N = initial # bucketsh is some hash function (range is not 0 to N-1)If N = 2d0, for some d0, hi consists of applying h and

looking at the last di bits, where di = d0 + i.hi+1 doubles the range of hi (similar to directory

doubling)

Linear Hashing (Contd.)Directory avoided in LH by using overflow pages,

and choosing bucket to split round-robin.Splitting proceeds in `rounds’. Round ends when all

NR initial (for round R) buckets are split. Buckets 0 to Next-1 have been split; Next to NR yet to be split.

Current round number is Level.Search: To find bucket for data entry r, find hLevel(r):

If hLevel(r) in range `Next to NR’ , r belongs here.Else, r could belong to bucket hLevel(r) or bucket

hLevel(r) + NR; must apply hLevel+1(r) to find out.

Overview of LH File In the middle of a round.

Levelh

Buckets that existed at thebeginning of this round:

this is the range of

NextBucket to be split

of other buckets) in this round

Levelh search key value )(

search key value )(

Buckets split in this round:If is in this range, must useh Level+1

`split image' bucket.to decide if entry is in

created (through splitting`split image' buckets:

Linear Hashing (Contd.)Insert: Find bucket by applying hLevel / hLevel+1:

If bucket to insert into is full:Add overflow page and insert data entry.(Maybe) Split Next bucket and increment Next.

Can choose any criterion to `trigger’ split. Since buckets are split round-robin, long

overflow chains don’t develop!Doubling of directory in Extendible Hashing is

similar; switching of hash functions is implicit in how the # of bits examined is increased.

Example of Linear HashingOn split, hLevel+1 is used

to re-distribute entries.

0hh

1

(This infois for illustrationonly!)

Level=0, N=4

00

01

10

11

000

001

010

011

(The actual contentsof the linear hashedfile)

Next=0PRIMARY

PAGES

Data entry rwith h(r)=5

Primary bucket page

44* 36*32*

25*9* 5*

14* 18*10*30*

31*35* 11*7*

0hh

1

Level=0

00

01

10

11

000

001

010

011

Next=1

PRIMARYPAGES

44* 36*

32*

25*9* 5*

14* 18*10*30*

31*35* 11*7*

OVERFLOWPAGES

43*

00100

Example: End of a Round

0hh1

22*

00

01

10

11

000

001

010

011

00100

Next=3

01

10

101

110

Level=0PRIMARYPAGES

OVERFLOWPAGES

32*

9*

5*

14*

25*

66* 10*18* 34*

35*31* 7* 11* 43*

44*36*

37*29*

30*

0hh1

37*

00

01

10

11

000

001

010

011

00100

10

101

110

Next=0

Level=1

111

11

PRIMARYPAGES

OVERFLOWPAGES

11

32*

9* 25*

66* 18* 10* 34*

35* 11*

44* 36*

5* 29*

43*

14* 30* 22*

31*7*

50*

LH Described as a Variant of EHThe two schemes are actually quite similar:

Begin with an EH index where directory has N elements.

Use overflow pages, split buckets round-robin.First split is at bucket 0. (Imagine directory being

doubled at this point.) But elements <1,N+1>, <2,N+2>, ... are the same. So, need only create directory element N, which differs from 0, now. When bucket 1 splits, create directory element N+1, etc.

So, directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding i’th is easy), we actually don’t need a directory!

Useful Links

http://www.cs.ucla.edu/classes/winter03/cs143/l1/handouts/hash.pdf

http://www.ecst.csuchico.edu/~melody/courses/csci151_live/Static_hash_course_notes.htm

http://www.smckearney.com/adb/notes/lecture.extendible.hashing.pdf



SummaryHash-based indexes: best for equality searches,

cannot support range searches.Static Hashing can lead to long overflow chains.Extendible Hashing avoids overflow pages by

splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)Directory to keep track of buckets, doubles periodically.Directoryless schemes (linear dynamic hashing)

availableCan get large with skewed data; additional I/O if this

does not fit in main memory.

SummaryLinear Hashing avoids directory by splitting

buckets round-robin, and using overflow pages. Overflow pages not likely to be long.Duplicates handled easily.Space utilization could be lower than Extendible

Hashing, since splits not concentrated on `dense’ data areas.

Check ListWhat is the intuition behind hash-structured

indexes?Why are they especially good for equality

searches but useless for range selections?What is Extendible Hashing? How does it handle search, insert, and delete?What is Linear Hashing? What are the similarities and differences between

Extendible and Linear Hashing?How does perfect hash function works?

hash-based indexes

Documents

hash tables

strong cryptography

concept of hash

elements hash

hash table size

spatial hashing studies

static hashing definition

hash function maps key