1 Dept. of CIS, Temple Univ. CIS661 – Principles of Data Management V. Megalooikonomou Indexing and Hashing II (based on slides by C. Faloutsos at CMU)

Upload: angel-randall

Post on 08-Jan-2018


Page 1:

Dept. of CIS, Temple Univ.
CIS661 – Principles of Data Management
V. Megalooikonomou

Indexing and Hashing II
(based on slides by C. Faloutsos at CMU)

Page 2:

General Overview - rel. model
Relational model - SQL
    Formal & commercial query languages
Functional Dependencies
Normalization
Physical Design
Indexing

Page 3:

Indexing - overview
ISAM and B-trees
Hashing
Hashing vs B-trees
Indices in SQL
Advanced topics:
    dynamic hashing
    multi-attribute indexing

Page 4:

(Static) Hashing
Problem: “find EMP record with ssn=123”

Q: What if disk space was free, and time was at a premium?

Page 5:

Hashing
A: Brilliant idea: key-to-address transformation:

[Figure: one page per possible key, numbered #0 through #999,999,999; page #123 holds the record "123; Smith; Main str"]

Page 6:

Hashing
Since space is NOT free:
use M slots, instead of 999,999,999
hash function: h(key) = slot-id

[Figure: pages numbered #0 through #999,999,999; page #123 holds the record "123; Smith; Main str"]

Page 7:

Hashing
Typically: each hash bucket is a page, holding many records:

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str"]

Page 8:

Hashing
Notice: could have clustering, or non-clustering versions:

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]

Page 9:

Hashing
Notice: could have clustering, or non-clustering versions:

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the key 123; the records are stored in a separate EMP file]

EMP file:
    ...
    123; Smith; Main str.
    ...
    234; Johnson; Forbes ave
    345; Tompson; Fifth ave
    ...

Page 10:

Indexing - overview
ISAM and B-trees
hashing
    hashing functions
    size of hash table
    collision resolution
Hashing vs B-trees
Indices in SQL
Advanced topics:

Page 11:

Design decisions
1) formula h() for hashing function
2) size of hash table M
3) collision resolution method

Page 12:

Design decisions - functions
Goal: uniform spread of keys over hash buckets
Popular choices:
    Division hashing
    Multiplication hashing

Page 13:

Division hashing
h(x) = (a*x+b) mod M

e.g., h(ssn) = (ssn) mod 1,000 gives the last three digits of ssn

M: size of hash table - choose a prime number, defensively (why?)
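A minimal Python sketch of the division hashing formula, assuming integer keys. The function name `division_hash`, the default values for a and b, and the sample ssn are illustrative assumptions, not part of the slides.

```python
def division_hash(x: int, M: int, a: int = 1, b: int = 0) -> int:
    """Return the slot id h(x) = (a*x + b) mod M for an integer key x."""
    return (a * x + b) % M


# With a=1, b=0 and M=1,000 the slot id is the last three digits of the key.
ssn = 123456789  # made-up value, for illustration only
print(division_hash(ssn, M=1000))  # 789
```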

Page 14:

Division hashing
e.g., M=2; hash on driver-license number (dln), where the last digit is ‘gender’ (0/1 = M/F), in an army unit with predominantly male soldiers: almost all keys land in the same bucket.

Thus: avoid cases where M and keys have common divisors -- prime M guards against that!
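The skew can be checked with a small experiment. This is a sketch on synthetic data: the keys below are made up so that 95% of them end in the digit 0, and the prime M=7 is an arbitrary choice for comparison.

```python
from collections import Counter


def bucket_counts(keys, M):
    """Count how many keys land in each bucket under h(x) = x mod M."""
    return Counter(k % M for k in keys)


# Synthetic driver-license numbers: the last digit is 0 for 95% of the keys.
keys = [10 * i + (1 if i % 20 == 0 else 0) for i in range(1000)]

skewed = bucket_counts(keys, 2)   # M=2 only looks at the parity of the key
spread = bucket_counts(keys, 7)   # a prime M that shares no factor with 10

print(sorted(skewed.items()))     # [(0, 950), (1, 50)]
print(sorted(spread.items()))
```

With M=2 one bucket receives 950 of the 1,000 keys; with M=7 all seven buckets are used and the counts are much closer to each other.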

Page 15:

Multiplication hashing
h(x) = [ fractional-part-of ( x * φ ) ] * M   (rounded down to an integer slot id)

φ: golden ratio ( 0.618... = ( sqrt(5)-1 )/2 )
In general, we need an irrational number
Advantage: M need not be a prime number
But φ must be irrational
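A rough Python sketch of multiplication hashing. It uses floating-point arithmetic for brevity, which is an assumption of this example: for very large keys the fractional part loses precision, and real implementations typically use fixed-point integer arithmetic instead. The name `multiplication_hash` is illustrative.

```python
import math

PHI = (math.sqrt(5) - 1) / 2  # 0.618...


def multiplication_hash(x: int, M: int) -> int:
    """Return floor(M * frac(x * PHI)), a slot id in the range 0..M-1."""
    frac = (x * PHI) % 1.0            # fractional part of x * PHI
    return min(int(frac * M), M - 1)  # guard against float rounding at the top


# M does not have to be prime here; 1,000 is used on purpose.
print([multiplication_hash(x, 1000) for x in range(1, 6)])
```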

Page 16:

Other hashing functions
quadratic hashing (bad)
...
conclusion: use division hashing

Page 17:

Design decisions
1) formula h() for hashing function
2) size of hash table M
3) collision resolution method

Page 18:

Size of hash table
e.g., 50,000 employees, 10 employee-records / page
Q: M=?? pages/buckets/slots

Page 19:

Size of hash table
e.g., 50,000 employees, 10 employees/page
Q: M=?? pages/buckets/slots
A: utilization ~ 90% and M: prime number
E.g., in our case: M = closest prime to 50,000/10 / 0.9 ≈ 5,555
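The sizing rule can be worked through in a few lines of Python. The helpers `is_prime` and `closest_prime` are written for this example only; the search starts from the rounded target and moves outward.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; adequate for table sizes of this magnitude."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def closest_prime(target: float) -> int:
    """Return the prime nearest to round(target); ties go to the smaller one."""
    center = round(target)
    for offset in range(center):
        for candidate in (center - offset, center + offset):
            if is_prime(candidate):
                return candidate
    raise ValueError("no prime found")


records, records_per_page, utilization = 50_000, 10, 0.9
pages_needed = records / records_per_page / utilization  # about 5,555.6
M = closest_prime(pages_needed)
print(M)  # 5557
```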

Page 20:

Design decisions
1) formula h() for hashing function
2) size of hash table M
3) collision resolution method

Page 21:

Collision resolution
Q: what is a ‘collision’?
A: ??

Page 22:

Collision resolution

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]

Page 23:

Collision resolution
Q: what is a ‘collision’?
A: ??
Q: why worry about collisions/overflows? (recall that buckets are ~90% full)
A: e.g., bank account balances between $0 and $10,000 and between $90,000 and $100,000

Page 24:

Collision resolution
open addressing
    linear probing (i.e., put to next slot/bucket)
    re-hashing
separate chaining (i.e., put links to overflow pages)

Page 25:

Collision resolution

[Figure: linear probing; hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]
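Linear probing over page-sized buckets can be sketched as follows. This is an illustrative toy, not a production design: the class name, the choice h(key) = key mod M, and the in-memory lists standing in for disk pages are all assumptions. Deletions are deliberately left out, because removing a record would break the "stop at the first non-full bucket" rule used by `search`.

```python
class LinearProbingTable:
    """Static hash table of M buckets; a full bucket spills into the next one."""

    def __init__(self, M: int, capacity: int):
        self.M = M
        self.capacity = capacity                  # records per bucket (page)
        self.buckets = [[] for _ in range(M)]

    def _probe(self, key: int):
        """Yield bucket numbers h(key), h(key)+1, ..., wrapping around once."""
        start = key % self.M
        for step in range(self.M):
            yield (start + step) % self.M

    def insert(self, key: int, record: str) -> int:
        """Store the record in the first non-full bucket; return its number."""
        for slot in self._probe(key):
            if len(self.buckets[slot]) < self.capacity:
                self.buckets[slot].append((key, record))
                return slot
        raise RuntimeError("hash table is full")

    def search(self, key: int):
        for slot in self._probe(key):
            for k, record in self.buckets[slot]:
                if k == key:
                    return record
            if len(self.buckets[slot]) < self.capacity:
                return None       # a non-full bucket ends the probe sequence
        return None


table = LinearProbingTable(M=5, capacity=1)
print(table.insert(123, "Smith"))    # 3, since 123 mod 5 = 3
print(table.insert(128, "Johnson"))  # 4, bucket 3 is full so the next is used
print(table.search(128))             # Johnson
```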

Page 26:

Collision resolution

[Figure: re-hashing with hash functions h1() and h2(); hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]

Page 27:

Collision resolution

[Figure: separate chaining; the record "123; Smith; Main str."]
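Separate chaining with overflow pages might look like the sketch below. The `Page` and `ChainedHashTable` names, the choice h(key) = key mod M, and the in-memory lists standing in for disk pages are assumptions made for the example. Empty overflow pages are not reclaimed here.

```python
class Page:
    """A fixed-capacity page with a link to an optional overflow page."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.records = []
        self.overflow = None


class ChainedHashTable:
    def __init__(self, M: int, capacity: int):
        self.M = M
        self.capacity = capacity
        self.pages = [Page(capacity) for _ in range(M)]

    def insert(self, key, record):
        page = self.pages[key % self.M]
        # Walk the chain to the first page with room, extending it if needed.
        while len(page.records) >= page.capacity:
            if page.overflow is None:
                page.overflow = Page(self.capacity)
            page = page.overflow
        page.records.append((key, record))

    def search(self, key):
        page = self.pages[key % self.M]
        while page is not None:
            for k, record in page.records:
                if k == key:
                    return record
            page = page.overflow
        return None

    def delete(self, key) -> bool:
        """Remove one record; other records in the chain are unaffected."""
        page = self.pages[key % self.M]
        while page is not None:
            for i, (k, _) in enumerate(page.records):
                if k == key:
                    del page.records[i]
                    return True
            page = page.overflow
        return False
```

Unlike open addressing, the table never becomes full, and a deletion only touches the chain of one bucket.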

Page 28:

Design decisions - conclusions
function: division hashing
    h(x) = ( a*x+b ) mod M
size M: ~90% util.; prime number.
collision resolution: separate chaining
    easier to implement (deletions!); no danger of becoming full

Page 29:

Indexing - overview
ISAM and B-trees
hashing
Hashing vs B-trees
Indices in SQL
Advanced topics:
    dynamic hashing
    multi-attribute indexing

Page 30:

Hashing vs B-trees:
Hashing offers speed! (O(1) avg. search time)

..but B-trees offer:

Page 31:

Hashing vs B-trees:
..but B-trees offer:
key ordering:
    range queries
    proximity queries
    sequential scan
O(log(N)) guarantees for search, ins./del.
graceful growing/shrinking
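The difference in query support can be illustrated in Python. A sorted list searched with `bisect` stands in for the ordered leaf level of a B-tree here; it is not a B-tree implementation, and the sample records are made up.

```python
from bisect import bisect_left, bisect_right

records = {345: "Tompson", 123: "Smith", 234: "Johnson"}

hash_index = dict(records)       # exact-match lookups, O(1) on average
ordered_keys = sorted(records)   # stand-in for the sorted leaves of a B-tree


def range_query(lo: int, hi: int):
    """Return the keys in [lo, hi], in order, using binary search."""
    start = bisect_left(ordered_keys, lo)
    end = bisect_right(ordered_keys, hi)
    return ordered_keys[start:end]


print(hash_index[123])        # Smith
print(range_query(200, 400))  # [234, 345]
```

Answering the same range query with the hash index alone would require looking at every key, because hashing does not preserve key order.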

Page 32:

Hashing vs B-trees:

thus: B-trees are implemented in most systems

footnote: hashing is not (why not?)

Page 33:

Indexing - overview
ISAM and B-trees
hashing
Hashing vs B-trees
Indices in SQL
Advanced topics:
    dynamic hashing
    multi-attribute indexing

Page 34:

Indexing in SQL
create index <index-name> on <relation-name> (<attribute-list>)
create unique index <index-name> on <relation-name> (<attribute-list>)
    (in the case that the search key is a candidate key)
drop index <index-name>

Page 35:

Indexing in SQL
e.g.,
    create index ssn-index on STUDENT (ssn)
or (e.g., on TAKES(ssn, cid, grade)):
    create index sc-index on TAKES (ssn, c-id)
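These statements can be tried out with Python's built-in `sqlite3` module. The sketch below makes a few assumptions: the column types and the `name` column are invented, and the hyphens in the slide's names are replaced by underscores (`ssn_index`, `sc_index`, `c_id`) because an unquoted hyphen is not valid inside an SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (ssn INTEGER, name TEXT);
    CREATE TABLE TAKES (ssn INTEGER, c_id TEXT, grade INTEGER);
    -- ssn is a candidate key of STUDENT, so the index can be unique
    CREATE UNIQUE INDEX ssn_index ON STUDENT (ssn);
    -- multi-attribute index on TAKES
    CREATE INDEX sc_index ON TAKES (ssn, c_id);
""")


def index_names(connection):
    """List the indexes recorded in SQLite's schema catalog."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
    return sorted(row[0] for row in rows)


before_drop = index_names(conn)
conn.execute("DROP INDEX sc_index")
after_drop = index_names(conn)
print(before_drop)  # ['sc_index', 'ssn_index']
print(after_drop)   # ['ssn_index']
```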

Page 36:

Indexing - overview
ISAM and B-trees
hashing
Hashing vs B-trees
Indices in SQL
Advanced topics: (theoretical interest)
    dynamic hashing
    multi-attribute indexing

Page 37:

Problem with static hashing
problem: overflow?
problem: underflow? (under-utilization)

Page 38:

Solution: Dynamic/extendible hashing

Idea: shrink / expand hash table on demand..
    ...dynamic hashing

Details: how to grow gracefully, on overflow?

Many solutions - One of them: ‘extendible hashing’

Page 39:

Extendible hashing

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]

Page 40:

Extendible hashing

[Figure: hash table with buckets #0 ... M; bucket #h(123) holds the record "123; Smith; Main str."]

solution: split the bucket in two

Page 41:

Extendible hashing
in detail:
keep a directory, with ptrs to hash-buckets
Q: how to divide contents of bucket in two?
A: hash each key into a very long bit string; keep only as many bits as needed
Eventually:

Page 42:

Extendible hashing

[Figure: directory with entries 00..., 01..., 10..., 11... pointing to buckets; the hashed keys shown are 0001..., 0111..., 10011..., 10101..., 10110..., 101001..., 1101...]

Page 43:

Extendible hashing

[Figure: same directory and hashed keys as on Page 42]

Page 44:

Extendible hashing

[Figure: directory with entries 00..., 01..., 10..., 11... pointing to buckets; the hashed keys shown are 0001..., 0111..., 10011..., 10101..., 10110..., 101001..., 1101...]

split on 3rd bit

Page 45:

Extendible hashing

[Figure: directory with entries 00..., 01..., 10..., 11...; the keys 101001..., 10101... and 10110... are now grouped together, apart from 10011...; the other keys shown are 1101..., 0111... and 0001...; label: new page / bucket]

Page 46:

Extendible hashing

[Figure: directory (doubled), with entries 000..., 001..., 010..., 011..., 100..., 101..., 110..., 111...; the keys 101001..., 10101... and 10110... are grouped together, apart from 10011...; the other keys shown are 1101..., 0111... and 0001...; label: new page / bucket]

Page 47:

Extendible hashing

[Figure: BEFORE and AFTER views of the split. BEFORE: directory with entries 00..., 01..., 10..., 11.... AFTER: directory doubled to 000... through 111..., with 101001..., 10101... and 10110... grouped together, apart from 10011...]

Page 48:

Extendible hashing

Summary:
directory doubles on demand
or halves, on shrinking files
needs ‘local’ and ‘global’ depth (see book)
Mainly of theoretical interest - same for
    ‘linear hashing’ of Litwin
    ‘order preserving’
    ‘perfect hashing’ (no collisions!)
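A compact sketch of extendible hashing with a directory, a global depth and per-bucket local depths. Several choices here are assumptions of the example, not taken from the slides: SHA-256 truncated to 32 bits plays the role of the "very long bit string", the directory is indexed by the leading bits of the hash, buckets are in-memory dicts, and only growth is implemented (no halving on shrinking files).

```python
import hashlib

HASH_BITS = 32  # assumed width of the bit string kept for each key


def hash_bits(key) -> int:
    """Hash a key to a HASH_BITS-wide integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:4], "big")


class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth   # number of leading bits this bucket uses
        self.items = {}


class ExtendibleHashTable:
    def __init__(self, capacity: int = 2):
        self.capacity = capacity         # records per bucket (page)
        self.global_depth = 0
        self.directory = [Bucket(0)]     # always 2**global_depth pointers

    def _index(self, key) -> int:
        """Directory slot = the first global_depth bits of the hash."""
        return hash_bits(key) >> (HASH_BITS - self.global_depth)

    def search(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)          # overflow: split, then retry

    def _split(self, bucket):
        if bucket.local_depth == HASH_BITS:
            raise RuntimeError("out of hash bits")
        if bucket.local_depth == self.global_depth:
            # Double the directory: old slot i becomes slots 2i and 2i+1.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Slots whose newly used bit is 1 now point to the new bucket.
        shift = self.global_depth - bucket.local_depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = new_bucket
        # Redistribute the records between the old and the new bucket.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v
```

The directory only doubles when the overflowing bucket already uses all of the directory's bits (local depth equals global depth); otherwise the split just redirects some existing pointers.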

Page 49:

Indexing - overview
ISAM and B-trees
Hashing
Hashing vs B-trees
Indices in SQL
Advanced topics:
    dynamic hashing
    multi-attribute indexing

Page 50:

multiple-key access
How to support queries on multiple attributes, like
    grade>=3 and course=‘415’

Major motivation: Geographic Information Systems (GIS)

Page 51:

multiple-key access

[Figure: plot with axes x and y]

Page 52:

multiple-key access
Typical query: Find cities within x miles from Philadelphia
thus, we want to store nearby cities on the same disk page:

Page 53:

multiple-key access

[Figure: plot with axes x and y]

Page 54:

multiple-key access

[Figure: plot with axes x and y]

Page 55:

multiple-key access - R-trees

[Figure: plot with axes x and y]

Page 56:

multiple-key access - R-trees
R-trees: very successful for GIS (along with ‘z-ordering’)
more details: at ‘advanced topics’, later
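As a small illustration of z-ordering, the sketch below interleaves the bits of two non-negative integer coordinates into a single key, which could then be stored in an ordinary one-dimensional index. The function name, the 16-bit width, and the sample points are assumptions made for the example.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code


# Made-up grid coordinates; sorting by z-order key tends to keep
# points that are close in (x, y) close together in the sorted order.
points = [(3, 3), (0, 1), (2, 0), (1, 1), (0, 0), (1, 0)]
print(sorted(points, key=lambda p: z_order(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (3, 3)]
```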

Page 57:

Indexing - overview
ISAM and B-trees
hashing
Hashing vs B-trees
Indices in SQL
Advanced topics:
    dynamic hashing
    multi-attribute indexing

industry workhorse