b+-trees and hashing. model of computation data stored on disk(s) minimum transfer unit: page = b...

B+-trees and Hashing

Model of Computation

Data stored on disk(s) Minimum transfer unit: page = b bytes or B records (or

block) If r is the size of a record,

then B=b/r N records -> N/B pages I/O complexity: in number

of pages

CPU Memory

Disk

I/O complexity

An ideal index has space O(N/B), update overhead O(1) or O(logB(N/B)) and search complexity O(a/B) or O(logB(N/B) + a/B)

where a is the number of records in the answer

But, sometimes CPU performance is also important… minimize cache misses -> don’t waste CPU cycles

Buffer manager- Context

Query Optimizationand Execution

Relational Operators

Files and Access Methods

Buffer Management

Disk Space Management

DB

Buffer Management

Keep pages in a part of memory (buffer), read directly from there What happens if you need to bring a new page into buffer and

buffer is full: you have to evict one page Replacement policy:

LRU : Least Recently Used (CLOCK) MRU: Most Recently Used Toss-immediate : remove a page if you know that you will not

need it again Pinning (needed in recovery, index based processing,etc) Other DB specific RPs: DBMIN, LRU-k, 2Q

Buffer Management in a DBMS

Data must be in RAM for DBMS to operate on it! Buffer Mgr hides the fact that not all data is in RAM

DB

MAIN MEMORY

DISK

disk page

free frame

Page Requests from Higher Levels

BUFFER POOL

choice of frame dictatedby replacement policy

When a Page is Requested ...

Buffer pool information table contains: <frame#, pageid, pin_count, dirty>

If requested page is not in pool and no free frame:

Choose a frame for replacement.Only “un-pinned” pages are candidates!

If frame is “dirty”, write it to disk Read requested page into chosen frame

Pin the page and return its address.

If requests can be predicted (e.g., sequential scans) pages can be pre-fetched several pages at a time!

More on Buffer Management Requestor of page must eventually unpin it, and indicate

whether page has been modified: dirty bit is used for this.

Page in pool may be requested many times, a pin count is used. To pin a page, pin_count++ A page is a candidate for replacement iff pin count == 0

(“unpinned”) CC & recovery may entail additional I/O when a frame is

chosen for replacement. Write-Ahead Log protocol; more later!

Buffer Replacement Policy

Frame is chosen for replacement by a replacement policy: Least-recently-used (LRU), MRU,

Clock, etc. Policy can have big impact on # of

I/O’s; depends on the access pattern.

LRU Replacement Policy

Least Recently Used (LRU) for each page in buffer pool, keep track of time

when last unpinned replace the frame which has the oldest (earliest)

time very common policy: intuitive and simple

Works well for repeated accesses to popular pages Problems?

“Clock” Replacement Policy

• An approximation of LRU• Arrange frames into a cycle, store one reference bit per frame

• Can think of this as the 2nd chance bit• When pin count reduces to 0, turn on ref. bit• When replacement necessary

do for each page in cycle {if (pincount == 0 && ref bit is on)

turn off ref bit;else if (pincount == 0 && ref bit is off)

choose this page for replacement;} until a page is chosen;

A(1)

B(p)

C(1)

D(1)

B+-tree

Records must be ordered over an attribute,

SSN, Name, etc. Queries: exact match and range

queries over the indexed attribute: “find the name of the student with ID=087-34-7892” or “find all students with gpa between 3.00 and 3.5”

B+-tree:properties

Insert/delete at log F (N/B) cost; keep tree height-balanced. (F = fanout)

Minimum 50% occupancy (except for root). Each node contains d <= m <= 2d entries/pointers.

Two types of nodes: index nodes and data nodes; each node is 1 page (disk based method)

Example

Root

100

120

150

180

30

40

3 5 11

30

35

100

101

110

120

130

150

156

179

180

200

41

55

57

81

95

to keys to keys to keys to keys

< 57 57 k<81 81k<95 95

Index node

Data node5

7

81

95

To r

eco

rd

wit

h k

ey 5

7

To r

eco

rd

wit

h k

ey 8

1

To r

eco

rd

wit

h k

ey 8

5

From non-leaf node

to next leaf

in sequence

Struct {Key real;Pointr *long;} entry;

Node entry[B];

Insertion Find correct leaf L. Put data entry onto L.

If L has enough space, done! Else, must split L (into L and a new node L2)

Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L.

This can happen recursively To split index node, redistribute entries evenly, but

push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height.

Tree growth: gets wider or one level taller at top.

Deletion

Start at root, find leaf L where entry belongs. Remove the entry.

If L is at least half-full, done! If L has only d-1 entries,

Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).

If re-distribution fails, merge L and sibling. If merge occurred, must delete entry

(pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing

height.

Characteristics Optimal method for 1-d range queries:Space: O(N/B), Updates: O(logB(N/B)),

Query:O(logB(N/B) + a/B) Space utilization: 67% for random input B-tree variation: index nodes store also

pointers to records

Example

Root

100

120

150

180

30

3 5 11

30

35

100

101

110

120

130

150

156

179

180

200

Range[32, 160]

Other issues

Internal node architecture [Lomet01]: Reduce the overhead of tree traversal.

Prefix compression: In index nodes store only the prefix that differentiate consecutive sub-trees. Fanout is increased.

Cache sensitive B+-tree Place keys in a way that reduces the cache

faults during the binary search in each node. Eliminate pointers so a cache line contains

more keys for comparison.

Exact Match QueriesExact Match Queries

Yes! Hashing static hashing dynamic hashing

B+-tree is perfect, but....

to answer a selection query (ssn=10) needs to traverse a full path.

In practice, 3-4 block accesses (depending on the height of the tree, buffering)

Any better approach?

HashingHashing

Hash-based indexes are best for equality selections. Cannot support range searches.

Static and dynamic hashing techniques exist.

Static HashingStatic Hashing

# primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.

h(k) MOD N= bucket to which data entry with key k belongs. (N = # of buckets)

h(key) mod N

hkey

Primary bucket pages

Overflow pages

10

N-1

Static Hashing (Contd.)Static Hashing (Contd.)

Buckets contain data entries. Hash fn works on search key field of record

r. Use its value MOD N to distribute values over range 0 ... N-1.

h(key) = (a * key + b) usually works well. a and b are constants; lots known about how to tune h.

Long overflow chains can develop and degrade performance.

Extendible and Linear Hashing: Dynamic techniques to fix this problem.

Linear HashingLinear Hashing

A dynamic hashing scheme that handles the problem of long overflow chains without using a directory.

Directory avoided in LH by using temporary overflow pages, and choosing the bucket to split in a round-robin fashion.

When any bucket overflows split the bucket that is currently pointed to by the “Next” pointer and then increment that pointer to the next bucket.

Linear Hashing – The Main IdeaLinear Hashing – The Main Idea

Use a family of hash functions h0, h1, h2, ...

hL(key) = h(key) mod(2LN)

N = initial # buckets

h is some hash function

L level, starts at 0

Each time we use at most two hash functions: hL(key)

and hL+1(key)

hL+1 doubles the range of hL

Linear Hashing (Contd.)Linear Hashing (Contd.)

Algorithm proceeds in `rounds’. Current round number is L.

There are NL (= N * 2L) buckets at the beginning of a round

Buckets 0 to Next-1 have been split; Next to NL have not been split yet this round.

Round ends when all initial buckets have been split (i.e. Next = NL).

To start next round:L++;

Next = 0;

Linear Hashing - InsertLinear Hashing - Insert

Find appropriate bucket If bucket to insert into is full:

Add overflow page and insert data entry. Split Next bucket and increment Next.

Note: This is likely NOT the bucket being inserted to!!! to split a bucket, create a new bucket and use hLevel+1 to re-

distribute entries.

Since buckets are split round-robin, long overflow chains don’t develop!

Example: Insert 43Example: Insert 43

L=0, N=4

Next=0

PRIMARYPAGES

Level=0Next=1

PRIMARYPAGES

OVERFLOWPAGES44 3632

259 5

14 18 10 30

31 35 117

44 36

32

259 5

14 18 10 30

31 35 117 43

h0(x) = x mod 4h1(x) = x mod 8

Example: End of a RoundExample: End of a Round

22*

Next=3

Level=0, Next = 3PRIMARYPAGES

OVERFLOWPAGES

32*

9*

5*

14*

25*

66* 10*18* 34*

35*31* 7* 11* 43*

44*36*

37*29*

30*

37*

Next=0

PRIMARYPAGES

OVERFLOWPAGES

32*

9* 25*

66* 18* 10* 34*

35* 11*

44* 36*

5* 29*

43*

14* 30* 22*

31*7*

50*

Insert 50 Level=1, Next = 0

LH Search AlgorithmLH Search Algorithm

To find bucket for data entry r, find hL(r):

If hL(r) >= Next (i.e., hL(r) is a bucket that hasn’t been involved in a split this round) then r belongs in that bucket for sure.

Else, r could belong to bucket hL(r) or bucket hL(r) + NLevel must apply hL+1(r) to find out.

b+-trees and hashing. model of computation data stored on disk(s) minimum transfer unit: page = b...

Documents