index structures [13]

Dr. Kalpakis

CMSC 661, Principles of Database Systems

http://www.csee.umbc.edu/~kalpakis/Courses/661

Index Structures [13]

CMSC 661 - Dr. Kalpakis - Dr. Kalpakis 2

Terms

Search key

Values for a collection of fields used to find records with “matching” values of these fields

Index entry

Pair (search key value, database pointer)

Index structure

A data structure that allows us to locate a record that satisfies the search key conditions

Primary vs secondary index

Primary indexes return database address of record

Secondary return value of primary key of record

Dense vs sparse

Clustered vs non-clustered


Indexes on sequential files

A index file consists of index entry records

A sorted file of records is called a sequential file

the sorted file itself is called the data file

Records could be stored in

contiguous sequence blocks (can do binary search!)

a linked list of blocks

Index file is sorted in the same order as the data file

Dense indexes

Have an index entry for each record in the data file

Index file likely much smaller than data file

May fit in main memory

Useful to find records and answer some queries without reading the (likely

larger) data file


Indexes on sequential files

Three major approaches to reorganize a data file

Use overflow blocks

Insert new blocks in the sequential order

Move records between adjacent block

Since an index file, dense or sparse, is sorted on its search key,

it is a sequential file itself

Apply the same principles to it as for the data file


Secondary indexes

Primary indexes determine the location of the indexed records

Secondary indexes do not determine the location of the indexed

records

Secondary indexes are always dense, typically with duplicates

Indirection for secondary indexes

To avoid repeating the search key for duplicates, we can have for each

distinct search key value a bucket with references to all the records with

that search key value

idea naturally leads to inverted indexes in Information Retrieval

For each word, have a secondary index of documents that contain it

Combine all these indexes into one


Conventional indexes

Advantages

Simple

Index is a sequential file

Good for scans

Disadvantages

Inserts are expensive

Loose sequential order

Become imbalanced


B-trees

An index that organizes its blocks into a tree

A B-tree with parameter n

Is a balanced tree since all leaves are required to be at the same level

Each node has space for n search keys and n+1 database pointers to

blocks of the tree

All internal nodes have more than ceil(n+1/2) children

Each leaf has no less than floor(n+1/2) keys from the data file

Root has at least two children


B-trees

The keys in each node are stored in increasing sorted order

The keys in the leafs

Are copies of keys from the data file

Appear among all the leaves in increasing sorted order

Each key K in a non-leaf node

Is flanked by two pointers to subtrees

The subtree to its left contains keys < K

The subtree to its right contains keys >= K

K is the smallest key appearing among the leaves of the right subtree


B-tree example

Root

100

120

150

180

30

3 5 11

30

35

100

101

110

120

130

150

156

179

180

200

n=3


B-tree example - non-leaf node

57

81

95

to keys < 57 to keys 57 k<81 to keys 81 k<95 to keys 95

from non-leaf node


B-tree example – leaf node

57

81

95

To r

eco

rd

wit

h k

ey

57

wit

h k

ey

81

wit

h k

ey

85

to next leaf in sequence

from non-leaf node


Operations and B-trees

Lookup

Equality queries

Traverse the tree from the root down to an appropriate leaf using the keys

stored at the nodes to decide which subtree(s) to visit

Range queries in which records with search keys in a given range are

sought for

Insertion

Find leaf block B to insert search key K

If B is not full insert K there

If B is full, create new block B’, distribute the keys among the two blocks,

and recursively insert a search key for B’ to the parent of B,

update the keys that appear on the paths from B (and B’) to the root


Operations and B-trees

Deletion

To delete search key K, locate leaf block B that contains K

Delete entry for K from B

If B now has fewer than the required keys/pointers

If sibling node B’ has an extra entry that could give to B, move it to B

Else, coalesce by moving all the entries from B to one of its siblings and

recursively delete the entry for B from its parent

Update the keys along the path from the modified blocks to the root

Coalescing is often not implemented

Too hard for little benefits


B-trees vs Indexed Sequential Files

Concurrency control harder in B-trees

B-trees consume more spaceStatic index saves enough accesses to allow for reorganization

On the other handDBA does not know when to reorganize

DBA does not know how full to load pages of new index

BufferingB-tree has fixed buffer requirements

Static index must read several overflow blocks to be efficient (large & variable size buffers needed for this)

LRU is bad for B-trees (always keep the root block in memory)

B-trees are better


Interesting problems

What is a good index structure when

records tend to be inserted with keys that are larger than existing values?

we want to remove older data

What is the best n?

As disks get bigger, main memories increase, and CPUs get faster?


Static hashing

Maintain an array of buckets, each bucket being a chain of

blocks

Use a hash function to compute the bucket index for record(s)

with given search key K

Search chain of blocks of bucket h(k) for the sought records

Inserts/deletes are basic linked list insert/deletes

Collisions can lead to deteriorated performance by having few

long chains


Static hashing

Uses fixed hash function and number of buckets

If #buckets is too small, performance will degrade due to frequent

overflows and long chains

If #buckets is made large to anticipate future growth, secondary storage

is wasted initially or whenever the data file shrinks

One rather expensive option is to periodically rebuild the hash

index

Another option would be to devise hashing that allows for the

hash function or #buckets to change dynamically with small

overhead


Extensible hashing

Uses a hash function h that generates values over [0, 2b)at any time it uses only a prefix (high order bits) of i bits of the hashi is initially 0 and changes as the size of the data file changesit maintains an bucket array with 2i pointers to bucketsthe i-prefix h(K)[i] is used to index into the bucket array Multiple entries of the bucket array may point to the same bucket

Each bucket j stores with it the bucket prefix ij All the keys it contains have the same value for their ij prefixThere are 2 (i-ij) pointers to bucket j from the bucket array

The number of distinct buckets changes dynamically due to coalescing and splitting of buckets.


Extensible hashing


Operations with extensible hashing

Lookup bucket for search key K

Use the i high order bits h(K)[i] of the hash to index into the bucket array and

get the bucket

Inserting search key K into bucket j

If there is room in the bucket j insert record in the bucket

Else split the bucket j and attempt the insertion again (use overflow buckets if

full again)

Deleting search key K from bucket j Delete K from bucket j Remove the bucket j if empty, by nullifying the relevant bucket array entry Coalescing of buckets can be done

can coalesce only with a “buddy” bucket having same value of ij and same ij –1 prefix, if it is present

Decreasing bucket array size is also possible


Splitting buckets in extensible hashing

Splitting bucket j depends on the #pointers to it in the bucket

array

If i > ij (more than one pointer to bucket j)

allocate a new bucket z with bucket prefix ij +1

Set the bucket prefix of j to ij +1

make the highest half of the bucket array entries pointing to j to now

point to z

remove and reinsert each record in bucket j

If i = ij (only one pointer to bucket j)

increment i and double the size of the bucket array

replace each entry in the bucket array by two entries that point to the

same bucket


Initial Extendable Hash structure, bucket size = 2 records

Extensible hashing - example


After inserting of one Brighton and two Downtown records

After inserting of Mianus



After inserting three Perryridge records




Hash structure after insertion of Redwood and Round Hill records


Extensible hashing

AdvantagesHash performance does not degrade with growth of file

Minimal space overhead

No full reorganizations

Disadvantages Extra level of indirection to find desired record

Not bad, if bucket array is in memory

Bucket array may itself become very big (larger than memory)

Need a tree structure to locate desired record in the structure!

Changing size of bucket array is an expensive operation

Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows


Linear hashing

Hash function h that generates values over [0, 2b)

Uses suffix h(K)[i] of i low order bits of hash h(K)

Maintains array with n buckets

Attempts to limit the load factor r/n, where r=#records

Linear hashing has all the advantages of extensible hashing and

Grows the #buckets slower

Less expensive to change the size of the bucket array

Long chains still possible


Operations for Linear hashing

Lookup bucket for search key K

If m=h(K)[i] < n then return bucket m else return bucket m-2i-1

Insert record with search key K

Lookup the bucket j for K

If bucket j has space insert it there, else chain a new block to the bucket

and insert the record there

If the load factor is above a threshold

add a new bucket to the bucket array, whose number is 1a2a3…ai

split the records in bucket 0a2a3…ai with the new bucket

If n=2i then increment i by 1


Indexing vs hashing

Hashing good for probes given key

SELECT * FROM R WHERE R.K=5

Indexing good for range queries

SELECT * FROM R WHERE R.K BETWEEN 5 AND 10;


Index definition in SQL

CREATE {UNIQUE} INDEX idxname ON rname(attrs)

DROP INDEX idxname;

index structures [13]

Documents