index structures [13]
Dr. Kalpakis
CMSC 661, Principles of Database Systems
http://www.csee.umbc.edu/~kalpakis/Courses/661
Index Structures [13]
CMSC 661 - Dr. Kalpakis - Dr. Kalpakis 2
Terms
Search key
Values for a collection of fields used to find records with “matching” values of these fields
Index entry
Pair (search key value, database pointer)
Index structure
A data structure that allows us to locate a record that satisfies the search key conditions
Primary vs secondary index
Primary indexes return the database address of the record
Secondary indexes return the value of the primary key of the record
Dense vs sparse
Clustered vs non-clustered
Indexes on sequential files
An index file consists of index entry records
A sorted file of records is called a sequential file
the sorted file itself is called the data file
Records could be stored in
a contiguous sequence of blocks (can do binary search!)
a linked list of blocks
Index file is sorted in the same order as the data file
Dense indexes
Have an index entry for each record in the data file
Index file likely much smaller than data file
May fit in main memory
Useful to find records and answer some queries without reading the (likely
larger) data file
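A minimal sketch of a dense index over a small sequential file, assuming records are (key, value) pairs and a database pointer is a (block number, slot number) pair. The names `data_blocks`, `dense_index`, `lookup`, and `key_exists` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical data file: records sorted by search key, packed into blocks.
data_blocks = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Dense index: one (key, pointer) entry per record, in the same order as the file.
dense_index = [
    (key, (block_no, slot_no))
    for block_no, block in enumerate(data_blocks)
    for slot_no, (key, _) in enumerate(block)
]
index_keys = [key for key, _ in dense_index]


def key_exists(key):
    """Answer an existence query from the index alone, without reading the data file."""
    pos = bisect_left(index_keys, key)
    return pos < len(index_keys) and index_keys[pos] == key


def lookup(key):
    """Binary-search the index, then follow the pointer to the record (or return None)."""
    pos = bisect_left(index_keys, key)
    if pos == len(index_keys) or index_keys[pos] != key:
        return None
    block_no, slot_no = dense_index[pos][1]
    return data_blocks[block_no][slot_no]
```

Here `key_exists` never touches `data_blocks`, which is the kind of query a dense index can answer on its own.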
Indexes on sequential files
Three major approaches to reorganize a data file
Use overflow blocks
Insert new blocks in the sequential order
Move records between adjacent blocks
Since an index file, dense or sparse, is sorted on its search key,
it is a sequential file itself
Apply the same principles to it as for the data file
Secondary indexes
Primary indexes determine the location of the indexed records
Secondary indexes do not determine the location of the indexed
records
Secondary indexes are always dense, typically with duplicates
Indirection for secondary indexes
To avoid repeating the search key for duplicates, we can have for each
distinct search key value a bucket with references to all the records with
that search key value
This idea naturally leads to inverted indexes in Information Retrieval
For each word, have a secondary index of documents that contain it
Combine all these indexes into one
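The bucket idea can be sketched as a tiny inverted index: each distinct word (the search key) maps to one bucket of document references. The sample documents and the function name `build_inverted_index` are made up for illustration, and splitting on whitespace is an assumed simplification.

```python
from collections import defaultdict

# Hypothetical documents, keyed by document id.
docs = {
    1: "index structures for databases",
    2: "hash index",
    3: "b-tree structures",
}


def build_inverted_index(documents):
    """Map each distinct word to a bucket (list) of ids of documents containing it."""
    buckets = defaultdict(list)
    for doc_id in sorted(documents):
        # set() ensures a document appears at most once in a word's bucket.
        for word in set(documents[doc_id].split()):
            buckets[word].append(doc_id)
    return dict(buckets)


inverted = build_inverted_index(docs)
```

Each word is stored once, no matter how many documents contain it, which is the saving the indirection is meant to provide.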
Conventional indexes
Advantages
Simple
Index is a sequential file
Good for scans
Disadvantages
Inserts are expensive
Lose sequential order
Become imbalanced
B-trees
An index that organizes its blocks into a tree
A B-tree with parameter n
Is a balanced tree since all leaves are required to be at the same level
Each node has space for n search keys and n+1 database pointers to
blocks of the tree
All internal nodes have at least ceil((n+1)/2) children
Each leaf has at least floor((n+1)/2) keys from the data file
Root has at least two children
B-trees
The keys in each node are stored in increasing sorted order
The keys in the leaves
Are copies of keys from the data file
Appear among all the leaves in increasing sorted order
Each key K in a non-leaf node
Is flanked by two pointers to subtrees
The subtree to its left contains keys < K
The subtree to its right contains keys >= K
K is the smallest key appearing among the leaves of the right subtree
B-tree example (n=3)
Root: 100
Non-leaf nodes below the root: (30) and (120, 150, 180)
Leaves, left to right: (3, 5, 11), (30, 35), (100, 101, 110), (120, 130), (150, 156, 179), (180, 200)
B-tree example - non-leaf node
Keys: 57, 81, 95
Pointers, left to right: to keys k < 57; to keys 57 ≤ k < 81; to keys 81 ≤ k < 95; to keys k ≥ 95
Incoming pointer: from a non-leaf node
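Choosing which pointer to follow in a non-leaf node with keys 57, 81, 95 amounts to counting how many of its keys are less than or equal to the search key. A small sketch using Python's `bisect_right`; the function name `child_index` is hypothetical.

```python
from bisect import bisect_right

node_keys = [57, 81, 95]  # keys of the example non-leaf node


def child_index(keys, k):
    """Index of the pointer to follow for search key k.

    Pointer 0 leads to keys < keys[0]; pointer j leads to keys in
    [keys[j-1], keys[j]); the last pointer leads to keys >= keys[-1].
    """
    return bisect_right(keys, k)
```

For example, `child_index(node_keys, 80)` selects pointer 1, the subtree holding keys with 57 ≤ k < 81.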
B-tree example – leaf node
Keys: 57, 81, 95
Pointers, left to right: to record with key 57; to record with key 81; to record with key 95; to next leaf in sequence
Incoming pointer: from a non-leaf node
Operations and B-trees
Lookup
Equality queries
Traverse the tree from the root down to an appropriate leaf using the keys
stored at the nodes to decide which subtree(s) to visit
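Range queries, in which records with search keys in a given range are sought
Find the leaf for the low end of the range, then follow the next-leaf pointers until a key past the high end appears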
Insertion
Find leaf block B to insert search key K
If B is not full insert K there
If B is full, create new block B’, distribute the keys among the two blocks,
and recursively insert a search key for B’ to the parent of B,
update the keys that appear on the paths from B (and B’) to the root
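The lookup part can be sketched on the n=3 example tree from the earlier slide. This is only an illustration: the tree is built by hand, assuming a root with key 100 whose two children hold (30) and (120, 150, 180), and the class and function names (`Leaf`, `Internal`, `find_leaf`, `contains`, `range_query`) are hypothetical. Insertion and deletion are not covered.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # pointer to the next leaf in sequence


class Internal:
    def __init__(self, keys, children):
        self.keys = keys
        self.children = children  # len(children) == len(keys) + 1


# Hand-built copy of the example tree (assumed layout).
leaves = [Leaf(k) for k in (
    [3, 5, 11], [30, 35], [100, 101, 110], [120, 130], [150, 156, 179], [180, 200])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Internal([100], [
    Internal([30], leaves[0:2]),
    Internal([120, 150, 180], leaves[2:6]),
])


def find_leaf(node, key):
    """Walk from the root down to the leaf that could contain key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def contains(tree, key):
    """Equality query."""
    return key in find_leaf(tree, key).keys


def range_query(tree, lo, hi):
    """Return all keys k with lo <= k <= hi by scanning the leaf chain."""
    result = []
    leaf = find_leaf(tree, lo)
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return result
            if k >= lo:
                result.append(k)
        leaf = leaf.next
    return result
```

An equality query reads one node per level, while a range query descends once and then walks sideways along the leaves.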
Operations and B-trees
Deletion
To delete search key K, locate leaf block B that contains K
Delete entry for K from B
If B now has fewer than the required keys/pointers
If sibling node B’ has an extra entry that it could give to B, move it to B
Else, coalesce by moving all the entries from B to one of its siblings and
recursively delete the entry for B from its parent
Update the keys along the path from the modified blocks to the root
Coalescing is often not implemented
Too hard for little benefit
B-trees vs Indexed Sequential Files
Concurrency control harder in B-trees
B-trees consume more space
Static index saves enough accesses to allow for reorganization
On the other hand
DBA does not know when to reorganize
DBA does not know how full to load pages of new index
Buffering
B-tree has fixed buffer requirements
Static index must read several overflow blocks to be efficient (large & variable size buffers needed for this)
LRU is bad for B-trees (always keep the root block in memory)
B-trees are better
Interesting problems
What is a good index structure when
records tend to be inserted with keys that are larger than existing values?
we want to remove older data
What is the best n?
As disks get bigger, main memories increase, and CPUs get faster?
Static hashing
Maintain an array of buckets, each bucket being a chain of
blocks
Use a hash function to compute the bucket index for record(s)
with given search key K
Search the chain of blocks of bucket h(K) for the sought records
Inserts/deletes are basic linked-list inserts/deletes
Collisions can degrade performance by producing a few
long chains
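A minimal in-memory sketch of static hashing with chained blocks. It assumes integer search keys, Python's built-in `hash` reduced modulo the bucket count as the hash function, and a Python list of lists standing in for a chain of disk blocks. The class name `StaticHashIndex` and its parameters are hypothetical.

```python
class StaticHashIndex:
    def __init__(self, num_buckets=4, block_capacity=2):
        self.num_buckets = num_buckets        # fixed for the life of the index
        self.block_capacity = block_capacity  # records per block
        # Each bucket is a chain of blocks; each block is a list of (key, value).
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % self.num_buckets]

    def insert(self, key, value):
        chain = self._chain(key)
        if len(chain[-1]) == self.block_capacity:
            chain.append([])  # last block is full: chain an overflow block
        chain[-1].append((key, value))

    def lookup(self, key):
        """Scan every block in the bucket's chain for matching records."""
        return [v for block in self._chain(key) for k, v in block if k == key]

    def delete(self, key):
        for block in self._chain(key):
            block[:] = [(k, v) for k, v in block if k != key]

    def chain_length(self, key):
        return len(self._chain(key))


index = StaticHashIndex(num_buckets=4, block_capacity=2)
for k in (0, 4, 8, 12, 16):   # all collide in bucket 0
    index.insert(k, f"v{k}")
```

After these five inserts, bucket 0 holds a chain of three blocks while the other buckets stay empty, which is the skew the slide warns about.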
Static hashing
Uses fixed hash function and number of buckets
If #buckets is too small, performance will degrade due to frequent
overflows and long chains
If #buckets is made large to anticipate future growth, secondary storage
is wasted initially or whenever the data file shrinks
One rather expensive option is to periodically rebuild the hash
index
Another option would be to devise hashing that allows for the
hash function or #buckets to change dynamically with small
overhead
Extensible hashing
Uses a hash function h that generates values over [0, 2^b)
at any time it uses only a prefix (high order bits) of i bits of the hash
i is initially 0 and changes as the size of the data file changes
it maintains a bucket array with 2^i pointers to buckets
the i-prefix h(K)[i] is used to index into the bucket array
Multiple entries of the bucket array may point to the same bucket
Each bucket j stores with it the bucket prefix i_j
All the keys it contains have the same value for their i_j prefix
There are 2^(i - i_j) pointers to bucket j from the bucket array
The number of distinct buckets changes dynamically due to coalescing and splitting of buckets.
Operations with extensible hashing
Lookup bucket for search key K
Use the i high order bits h(K)[i] of the hash to index into the bucket array and
get the bucket
Inserting search key K into bucket j
If there is room in the bucket j insert record in the bucket
Else split the bucket j and attempt the insertion again (use overflow buckets if
full again)
Deleting search key K from bucket j
Delete K from bucket j
Remove the bucket j if empty, by nullifying the relevant bucket array entry
Coalescing of buckets can be done
can coalesce only with a “buddy” bucket having the same value of i_j and the same (i_j - 1)-bit prefix, if it is present
Decreasing bucket array size is also possible
Splitting buckets in extensible hashing
Splitting bucket j depends on the #pointers to it in the bucket
array
If i > i_j (more than one pointer to bucket j)
allocate a new bucket z with bucket prefix i_j + 1
set the bucket prefix of j to i_j + 1
make the highest half of the bucket array entries pointing to j now
point to z
remove and reinsert each record in bucket j
If i = i_j (only one pointer to bucket j)
increment i and double the size of the bucket array
replace each entry in the bucket array by two entries that point to the
same bucket
then proceed as in the case i > i_j
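The splitting rules can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions: hash values have b = 8 bits, the hash of an integer key is simply its low 8 bits, and a bucket that is still full after one split just takes the extra key, standing in for an overflow block. The names `ExtensibleHash`, `Bucket`, and `prefix_len` (the bucket prefix length, i_j) are hypothetical.

```python
B = 8  # hash values lie in [0, 2**B)


def h(key):
    """Assumed toy hash for non-negative integer keys."""
    return key & (2**B - 1)


class Bucket:
    def __init__(self, prefix_len):
        self.prefix_len = prefix_len  # i_j: number of prefix bits shared by its keys
        self.keys = []


class ExtensibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.i = 0                  # number of prefix bits currently in use
        self.array = [Bucket(0)]    # bucket array with 2**i entries

    def _slot(self, key):
        return h(key) >> (B - self.i)   # the i high-order bits of the hash

    def lookup(self, key):
        return key in self.array[self._slot(key)].keys

    def insert(self, key):
        bucket = self.array[self._slot(key)]
        if len(bucket.keys) >= self.capacity and bucket.prefix_len < B:
            self._split(bucket)
            bucket = self.array[self._slot(key)]
        bucket.keys.append(key)  # if still full, this stands in for an overflow block

    def _split(self, bucket):
        if bucket.prefix_len == self.i:
            # Only one pointer to the bucket: double the array first.
            self.array = [b for b in self.array for _ in (0, 1)]
            self.i += 1
        bucket.prefix_len += 1
        new_bucket = Bucket(bucket.prefix_len)
        slots = [s for s, b in enumerate(self.array) if b is bucket]
        for s in slots[len(slots) // 2:]:   # highest half now points to the new bucket
            self.array[s] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                   # remove and reinsert each record
            self.array[self._slot(k)].keys.append(k)
```

Inserting 0, 128, 192, 64, 224 with capacity 2 doubles the array twice (i goes from 0 to 2), and the bucket holding 0 and 64 ends up shared by two array entries because its prefix length is still 1.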
Extensible hashing - example
Initial extendable hash structure, bucket size = 2 records
After inserting one Brighton and two Downtown records
After inserting Mianus
After inserting three Perryridge records
Hash structure after inserting Redwood and Round Hill records
Extensible hashing
Advantages
Hash performance does not degrade with growth of file
Minimal space overhead
No full reorganizations
Disadvantages
Extra level of indirection to find desired record
Not bad, if bucket array is in memory
Bucket array may itself become very big (larger than memory)
Need a tree structure to locate desired record in the structure!
Changing size of bucket array is an expensive operation
Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows
Linear hashing
Hash function h that generates values over [0, 2^b)
Uses suffix h(K)[i] of i low order bits of hash h(K)
Maintains array with n buckets
Attempts to limit the load factor r/n, where r=#records
Linear hashing has all the advantages of extensible hashing and
Grows the #buckets slower
Less expensive to change the size of the bucket array
Long chains still possible
Operations for Linear hashing
Lookup bucket for search key K
If m=h(K)[i] < n then return bucket m else return bucket m - 2^(i-1)
Insert record with search key K
Lookup the bucket j for K
If bucket j has space insert it there, else chain a new block to the bucket
and insert the record there
If the load factor is above a threshold
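add a new bucket to the bucket array, whose number in binary is 1a_2a_3…a_i
split the records in bucket 0a_2a_3…a_i with the new bucket
If n was 2^i before the new bucket was added, increment i by 1 first, so that the new bucket number fits in i bits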
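A minimal sketch of these operations, assuming 8-bit hash values, a toy hash that takes the low 8 bits of an integer key, one Python list per bucket in place of a chain of blocks, and a load-factor threshold of 2 records per bucket. In this sketch i is incremented just before adding a bucket whose number would not otherwise fit in i bits. The names `LinearHash` and `threshold` are hypothetical.

```python
B = 8  # hash values lie in [0, 2**B)


def h(key):
    """Assumed toy hash for non-negative integer keys."""
    return key & (2**B - 1)


class LinearHash:
    def __init__(self, threshold=2.0):
        self.threshold = threshold  # maximum allowed load factor r/n
        self.i = 0                  # number of low-order hash bits in use
        self.buckets = [[]]         # bucket array; n = len(self.buckets)
        self.r = 0                  # number of records

    @property
    def n(self):
        return len(self.buckets)

    def _bucket_no(self, key):
        m = h(key) & ((1 << self.i) - 1)  # the i low-order bits of the hash
        # If bucket m does not exist yet, fall back to m with its leading 1 bit cleared.
        return m if m < self.n else m - (1 << (self.i - 1))

    def lookup(self, key):
        return key in self.buckets[self._bucket_no(key)]

    def insert(self, key):
        self.buckets[self._bucket_no(key)].append(key)
        self.r += 1
        if self.r / self.n > self.threshold:
            self._add_bucket()

    def _add_bucket(self):
        if self.n == 1 << self.i:
            self.i += 1                         # new bucket number needs one more bit
        new_no = self.n                         # binary 1 a_2 ... a_i
        old_no = new_no - (1 << (self.i - 1))   # binary 0 a_2 ... a_i
        mask = (1 << self.i) - 1
        old = self.buckets[old_no]
        self.buckets.append([k for k in old if h(k) & mask == new_no])
        self.buckets[old_no] = [k for k in old if h(k) & mask != new_no]
```

Note that the bucket that gets split is determined by n alone, so it is not necessarily the bucket that just received the new record.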
Indexing vs hashing
Hashing is good for probes given a key
SELECT * FROM R WHERE R.K=5;
Indexing good for range queries
SELECT * FROM R WHERE R.K BETWEEN 5 AND 10;
Index definition in SQL
CREATE [UNIQUE] INDEX idxname ON rname(attrs);
DROP INDEX idxname;
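These statements can be tried with Python's built-in `sqlite3` module. The sketch below is only an illustration: the table `R(K, V)`, its contents, and the index name `idx_r_k` are made up, and SQLite is used simply because it ships with Python, not because the slides refer to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (K INTEGER, V TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(k, f"v{k}") for k in range(1, 21)])

# Define a unique index on R(K).
conn.execute("CREATE UNIQUE INDEX idx_r_k ON R(K)")

# Point probe and range query from the "Indexing vs hashing" slide.
point = conn.execute("SELECT * FROM R WHERE R.K = 5").fetchall()
in_range = conn.execute("SELECT * FROM R WHERE R.K BETWEEN 5 AND 10").fetchall()

# List the indexes on R (the index name is the second column of each row), then drop it.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('R')")]
conn.execute("DROP INDEX idx_r_k")
remaining = [row[1] for row in conn.execute("PRAGMA index_list('R')")]
```

Because the index is declared UNIQUE, inserting a second row with an existing K value while the index exists would raise `sqlite3.IntegrityError`.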