lecture 3 a b indexing
TRANSCRIPT
7/27/2019 Lecture 3 a B Indexing
http://slidepdf.com/reader/full/lecture-3-a-b-indexing 1/41
Indexing Structures
Database indexing
• Single-Level Ordered Index
– Primary Index
– Clustered index
– Secondary Index
• Multi-Level Index
• Dynamic Multi-Level Index
– Trees
– B+-trees
Basic Concepts
• Indexing mechanisms are used for more efficient search for a record in the data file.
– E.g., author catalog in library
• An index is a file of entries of the form <field value, pointer to a record>, which is ordered by field value
• Field value - attribute or set of attributes used to look up records in a file
– Not to be confused with the primary key, candidate key and superkey
Basic Concepts (contd..)
• Index files are typically much smaller than the original file
• The index is usually specified on one field of the file
(although it could be specified on several fields)
• Indexes can also be characterized as dense or sparse.
– A dense index has an index entry for every search key
value (and hence every record) in the data file.
– A sparse (or nondense) index, on the other hand, has
index entries for only some of the search values
Primary index
• For an ordered file whose ordering field is a key
• Two fields
– Data from the ordering key field (Primary Key)
– Pointer to the data
• Referred to as < K(i), P(i) >, with index entry i
• Fixed length records
• Number of index entries = number of disk blocks in the data file
• Includes one index entry for each block in the data file (sparse); the index entry has the key field value for the first record in the block, which is called the block anchor
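A minimal sketch of a lookup through a sparse primary index, assuming a tiny in-memory stand-in for the data file. The names `blocks`, `index_keys`, `index_ptrs` and `lookup` are hypothetical and not part of the lecture.

```python
from bisect import bisect_right

# Hypothetical ordered data file: each inner list is one disk block of
# (key, payload) records, sorted on the ordering key field.
blocks = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(8, "d"), (9, "e"), (12, "f")],
    [(15, "g"), (20, "h"), (21, "i")],
]

# Primary index: one entry <K(i), P(i)> per block, where K(i) is the key of
# the block anchor (first record) and P(i) is the block number.
index_keys = [block[0][0] for block in blocks]
index_ptrs = list(range(len(blocks)))

def lookup(key):
    """Return the payload stored under key, or None if there is no such record."""
    # Binary search for the last index entry with K(i) <= key.
    i = bisect_right(index_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every block anchor
    for k, payload in blocks[index_ptrs[i]]:  # scan the one candidate block
        if k == key:
            return payload
    return None
```

The index holds three entries for nine records, which is what makes it sparse: the binary search touches only index entries, and exactly one data block is read afterwards.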
Primary index (contd..)
FIGURE 14.1 Primary index on the ordering key field of the file
Clustering index
• For an ordered file whose ordering field is not a key, unlike a primary index
– ordering field values are not unique
• Clustering index has the same values as the ordering field (the clustering field)
• Includes one index entry for each distinct value of the
clustering field; the index entry points to the first data
block that contains records with that field value.
• Non-dense index
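A small sketch of a clustering index, assuming a hypothetical EMPLOYEE file held in memory and ordered on a non-key department number. A dictionary stands in for the ordered index file, and the helper names are made up for illustration.

```python
# Hypothetical EMPLOYEE file ordered on the non-key field dept; each inner
# list is one data block of (dept, name) records.
blocks = [
    [(1, "Ann"), (1, "Bob"), (2, "Cy")],
    [(2, "Di"), (2, "Ed"), (3, "Flo")],
    [(3, "Gus"), (5, "Hal"), (5, "Ida")],
]

def build_clustering_index(blocks):
    """Map each distinct clustering value to the first block that contains it."""
    index = {}
    for block_no, block in enumerate(blocks):
        for value, _ in block:
            index.setdefault(value, block_no)  # keep only the first block seen
    return index

def find_all(index, blocks, value):
    """Return the names of all records with the given clustering field value."""
    if value not in index:
        return []
    result = []
    for block in blocks[index[value]:]:   # records are contiguous, so scan
        for v, name in block:             # forward from the first block
            if v == value:
                result.append(name)
            elif v > value:
                return result             # past the cluster, stop reading
    return result

index = build_clustering_index(blocks)
```

There is one index entry per distinct value (four entries for nine records), so the index is non-dense even though every value is represented.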
Clustering index (contd..)
FIGURE 14.2 A clustering index on the DEPTNUMBER ordering nonkey field of an EMPLOYEE file.
Clustering index (contd..)
FIGURE 14.3 Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.
Secondary Indexes
• Secondary indexes are for unordered files or
for attributes which are not the ordering field.
• Two types
– Secondary index on a key field (which has a unique value in every record)
– Secondary index on a non-key field (with duplicate values)
• A file can have many secondary indexes
Secondary Index on a Key Field
• Uses a non-ordering field
• Key field → unique values
• Two fields per entry, < K(i), P(i) >
– Data from some non-ordering field
– Pointer to the block in which the record is stored or to the record itself
• Key is called a secondary key
• Includes one entry for each record in the data file, hence a dense index
Secondary Indexes (contd)
FIGURE 14.4 A dense secondary index (with block pointers) on a non-ordering key field of a file.
Secondary Indexes on a Non-key Field
• Many records can have the same value for the indexing field
• Options
1. Several index entries for the same field value – one for each record
• Dense index
2. Variable-length records for the index entries with a repeating field for the pointers
• Non-dense index
3. Create an extra level of indirection to handle multiple pointers
• Non-dense index
• Use cluster or linked list of pointer blocks if K(i) occurs in too many records (i.e. too many record pointers).
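Option 3 can be sketched roughly as follows, assuming a small in-memory file. The record layout and the names `values`, `pointer_blocks` and `lookup` are illustrative assumptions only.

```python
from bisect import bisect_left

# Hypothetical unordered file of (record_id, dept) records; dept is a
# non-key field, so many records share the same value.
records = [(0, "HR"), (1, "IT"), (2, "HR"), (3, "OPS"), (4, "IT"), (5, "HR")]

# Index entries: one fixed-length entry per distinct field value, kept sorted.
values = sorted({dept for _, dept in records})

# Level of indirection: entry i points to pointer block i, which holds the
# record pointers (here, record ids) of all records with values[i].
pointer_blocks = [[rid for rid, d in records if d == v] for v in values]

def lookup(value):
    """Return the record pointers for value, or an empty list if absent."""
    i = bisect_left(values, value)        # binary search on the index
    if i == len(values) or values[i] != value:
        return []
    return pointer_blocks[i]
```

The index itself stays small and uniform, at the price of one extra access to reach the pointer block before any record is read.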
Secondary Indexes on a Non-key Field (contd..)
FIGURE 14.5 A secondary index (with record pointers) on a non-key field implemented using one level of indirection so that index entries are of fixed length and have unique field values.
Single Level Index - Summary
Multilevel Indexes
• So far, a single-level index is an ordered index file
• We can create a primary index to the index itself
– the original index file is called the first-level index
– the index to the index is called the second-level index (one entry per block in first level)
• We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block
• A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
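The repeated "index to the index" construction can be counted level by level. The sketch below assumes every index block holds the same number of entries (called `fan_out` here); the function name and the sample numbers are hypothetical.

```python
def index_levels(first_level_entries, fan_out):
    """Return the number of blocks at each index level, bottom to top."""
    blocks_per_level = []
    entries = first_level_entries
    while True:
        blocks = (entries + fan_out - 1) // fan_out   # ceiling division
        blocks_per_level.append(blocks)
        if blocks == 1:
            return blocks_per_level   # the top level fits in one disk block
        entries = blocks              # next level: one entry per block below

levels = index_levels(30000, 68)
```

With 30,000 first-level entries and 68 entries per block, this gives 442, 7 and 1 blocks, i.e. three levels. A search then reads one block per level plus the data block.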
Multilevel Indexes (contd..)
FIGURE 14.6 A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.
Dynamic Multilevel Indexes Using B+-Trees
• In a B+-tree, data pointers are stored only at
the leaf nodes of the tree.
• The structure of leaf node differs from the
structure of the internal nodes.
• Leaf nodes are usually linked together for
ordered access on the search field
Dynamic Multilevel Indexes Using B+-Trees
Structure of internal nodes
1. Each internal node is of the form
<P1, K1, P2, K2, …, Pq-1, Kq-1, Pq>
where q ≤ p and each Pi is a tree pointer
• Within each internal node, K1 < K2 < … < Kq-1
• Each internal node has at most p tree pointers
• Each internal node, except the root and leaf nodes, has at least ceiling(p/2) tree pointers. The root node has at least two tree pointers if it is an internal node.
• An internal node with q pointers, q ≤ p, has q-1 search field values.
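The constraints above, and the choice of subtree during a search, can be sketched as follows. The slide does not state which subtree holds a value equal to Ki; the sketch assumes the common B+-tree convention that subtree Pi holds values X with K(i-1) < X ≤ Ki. Both function names are hypothetical.

```python
from bisect import bisect_left
from math import ceil

def is_valid_internal(keys, pointers, p, is_root=False):
    """Check the structural rules for an internal node of order p."""
    q = len(pointers)
    min_q = 2 if is_root else ceil(p / 2)
    return (len(keys) == q - 1                             # q pointers, q-1 keys
            and min_q <= q <= p                            # occupancy bounds
            and all(a < b for a, b in zip(keys, keys[1:])))  # K1 < K2 < ...

def child_for(keys, pointers, x):
    """Pick the tree pointer to follow when searching for value x.

    Assumed convention: follow the first Pi whose Ki satisfies x <= Ki,
    and follow the last pointer Pq when x is greater than every key.
    """
    return pointers[bisect_left(keys, x)]
```

For a node with keys [5, 10] and pointers P1, P2, P3, a search for 7 follows P2, a search for 5 follows P1, and a search for 11 follows P3.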
Dynamic Multilevel Indexes Using B+-Trees (contd..)
Structure of the leaf nodes
1. Each leaf node is of the form
<<K1, Pr1>, <K2, Pr2>, …, <Kq-1, Prq-1>, Pnext>
where q ≤ p, each Pri is a data pointer and Pnext points to the next leaf node of the B+-tree
• Within each leaf node, K1 < K2 < … < Kq-1; q ≤ p
• Each Pri points to the record whose field value is Ki or to a file block containing the record
• Each leaf node has at least ceiling(p/2) values.
• All leaf nodes are at the same level
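Because the leaf nodes are chained through Pnext, a range query only needs to walk the leaf level. The sketch below assumes a hypothetical `Leaf` class and starts from the first leaf; in a real tree the starting leaf would be found by descending from the root.

```python
class Leaf:
    """A B+-tree leaf: sorted <K, Pr> pairs plus a pointer to the next leaf."""
    def __init__(self, entries, next_leaf=None):
        self.entries = entries      # sorted list of (K, Pr) pairs
        self.next = next_leaf       # Pnext

def range_scan(leaf, low, high):
    """Collect data pointers for all keys with low <= K <= high."""
    result = []
    while leaf is not None:
        for k, pr in leaf.entries:
            if k > high:
                return result       # keys are ordered, nothing more to find
            if k >= low:
                result.append(pr)
        leaf = leaf.next            # follow Pnext to the neighbouring leaf
    return result

# Three linked leaves, built right to left so each can point to its successor.
leaf3 = Leaf([(12, "r12"), (15, "r15")])
leaf2 = Leaf([(7, "r7"), (9, "r9")], leaf3)
leaf1 = Leaf([(1, "r1"), (5, "r5")], leaf2)
```

Here `range_scan(leaf1, 5, 12)` visits the leaves in key order and never touches an internal node.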
B+-Tree
Dynamic Multilevel Indexes Using B-Trees and B+-Trees
• An insertion into a node that is not full is quite
efficient; if a node is full the insertion causes a split
into two nodes
• Splitting may propagate to other tree levels
• A deletion is quite efficient if a node does not become
less than half full
• If a deletion causes a node to become less than half
full, it must be merged with neighboring nodes
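A minimal sketch of the insertion case for a single B+-tree leaf, with keys only and data pointers omitted. The split rule used here (the left node keeps the first half rounded up, and its last key is copied to the parent as separator) is one common textbook convention and is an assumption, as are the function name and the parameter `p_leaf`.

```python
from bisect import insort

def insert_into_leaf(keys, key, p_leaf):
    """Insert key into a sorted leaf that may hold at most p_leaf keys.

    Returns (left, right, separator). When the leaf does not overflow,
    right and separator are None and left is the updated leaf.
    """
    keys = list(keys)
    insort(keys, key)                    # keep the leaf sorted
    if len(keys) <= p_leaf:
        return keys, None, None          # node was not full: cheap insert
    mid = (len(keys) + 1) // 2           # left half keeps ceil(n/2) keys
    left, right = keys[:mid], keys[mid:]
    return left, right, left[-1]         # separator is copied up to the parent
```

Inserting the separator into the parent can overflow the parent in turn, which is how a split propagates to other tree levels.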
FIGURE 14.12 An example of insertion in a B+-tree with q = 3 and pleaf = 2.
FIGURE 14.13 An example of deletion from a B+-tree.
CSE2147 Database Administration
File Organization and Efficient Searching Mechanisms - Hashing
Storage of Databases
• Databases store large amounts of data that must persist over long periods of time
• Data is accessed and processed repeatedly
• Most databases are stored permanently on magnetic disk secondary storage because
– High storage capacity (databases are too large to fit in main memory)
– Nonvolatile storage
– Cost of storage is less
Disk Storage Devices
• Data stored as magnetized areas on magnetic disk surfaces.
• A disk pack contains several magnetic disks connected to a rotating spindle.
• Disks are divided into concentric circular tracks on each disk surface.
• Because a track usually contains a large amount of information, it is divided into smaller blocks or sectors.
• Whole blocks are transferred between disk and main memory for processing.
Unordered Files
• Also called a heap or a pile file.
• New records are inserted at the end of the file.
• A linear search through the file records is necessary to
search for a record.
– This requires reading and searching half the file blocks on
the average, and is hence quite expensive.
• Record insertion is quite efficient.
• Reading the records in order of a particular field
requires sorting the file records.
Ordered Files
• Also called a sequential file.
• File records are kept sorted by the values of an ordering field.
• Insertion is expensive: records must be inserted in the correct order.
– It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.
• A binary search can be used to search for a record on its ordering field value.
– This requires reading and searching about log2(b) of the b file blocks on average, an improvement over linear search.
• Reading the records in order of the ordering field is quite efficient.
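The block-level binary search can be sketched as below, counting how many blocks are read. The file is modelled as a hypothetical list of non-empty sorted blocks whose records are just their key values; `binary_search_blocks` is an illustrative name.

```python
def binary_search_blocks(blocks, key):
    """Search an ordered file; return (record or None, number of blocks read)."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]          # one block access
        reads += 1
        if key < block[0]:
            hi = mid - 1             # key can only be in an earlier block
        elif key > block[-1]:
            lo = mid + 1             # key can only be in a later block
        else:
            return (key if key in block else None), reads
    return None, reads

# 16 blocks of 4 records each, holding the keys 0..63 in order.
blocks = [list(range(i, i + 4)) for i in range(0, 64, 4)]
```

With b = 16 blocks no search reads more than 5 blocks, whereas a linear search through an unordered file of the same size would read about 8 blocks on average for a successful search.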
Ordered Files (cont.)
Hashing
• Another type of internal search structure
• Provides fast access to records on certain search conditions
• The organisation is usually called a hash file
• Internal Hashing : for internal files
• External Hashing : for disk files
Hashed Files
• Hashing is typically implemented as a hash table through the use of an array of records
– E.g. if the array index range is 0 to M-1, then we have M slots
– The hashing function transforms the hash field value into an integer between 0 and M-1
– Several hashing functions can be used to get the array index
• The problem with most hash functions is that they do not guarantee that distinct values will hash to distinct addresses.
– The number of possible values a hash field can take is usually much larger than the address space (number of available addresses for records)
– A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.
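As an illustration, the sketch below uses h(K) = K mod M, which is only one of the several possible hashing functions, and records which insertions collide. The slot count, the key values and the names are hypothetical.

```python
M = 7  # number of slots, indexed 0 .. M-1

def h(key):
    """Map an integer hash field value to a slot number in 0 .. M-1."""
    return key % M

keys = [10, 22, 31, 4, 15, 28, 17]
slots = {}          # slot number -> key stored there
collisions = []     # (new key, key already occupying its slot)
for k in keys:
    addr = h(k)
    if addr in slots:
        collisions.append((k, slots[addr]))
    else:
        slots[addr] = k
```

Three of the seven keys collide here (31 and 17 with 10, and 15 with 22), even though the table has as many slots as there are keys.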
Hashed Files (cont.)
There are numerous methods for collision resolution, including the following:
• Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.
• Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.
• Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
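Open addressing can be sketched as below. The sketch assumes h(K) = K mod M, wraps around to position 0 after the last slot, and does not support deletion; these choices and the function names are assumptions made for illustration.

```python
M = 7
table = [None] * M   # internal hash table with positions 0 .. M-1

def insert(key):
    """Place key at h(key), or at the next unused position after it."""
    for step in range(M):
        pos = (key % M + step) % M    # probe subsequent positions in order
        if table[pos] is None:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")

def search(key):
    """Return the position of key, or None if it is not in the table."""
    for step in range(M):
        pos = (key % M + step) % M
        if table[pos] is None:
            return None               # an empty position ends the probe run
        if table[pos] == key:
            return pos
    return None

positions = [insert(k) for k in [10, 22, 31, 4, 15]]
```

Key 31 hashes to the occupied position 3 and lands in 4, which in turn pushes key 4 to position 5. This knock-on effect is one reason the other resolution methods exist.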
Internal hashing data structures. (a) Array of M positions for use in internal hashing. (b) Collision resolution by chaining records.
Hashed Files
• Hashing for disk files is called External Hashing
• The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ..., bucketM-1.
– Typically, a bucket corresponds to one (or a fixed number of) disk block.
• One of the file fields is designated to be the hash key of the file.
• The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.
• Search is very efficient on the hash key.
• Collisions occur when a new record hashes to a bucket that is already full.
– An overflow file is kept for storing such records.
– Overflow records that hash to each bucket can be linked together.
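A rough in-memory sketch of static external hashing with per-bucket overflow chains follows. The bucket count, the bucket capacity, h(K) = K mod M and all names are assumptions chosen to keep the example small.

```python
M = 3                # number of buckets
BUCKET_CAPACITY = 2  # records that fit in one bucket (hypothetical)

buckets = [[] for _ in range(M)]    # bucket i stands in for a disk block
overflow = [[] for _ in range(M)]   # overflow records linked per bucket

def h(key):
    return key % M

def insert(key, record):
    i = h(key)
    # A full bucket sends the new record to that bucket's overflow chain.
    target = buckets[i] if len(buckets[i]) < BUCKET_CAPACITY else overflow[i]
    target.append((key, record))

def search(key):
    i = h(key)
    for k, record in buckets[i] + overflow[i]:   # bucket first, then chain
        if k == key:
            return record
    return None

for k in [3, 6, 9, 4, 12]:
    insert(k, f"rec{k}")
```

Keys 3, 6, 9 and 12 all hash to bucket 0, so 9 and 12 end up on its overflow chain and cost extra accesses to find.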
Hashed Files (cont.)
Hashed Files (cont.)
• To reduce overflow records, a hash file is typically kept 70-80% full.
• The hash function h should distribute the records uniformly among the buckets
– Otherwise, search time will be increased because many overflow records will exist.
• Main disadvantages of static external hashing:
– Fixed number of buckets M is a problem if the number of records in the file grows or shrinks.
– Ordered access on the hash key is quite inefficient (requires sorting the records).
Hashed Files – Overflow Handling
Dynamic And Extendible Hashed Files
• Dynamic and Extendible Hashing Techniques
– Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records.
– These techniques include the following: dynamic hashing, extendible hashing, and linear hashing.
• Both dynamic and extendible hashing use the binary representation of the hash value h(K) in order to access a directory.
– In dynamic hashing the directory is a binary tree.
– In extendible hashing the directory is an array of size 2^d, where d is called the global depth.
Dynamic And Extendible Hashing (cont.)
• The directories can be stored on disk, and they expand or shrink dynamically.
– Directory entries point to the disk blocks that contain the stored records.
• An insertion in a disk block that is full causes the block to split into two blocks and the records are redistributed among the two blocks.
– The directory is updated appropriately.
• Dynamic and extendible hashing do not require an overflow area.
• Linear hashing does require an overflow area but does not use a directory.
– Blocks are split in linear order as the file expands.
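Extendible hashing can be sketched compactly as below. This is an illustration under stated assumptions rather than a definitive implementation: h(K) is taken to be the integer key itself, the directory is indexed by the low-order bits of h(K), keys are distinct, and the directory never shrinks. The class and attribute names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth      # local depth: number of bits this bucket uses
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity                  # keys per bucket (block)
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]   # array of size 2**global_depth

    def _slot(self, key):
        # Directory position = the global_depth low-order bits of h(K) = K.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.depth == self.global_depth:
            # Double the directory; each new slot shares its twin's bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split the full bucket on one more bit and redistribute its keys.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        bit = 1 << (bucket.depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._slot(k)].keys.append(k)
        self.insert(key)                          # retry; may split again

eh = ExtendibleHash(capacity=2)
for k in [0, 2, 4, 6, 1, 8]:
    eh.insert(k)
```

After these six insertions the directory has doubled twice, but only the overflowing buckets were split, so several directory entries still share a bucket and no overflow area was needed.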
Extendible Hashing