lecture 3 a b indexing


Upload: ma-jessica

Post on 14-Apr-2018


Page 1: Lecture 3 a B Indexing

7/27/2019 Lecture 3 a B Indexing

http://slidepdf.com/reader/full/lecture-3-a-b-indexing 1/41

Indexing Structures


Database indexing

• Single-Level Ordered Index

– Primary Index

– Clustered index

– Secondary Index

• Multi-Level Index

• Dynamic Multi-Level Index

– Trees

– B+-trees


Basic Concepts

• Indexing mechanisms are used to search more efficiently for a record in the data file.

– E.g., the author catalog in a library

• An index is a file of entries of the form <field value, pointer to a record>, ordered by field value.

• Field value: the attribute or set of attributes used to look up records in a file

– Not to be confused with the primary key, candidate key, and superkey


Basic Concepts (contd..)

• Index files are typically much smaller than the original file.

• The index is usually specified on one field of the file (although it could be specified on several fields).

• Indexes can also be characterized as dense or sparse.

– A dense index has an index entry for every search key value (and hence every record) in the data file.

– A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.


Primary index

• For an ordered file whose ordering field is a key

• Each index entry has two fields

– Data from the ordering key field (primary key)

– Pointer to the data

• Entry i is referred to as <K(i), P(i)>

• Index entries are fixed-length records

• Number of index entries = number of disk blocks in the data file

• Includes one index entry for each block in the data file (sparse); the index entry has the key field value of the first record in the block, which is called the block anchor
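The lookup that a sparse primary index supports can be sketched as follows. This is a minimal illustration, not the lecture's own code: the `blocks` list, the record contents, and the `lookup` helper are all hypothetical, and a block pointer is simply a list position.

```python
from bisect import bisect_right

# Hypothetical data file: each block holds records sorted on the ordering key.
blocks = [
    [(2, "Ann"), (5, "Bob"), (8, "Cy")],
    [(12, "Dee"), (15, "Eli"), (19, "Fay")],
    [(23, "Gus"), (27, "Hal"), (31, "Ida")],
]

# Sparse primary index: one <K(i), P(i)> entry per block,
# where K(i) is the block anchor (key of the first record in the block).
primary_index = [(block[0][0], i) for i, block in enumerate(blocks)]
anchors = [k for k, _ in primary_index]


def lookup(key):
    """Return the record with the given key, or None if it is absent."""
    pos = bisect_right(anchors, key) - 1  # last entry whose anchor <= key
    if pos < 0:
        return None  # key is smaller than every block anchor
    _, block_no = primary_index[pos]
    for k, value in blocks[block_no]:  # scan the single candidate block
        if k == key:
            return (k, value)
    return None
```

Only one data block is read per lookup, after a binary search over the much smaller index.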


Primary index (contd..)

FIGURE 14.1 Primary index on the ordering key field of the file.


Clustering index

• For an ordered file whose ordering field is not a key (unlike a primary index)

– Values of the ordering field are not unique

• Index entries take their values from the ordering field, called the clustering field

• Includes one index entry for each distinct value of the clustering field; the index entry points to the first data block that contains records with that field value.

• Non-dense index
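A clustering index built this way might look like the sketch below. The sample file (ordered on a department number, two records per block) and both helper functions are assumptions made for illustration only.

```python
# Hypothetical file ordered on the non-key field dept; two records per block.
blocks = [
    [(1, "Ann"), (1, "Bob")],
    [(1, "Cy"), (2, "Dee")],
    [(2, "Eli"), (3, "Fay")],
    [(3, "Gus"), (3, "Hal")],
]


def build_clustering_index(blocks):
    """One entry per distinct value: (value, first block containing it)."""
    index = []
    seen = set()
    for block_no, block in enumerate(blocks):
        for dept, _ in block:
            if dept not in seen:
                seen.add(dept)
                index.append((dept, block_no))
    return index


def records_with(dept, index, blocks):
    """Collect all names for dept, starting at the block the index points to."""
    start = dict(index).get(dept)
    if start is None:
        return []
    names = []
    for block in blocks[start:]:
        for d, name in block:
            if d == dept:
                names.append(name)
            elif d > dept:  # file is ordered, so no later match is possible
                return names
    return names


clustering_index = build_clustering_index(blocks)
```

The index has three entries for eight records, which is why it counts as non-dense.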


Clustering index (contd..)

FIGURE 14.2 A clustering index on the DEPTNUMBER ordering nonkey field of an EMPLOYEE file.


Clustering index (contd..)

FIGURE 14.3 Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.


Secondary Indexes

• Secondary indexes are for unordered files or for attributes which are not the ordering field.

• Two types

– Secondary index on a key field (which has a unique value in every record)

– Secondary index on a non-key field (with duplicate values)

• A file can have many secondary indexes


Secondary Index on a Key Field

• Uses a non-ordering field

• Key field → unique values

• Each entry has two fields, <K(i), P(i)>

– Data from a non-ordering field

– Pointer to the block in which the record is stored or to the record itself

• The key is called a secondary key

• Includes one entry for each record in the data file, hence a dense index


Secondary Indexes (contd)

FIGURE 14.4 A dense secondary index (with block pointers) on a non-ordering key field of a file.


Secondary Indexes on a Non-key Field

• Many records can have the same value for the indexing field

• Options

1. Several index entries for the same field value, one for each record

• Dense index

2. Variable-length records for the index entries, with a repeating field for the pointers

• Non-dense index

3. Create an extra level of indirection to handle multiple pointers

• Non-dense index

• Use a cluster or linked list of pointer blocks if K(i) occurs in too many records (i.e., there are too many record pointers to fit in one block)
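Option 3 can be sketched roughly as below. The record layout, the use of list positions as record pointers, and the function names are hypothetical; a real system would store the pointer blocks on disk.

```python
from bisect import bisect_left

# Hypothetical unordered file; a record pointer is the record's position.
records = [("Ann", 3), ("Bob", 1), ("Cy", 3), ("Dee", 2), ("Eli", 1)]


def build_secondary_index(records):
    """Return (index, pointer_blocks) for the non-key field in position 1."""
    groups = {}
    for rid, (_, dept) in enumerate(records):
        groups.setdefault(dept, []).append(rid)
    index = []           # fixed-length entries with unique field values
    pointer_blocks = []  # the extra level of indirection
    for dept in sorted(groups):
        index.append((dept, len(pointer_blocks)))
        pointer_blocks.append(groups[dept])
    return index, pointer_blocks


def find(dept, index, pointer_blocks, records):
    """Follow index entry -> pointer block -> records."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, dept)
    if pos == len(keys) or keys[pos] != dept:
        return []
    block_id = index[pos][1]
    return [records[rid] for rid in pointer_blocks[block_id]]


index, pointer_blocks = build_secondary_index(records)
```

Each index entry stays fixed-length because it holds a single pointer, however many records share the value.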


Secondary Indexes on a Non-key Field (contd..)

FIGURE 14.5 A secondary index (with record pointers) on a non-key field implemented using one level of indirection so that index entries are of fixed length and have unique field values.


Single Level Index - Summary


Multilevel Indexes

• A single-level index is itself an ordered file

• We can therefore create a primary index on the index itself

– the original index file is called the first-level index

– the index to the index is called the second-level index (one entry per block in the first level)

• We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block

• A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
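How many levels this process produces can be worked out with a short calculation. The sketch below assumes every index block holds the same number of entries (called `fan_out` here); the function name and the sample numbers are illustrative and do not come from the slides.

```python
def index_levels(first_level_entries, fan_out):
    """Return the number of blocks at each level, first level first."""
    if first_level_entries < 1 or fan_out < 2:
        raise ValueError("need at least one entry and a fan-out of 2 or more")
    levels = []
    entries = first_level_entries
    while True:
        blocks = (entries + fan_out - 1) // fan_out  # ceiling division
        levels.append(blocks)
        if blocks == 1:  # the top level fits in one disk block
            return levels
        entries = blocks  # next level has one entry per block of this level


# Assumed example: 30,000 first-level entries, 68 entries per index block.
levels = index_levels(30000, 68)
```

With these assumed numbers the index has three levels, so a lookup reads three index blocks plus one data block.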


Multilevel Indexes (contd..)

FIGURE 14.6 A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.


Dynamic Multilevel Indexes Using B+-Trees

• In a B+-tree, data pointers are stored only at the leaf nodes of the tree.

• The structure of the leaf nodes differs from the structure of the internal nodes.

• Leaf nodes are usually linked together for ordered access on the search field


Dynamic Multilevel Indexes Using B+-Trees

Structure of internal nodes

• Each internal node is of the form <P_1, K_1, P_2, K_2, ..., P_{q-1}, K_{q-1}, P_q>, where q ≤ p and each P_i is a tree pointer

• Within each internal node, K_1 < K_2 < ... < K_{q-1}

• Each internal node has at most p tree pointers

• Each internal node, except the root, has at least ceiling(p/2) tree pointers. The root node has at least two tree pointers if it is an internal node.

• An internal node with q pointers, q ≤ p, has q-1 search field values.


Dynamic Multilevel Indexes Using B+-Trees (contd..)

Structure of the leaf nodes

• Each leaf node is of the form <<K_1, Pr_1>, <K_2, Pr_2>, ..., <K_{q-1}, Pr_{q-1}>, P_next>, where q ≤ p, each Pr_i is a data pointer, and P_next points to the next leaf node of the B+-tree

• Within each leaf node, K_1 < K_2 < ... < K_{q-1}

• Each Pr_i points to the record whose field value is K_i, or to a file block containing the record

• Each leaf node has at least ceiling(p/2) values.

• All leaf nodes are at the same level
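A search over nodes of this shape can be sketched as follows. The class names and the tiny hand-built tree are hypothetical. The sketch also assumes a convention the slides do not state: the subtree at pointer P_i holds the values X with K_{i-1} < X ≤ K_i.

```python
from bisect import bisect_left


class Leaf:
    def __init__(self, keys, pointers, next_leaf=None):
        self.keys = keys            # K_1 < K_2 < ...
        self.pointers = pointers    # Pr_i, one data pointer per key
        self.next_leaf = next_leaf  # P_next


class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # q-1 search field values
        self.children = children    # q tree pointers


def search(node, key):
    """Descend to a leaf and return the data pointer for key, or None."""
    while isinstance(node, Internal):
        # Assumed convention: child i covers K_{i-1} < X <= K_i.
        node = node.children[bisect_left(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i]
    return None


def scan(leaf):
    """Ordered access: follow P_next links from a starting leaf."""
    while leaf is not None:
        yield from leaf.keys
        leaf = leaf.next_leaf


# Hand-built example tree with p = 3; data pointers are placeholder strings.
leaf3 = Leaf([8, 9], ["r8", "r9"])
leaf2 = Leaf([6, 7], ["r6", "r7"], leaf3)
leaf1 = Leaf([1, 5], ["r1", "r5"], leaf2)
root = Internal([5, 7], [leaf1, leaf2, leaf3])
```

Every search ends at a leaf, since data pointers are stored only there, and `scan` shows why the leaf links make range queries cheap.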


B+-Tree


Dynamic Multilevel Indexes Using B-Trees and B+-Trees

• An insertion into a node that is not full is quite efficient; if the node is full, the insertion causes it to split into two nodes

• Splitting may propagate to other tree levels

• A deletion is quite efficient if the node does not become less than half full

• If a deletion causes a node to become less than half full, it must be merged with a neighboring node (or take entries redistributed from one)
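The split of a full leaf can be sketched as below. This is a simplified illustration of one common B+-tree convention (the left leaf keeps the larger half, and its largest key is copied up as the separator); the function name is hypothetical, and duplicate keys and parent updates are left out.

```python
from bisect import insort


def insert_into_leaf(keys, key, p_leaf):
    """Insert key into a sorted leaf holding at most p_leaf keys.

    Returns (left, right, separator). right and separator are None
    when the leaf had room and no split was needed.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= p_leaf:
        return keys, None, None
    mid = (len(keys) + 1) // 2        # ceiling((p_leaf + 1) / 2) stay left
    left, right = keys[:mid], keys[mid:]
    return left, right, left[-1]      # separator is copied up to the parent
```

The separator returned here must then be inserted into the parent, which is how a split can propagate to other tree levels.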


FIGURE 14.12 An example of insertion in a B+-tree with p = 3 and p_leaf = 2.


FIGURE 14.13 An example of deletion from a B+-tree.


CSE2147 Database Administration

File Organization and Efficient Searching Mechanisms - Hashing


Storage of Databases

• Databases store large amounts of data that must persist over long periods of time

• Data is accessed and processed repeatedly

• Most databases are stored permanently on magnetic disk secondary storage because of its

– High storage capacity (databases are too large to fit in main memory)

– Nonvolatile storage

– Lower cost of storage


Disk Storage Devices

• Data is stored as magnetized areas on magnetic disk surfaces.

• A disk pack contains several magnetic disks connected to a rotating spindle.

• Each disk surface is divided into concentric circular tracks.

• Because a track usually contains a large amount of information, it is divided into smaller blocks or sectors.

• Whole blocks are transferred between disk and main memory for processing.


Unordered Files

• Also called a heap or a pile file.

• New records are inserted at the end of the file.

• A linear search through the file records is necessary to search for a record.

– This requires reading and searching half the file blocks on average, and is hence quite expensive.

• Record insertion is quite efficient.

• Reading the records in order of a particular field requires sorting the file records.


Ordered Files

• Also called a sequential file.

• File records are kept sorted by the values of an ordering field.

• Insertion is expensive: records must be inserted in the correct order.

– It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.

• A binary search can be used to search for a record on its ordering field value.

– For a file of b blocks, this requires reading and searching about log2(b) blocks on average, an improvement over linear search.

• Reading the records in order of the ordering field is quite efficient.
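The block-level binary search can be sketched as follows, counting block reads to make the log2 cost visible. The sample file of 8 blocks and the function name are assumptions for illustration; a block read is modeled as a list access.

```python
def binary_search_blocks(blocks, key):
    """Search an ordered file; return (record or None, blocks read)."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]  # one simulated block read
        reads += 1
        if key < block[0][0]:      # key precedes first record of the block
            hi = mid - 1
        elif key > block[-1][0]:   # key follows last record of the block
            lo = mid + 1
        else:                      # key can only be in this block
            for record in block:
                if record[0] == key:
                    return record, reads
            return None, reads
    return None, reads


# Hypothetical ordered file: 8 blocks, 2 records each, keys 0..15.
blocks = [[(2 * i, f"r{2 * i}"), (2 * i + 1, f"r{2 * i + 1}")] for i in range(8)]
```

For these 8 blocks no search reads more than 4 blocks, whereas a linear search of an unordered file would read 4 on average and 8 in the worst case.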


Ordered Files (cont.)


Hashing

• Another type of internal search structure

• Provides fast access to records on certain search conditions

• The organisation is usually called a hash file

• Internal Hashing: for internal files

• External Hashing: for disk files


Hashed Files

• Hashing is typically implemented as a hash table through the use of an array of records

– E.g., if the array index range is 0 to M-1, then we have M slots

– The hashing function transforms the hash field value into an integer between 0 and M-1

– Several hashing functions can be used to get the array index

• The problem with most hash functions is that they do not guarantee that distinct values will hash to distinct addresses.

– The number of possible values a hash field can take is usually much larger than the address space (the number of available addresses for records)

– A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.
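A small sketch makes the collision problem concrete. It assumes the common choice h(K) = K mod M, which the slides do not name, with M = 7 and a handful of made-up integer keys.

```python
M = 7  # number of slots, addresses 0 to M-1 (assumed value)


def h(key):
    """Assumed hashing function: h(K) = K mod M."""
    return key % M


keys = [10, 22, 17, 5]
slots = {}        # address -> key stored there
collisions = []   # (new key, key already at that address)

for k in keys:
    address = h(k)
    if address in slots:
        collisions.append((k, slots[address]))
    else:
        slots[address] = k
```

Keys 10 and 17 are distinct yet both map to address 3, so the second one needs a collision resolution method.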


Hashed Files (cont.)

There are numerous methods for collision resolution, including the following:

• Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.

• Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.

• Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
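Open addressing, the first of these methods, can be sketched as below. The sketch assumes h(K) = K mod M with M = 7 and wraps around at the end of the array; the function names are hypothetical and deletion is not handled.

```python
M = 7
EMPTY = None


def h(key):
    """Assumed hashing function: h(K) = K mod M."""
    return key % M


def insert(table, key):
    """Store key at its hash address or the next unused position."""
    for step in range(M):
        pos = (h(key) + step) % M  # check subsequent positions in order
        if table[pos] is EMPTY or table[pos] == key:
            table[pos] = key
            return pos
    raise RuntimeError("hash table is full")


def find(table, key):
    """Return the position of key, or None if it is not stored."""
    for step in range(M):
        pos = (h(key) + step) % M
        if table[pos] is EMPTY:  # an empty slot ends the probe sequence
            return None
        if table[pos] == key:
            return pos
    return None


table = [EMPTY] * M
positions = [insert(table, k) for k in (10, 17, 24)]  # all hash to 3
```

The search repeats the same probe sequence as the insertion, which is why it can stop at the first empty position.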


Internal hashing data structures. (a) Array of M positions for use in internal hashing. (b) Collision resolution by chaining records.


Hashed Files

• Hashing for disk files is called External Hashing

• The file blocks are divided into M equal-sized buckets, numbered bucket_0, bucket_1, ..., bucket_{M-1}.

– Typically, a bucket corresponds to one (or a fixed number of) disk blocks.

• One of the file fields is designated to be the hash key of the file.

• The record with hash key value K is stored in bucket i, where i = h(K), and h is the hashing function.

• Search is very efficient on the hash key.

• Collisions occur when a new record hashes to a bucket that is already full.

– An overflow file is kept for storing such records.

– Overflow records that hash to each bucket can be linked together.
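Static external hashing with per-bucket overflow chains can be sketched as follows. The `HashFile` class, the mod-M hashing function, and the in-memory lists standing in for disk blocks are all assumptions made for illustration.

```python
class HashFile:
    """Toy model of a static hash file with M fixed-capacity buckets."""

    def __init__(self, num_buckets, bucket_capacity):
        self.m = num_buckets
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(num_buckets)]
        # Overflow records are linked per bucket (modeled as one list each).
        self.overflow = [[] for _ in range(num_buckets)]

    def h(self, key):
        """Assumed hashing function: h(K) = K mod M."""
        return key % self.m

    def insert(self, key, record):
        i = self.h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:  # bucket i is already full: a collision
            self.overflow[i].append((key, record))

    def search(self, key):
        i = self.h(key)
        for k, record in self.buckets[i] + self.overflow[i]:
            if k == key:
                return record
        return None


hash_file = HashFile(num_buckets=3, bucket_capacity=2)
for key in (0, 3, 6, 1):
    hash_file.insert(key, f"r{key}")
```

A search touches only bucket h(K) and its own overflow chain, never the other buckets.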


Hashed Files (cont.)


Hashed Files (cont.)

• To reduce overflow records, a hash file is typically kept 70-80% full.

• The hash function h should distribute the records uniformly among the buckets

– Otherwise, search time will be increased because many overflow records will exist.

• Main disadvantages of static external hashing:

– The fixed number of buckets M is a problem if the number of records in the file grows or shrinks.

– Ordered access on the hash key is quite inefficient (requires sorting the records).


Hashed Files - Overflow Handling


Dynamic And Extendible Hashed Files

• Dynamic and Extendible Hashing Techniques

– Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records.

– These techniques include the following: dynamic hashing, extendible hashing, and linear hashing.

• Both dynamic and extendible hashing use the binary representation of the hash value h(K) in order to access a directory.

– In dynamic hashing the directory is a binary tree.

– In extendible hashing the directory is an array of size 2^d, where d is called the global depth.
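Directory addressing in extendible hashing can be sketched as below. The sketch assumes an 8-bit hash value and uses its leading d bits as the directory index; the hash function, the bucket labels, and the sample directory are all hypothetical.

```python
HASH_BITS = 8  # assumed width of the hash value


def h(key):
    """Assumed hashing function producing an 8-bit value."""
    return key % (1 << HASH_BITS)


def directory_index(key, global_depth):
    """Use the leading global_depth bits of h(key) as the directory index."""
    return h(key) >> (HASH_BITS - global_depth)


# Directory of size 2^d with d = 2. Entries 00 and 01 share bucket "A",
# which is how a bucket that has not been split yet is represented.
global_depth = 2
directory = ["A", "A", "B", "C"]


def bucket_for(key):
    return directory[directory_index(key, global_depth)]
```

When a bucket that only one directory entry points to overflows, the directory doubles to size 2^(d+1), which is the dynamic growth the technique is named for.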


Dynamic And Extendible Hashing (cont.)

• The directories can be stored on disk, and they expand or shrink dynamically.

– Directory entries point to the disk blocks that contain the stored records.

• An insertion in a disk block that is full causes the block to split into two blocks, and the records are redistributed among the two blocks.

– The directory is updated appropriately.

• Dynamic and extendible hashing do not require an overflow area.

• Linear hashing does require an overflow area but does not use a directory.

– Blocks are split in linear order as the file expands.
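How linear hashing finds a bucket without a directory can be sketched as follows. The slides do not give the hash functions, so the sketch assumes the usual pair K mod M and K mod 2M, where M is the bucket count at the start of the current round and n is the number of buckets already split in linear order.

```python
def bucket_address(key, m, n):
    """Return the bucket number for key in a linear hash file.

    m: number of buckets at the start of the current round (assumed name)
    n: number of buckets already split, in order 0, 1, ..., n-1
    """
    address = key % m
    if address < n:  # that bucket has already been split
        address = key % (2 * m)  # so the finer hash function applies
    return address
```

Only the counter n has to be kept, which is what replaces the directory.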

Extendible Hashing