lecture 3 a b indexing

7/27/2019 Lecture 3 a B Indexing

http://slidepdf.com/reader/full/lecture-3-a-b-indexing 1/41

Indexing Structures



Database indexing

• Single-Level Ordered Index

– Primary Index

– Clustered index

– Secondary Index

• Multi-Level Index

• Dynamic Multi-Level Index

– Trees

– B+-trees



Basic Concepts

• Indexing mechanisms are used to make searching for a record in the data file more efficient.

– E.g., author catalog in library

• An index is a file of entries of the form <field value, pointer to a record>, ordered by field value

• Field value - the attribute or set of attributes used to look up records in a file

– Not to be confused with the primary key, candidate key and superkey
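A minimal sketch of the entry format above, assuming an in-memory list stands in for the data file and a list position stands in for the record pointer. The author names, `data_file`, and `lookup` are hypothetical, chosen only to echo the library catalog analogy.

```python
from bisect import bisect_left

# Hypothetical data file: a record's position in the list acts as its "pointer".
data_file = [
    ("Knuth", "book A"),
    ("Date", "book B"),
    ("Ullman", "book C"),
    ("Codd", "book D"),
]

# Index entries of the form <field value, pointer>, ordered by field value.
index = sorted((author, pos) for pos, (author, _) in enumerate(data_file))

def lookup(value):
    """Binary-search the index; return the record pointer, or None if absent."""
    field_values = [fv for fv, _ in index]
    i = bisect_left(field_values, value)
    if i < len(index) and index[i][0] == value:
        return index[i][1]
    return None

pointer = lookup("Date")
print(data_file[pointer])
```

The data file itself is left unsorted; only the much smaller index is kept in field-value order.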



Basic Concepts (contd.)

• Index files are typically much smaller than the original file

• The index is usually specified on one field of the file

(although it could be specified on several fields)

• Indexes can also be characterized as dense or sparse.

– A dense index has an index entry for every search key

value (and hence every record) in the data file.

– A sparse (or nondense) index, on the other hand, has

index entries for only some of the search values
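The difference in size between the two kinds of index can be sketched as follows. The three-block file and its key values are made-up illustration data, and taking the first key of each block as the sparse entry is one common choice rather than the only one.

```python
# Hypothetical ordered data file: 3 blocks holding 3 search-key values each.
blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26]]

# Dense index: one (key, block number) entry for every search-key value.
dense_index = [(key, block_no)
               for block_no, block in enumerate(blocks)
               for key in block]

# Sparse index: entries for only some values, here the first key of each block.
sparse_index = [(block[0], block_no) for block_no, block in enumerate(blocks)]

print(len(dense_index), len(sparse_index))
```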



Primary index

• For an ordered file whose ordering field is a key

• Two fields

– Data from the ordering key field (Primary Key)

– Pointer to the data

• Referred to as < K(i), P(i) >, for index entry i

• Fixed length records

• Number of index entries = number of disk blocks in the data file

• Includes one index entry for each block in the data file (sparse); the index entry has the key field value for the first record in the block, which is called the block anchor
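A lookup through a primary index can be sketched like this, assuming small in-memory lists stand in for disk blocks. The names `primary_index` and `find` and the key values are hypothetical.

```python
from bisect import bisect_right

# Hypothetical data file ordered on a key field, stored in 3 blocks.
blocks = [[2, 5, 8], [11, 14, 17], [20, 23, 26]]

# One <K(i), P(i)> entry per block: K(i) is the block anchor, P(i) the block number.
primary_index = [(block[0], block_no) for block_no, block in enumerate(blocks)]

def find(key):
    """Return the block number that holds key, or None if the key is absent."""
    anchors = [k for k, _ in primary_index]
    i = bisect_right(anchors, key) - 1      # last anchor that is <= key
    if i < 0:
        return None                          # key is smaller than every anchor
    block_no = primary_index[i][1]
    return block_no if key in blocks[block_no] else None

print(find(14))
```

Only the index is binary-searched; a single data block is then read to confirm whether the record exists.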



Primary index (contd.)

FIGURE 14.1: Primary index on the ordering key field of the file.



Clustering index

• For an ordered file whose ordering field is not a key, unlike a primary index

– values are not unique

• The clustering index holds the same values as the ordering field (the clustering field)

• Includes one index entry for each distinct value of the

clustering field; the index entry points to the first data

block that contains records with that field value.

• Non-dense index
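A rough sketch of building and using a clustering index, assuming a file ordered on a department number in the spirit of the DEPTNUMBER example. The block contents and the helper `records_with` are hypothetical.

```python
# Hypothetical file ordered on a non-key field; each value is a department number.
blocks = [[1, 1, 1], [2, 2, 3], [3, 3, 3], [4, 5, 5]]

# One entry per distinct value, pointing to the FIRST block containing it.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for dept in block:
        clustering_index.setdefault(dept, block_no)

def records_with(dept):
    """Collect every record with the given value, scanning forward from the first block."""
    if dept not in clustering_index:
        return []
    found = []
    for block in blocks[clustering_index[dept]:]:
        if block[0] > dept:          # the file is ordered, so we can stop here
            break
        found.extend(v for v in block if v == dept)
    return found

print(clustering_index)
print(len(records_with(3)))
```

Because records with the same value are stored contiguously, one index entry per distinct value is enough to reach all of them.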



Clustering index (contd.)

FIGURE 14.2: A clustering index on the DEPTNUMBER ordering nonkey field of an EMPLOYEE file.



Clustering index (contd.)

FIGURE 14.3: Clustering index with a separate block cluster for each group of records that share the same value for the clustering field.



Secondary Indexes

• Secondary indexes are for unordered files or for attributes which are not the ordering field.

• Two types

– Secondary index on a key field (which has a unique value in every record)

– Secondary index on a non-key field (with duplicate values)

• A file can have many secondary indexes



Secondary Index on a Key Field

• Uses a non-ordering field

• Key field → unique values

• Two fields per entry, < K(i), P(i) >

– Data from some non-ordering field

– Pointer to the block in which the record is stored or to the record itself

• Key is called a secondary key

• Includes one entry for each record in the data file, hence dense index



Secondary Indexes (contd.)

FIGURE 14.4: A dense secondary index (with block pointers) on a non-ordering key field of a file.



Secondary Indexes on a Non-key Field

• Many records can have the same value for the indexing field

• Options

1. Several index entries for the same field value – one for each record

• Dense index

2. Variable-length records for the index entries with a repeating field for the pointers

• Non-dense index

3. Create an extra level of indirection to handle multiple pointers

• Non-dense index

• Use a cluster or linked list of pointer blocks if K(i) occurs in too many records (i.e., too many record pointers).
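Option 3 can be sketched as below, assuming record ids stand in for record pointers and a list of lists stands in for the blocks of pointers. The department names and the `lookup` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical unordered file: (record id, dept); dept is a non-key field.
records = [(0, "HR"), (1, "IT"), (2, "HR"), (3, "OPS"), (4, "IT"), (5, "HR")]

# Group the record pointers by field value.
groups = defaultdict(list)
for record_id, dept in records:
    groups[dept].append(record_id)

# Extra level of indirection: each index entry is fixed length and unique,
# and points to a block that holds all record pointers for that value.
pointer_blocks = []
index = []
for dept in sorted(groups):
    index.append((dept, len(pointer_blocks)))
    pointer_blocks.append(groups[dept])

def lookup(dept):
    """Follow index entry -> pointer block -> record pointers."""
    for value, block_no in index:        # a real index would be binary-searched
        if value == dept:
            return pointer_blocks[block_no]
    return []

print(lookup("HR"))
```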



Secondary Indexes on a Non-key Field (contd.)

FIGURE 14.5: A secondary index (with record pointers) on a non-key field implemented using one level of indirection so that index entries are of fixed length and have unique field values.



Single Level Index - Summary



Multilevel Indexes

• So far, a single-level index is an ordered index file

• We can create a primary index to the index itself

– the original index file is called the first-level index

– the index to the index is called the second-level index (one entry per block in first level)

• We can repeat the process, creating a third, fourth, ..., top level until all entries of the top level fit in one disk block

• A multi-level index can be created for any type of first-level index (primary, secondary, clustering) as long as the first-level index consists of more than one disk block
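The repeat-until-one-block process can be sketched as follows. `fan_out` (the number of index entries that fit in one block) and `build_levels` are names assumed here for illustration, and only the keys of each entry are kept.

```python
def build_levels(first_level_keys, fan_out):
    """Return the index levels, first level first; the top level fits in one block."""
    levels = [list(first_level_keys)]
    while len(levels[-1]) > fan_out:
        below = levels[-1]
        # One entry per block of the level below: the first key of that block.
        levels.append(below[::fan_out])
    return levels

# Hypothetical first-level index with 1000 entries, 10 entries per block.
levels = build_levels(range(1000), fan_out=10)
print([len(level) for level in levels])
```

Each level is smaller than the one below by a factor of `fan_out`, so the number of levels grows only logarithmically with the size of the first-level index.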



Multilevel Indexes (contd.)

FIGURE 14.6: A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization.



Dynamic Multilevel Indexes Using B+-Trees

• In a B+-tree, data pointers are stored only at

the leaf nodes of the tree.

• The structure of leaf node differs from the

structure of the internal nodes.

• Leaf nodes are usually linked together for

ordered access on the search field



Dynamic Multilevel Indexes Using B+-Trees

Structure of internal nodes

1. Each internal node is of the form <P1, K1, P2, K2, …, Pq-1, Kq-1, Pq>

where q ≤ p and each Pi is a tree pointer

• Within each internal node, K1 < K2 < … < Kq-1

• Each internal node has at most p tree pointers

• Each internal node, except the root, has at least ceiling(p/2) tree pointers. The root node has at least two tree pointers if it is an internal node.

• An internal node with q pointers, q ≤ p, has q-1 search field values.



Dynamic Multilevel Indexes Using B+-Trees (contd.)

Structure of the leaf nodes

1. Each leaf node is of the form

<<K1, Pr1>, <K2, Pr2>, …, <Kq-1, Prq-1>, Pnext>

where q ≤ p, each Pri is a data pointer and Pnext points to the next leaf node of the B+-tree

• Within each leaf node, K1 < K2 < … < Kq-1; q ≤ p

• Each Pri points to the record whose field value is Ki or to a file block containing the record

• Each leaf node has at least ceiling(p/2) values.

• All leaf nodes are at the same level
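The two node layouts can be sketched as small classes with a search routine. The slides do not say which subtree holds a value equal to Ki; this sketch assumes the common textbook convention that subtree Pi holds values X with K(i-1) < X ≤ Ki. The class names, the tiny example tree, and the string data pointers are all hypothetical.

```python
from bisect import bisect_left

class Internal:
    def __init__(self, keys, children):
        self.keys = keys              # K1 < K2 < ... < Kq-1
        self.children = children      # q tree pointers P1..Pq

class Leaf:
    def __init__(self, entries, next_leaf=None):
        self.entries = entries        # list of (Ki, Pri) pairs
        self.next = next_leaf         # Pnext

def search(node, key):
    """Descend to a leaf, then look for the key among its entries."""
    while isinstance(node, Internal):
        # Assumed convention: values <= Ki go to the subtree left of Ki.
        node = node.children[bisect_left(node.keys, key)]
    for k, data_pointer in node.entries:
        if k == key:
            return data_pointer
    return None

def scan(leaf):
    """Ordered access: follow the Pnext pointers from leaf to leaf."""
    while leaf is not None:
        for k, _ in leaf.entries:
            yield k
        leaf = leaf.next

leaf3 = Leaf([(8, "r8"), (12, "r12")])
leaf2 = Leaf([(5, "r5"), (7, "r7")], leaf3)
leaf1 = Leaf([(1, "r1"), (3, "r3")], leaf2)
root = Internal([3, 7], [leaf1, leaf2, leaf3])

print(search(root, 7), list(scan(leaf1)))
```

Data pointers appear only in `Leaf`, and the `next` links give the sorted traversal without going back through the internal nodes.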



B+-Tree



Dynamic Multilevel Indexes Using B-Trees and B+-Trees

• An insertion into a node that is not full is quite

efficient; if a node is full the insertion causes a split

into two nodes

• Splitting may propagate to other tree levels

• A deletion is quite efficient if a node does not become

less than half full

• If a deletion causes a node to become less than half

full, it must be merged with neighboring nodes
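The leaf-level part of an insertion can be sketched as below. The split rule used here (the first ceiling((p_leaf + 1)/2) keys stay in the original leaf and the largest of them is copied up to the parent) is a common textbook convention assumed for illustration; the slides only state that a full node splits in two. `insert_into_leaf` is a hypothetical helper and data pointers are omitted.

```python
from bisect import insort

def insert_into_leaf(keys, key, p_leaf):
    """Insert key into a leaf holding at most p_leaf keys.

    Returns (left, right, separator). right and separator are None when
    no split was needed; otherwise separator is the key to copy upward.
    """
    keys = list(keys)
    insort(keys, key)                    # keep the leaf ordered
    if len(keys) <= p_leaf:
        return keys, None, None          # the cheap case: node was not full
    mid = (len(keys) + 1) // 2           # ceiling((p_leaf + 1) / 2)
    left, right = keys[:mid], keys[mid:]
    return left, right, left[-1]

print(insert_into_leaf([5], 8, p_leaf=2))
print(insert_into_leaf([5, 8], 1, p_leaf=2))
```

If the parent is itself full when the separator arrives, the same kind of split is applied there, which is how splitting propagates to other levels.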



FIGURE 14.12: An example of insertion in a B+-tree with q = 3 and pleaf = 2.



FIGURE 14.13: An example of deletion from a B+-tree.



CSE2147 Database Administration

File Organization and Efficient Searching Mechanisms - Hashing




Storage of Databases

Databases store large amounts of data that must

persist over long periods of time

Data is accessed and processed repeatedly

Most databases are stored permanently on magnetic disk secondary storage because of

High storage capacity (databases are too large to fit in main memory)

Nonvolatile storage

Lower cost of storage




Disk Storage Devices

Data is stored as magnetized areas on magnetic disk surfaces.

A disk pack contains several magnetic disks

connected to a rotating spindle.

Disks are divided into concentric circular tracks on

each disk surface

Because a track usually contains a large amount of information, it is divided into smaller blocks or sectors.

Whole blocks are transferred between disk and main

memory for processing.



Unordered Files

• Also called a heap or a pile file.

• New records are inserted at the end of the file.

• A linear search through the file records is necessary to

search for a record.

– This requires reading and searching half the file blocks on

the average, and is hence quite expensive.

• Record insertion is quite efficient.

• Reading the records in order of a particular field

requires sorting the file records.
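The "half the file blocks on the average" cost can be checked with a small simulation, assuming a made-up heap file of 10 blocks with 4 records each and counting one read per block visited. `linear_search` is a hypothetical helper.

```python
import random

def linear_search(blocks, key):
    """Return how many blocks were read before the key was found (None if absent)."""
    for reads, block in enumerate(blocks, start=1):
        if key in block:
            return reads
    return None

# Hypothetical heap file: 40 records in no particular order, 4 per block.
keys = list(range(40))
random.Random(0).shuffle(keys)
blocks = [keys[i:i + 4] for i in range(0, 40, 4)]

# Average block reads over a search for every record in the file.
average_reads = sum(linear_search(blocks, k) for k in range(40)) / 40
print(average_reads)
```

For b blocks the average over all stored records is (b + 1) / 2, and a search for a record that is not in the file reads all b blocks.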



Ordered Files

• Also called a sequential file.

• File records are kept sorted by the values of an ordering field.

• Insertion is expensive: records must be inserted in the correct order.

– It is common to keep a separate unordered overflow (or transaction) file for new records to improve insertion efficiency; this is periodically merged with the main ordered file.

• A binary search can be used to search for a record on its ordering field value.

– This requires reading and searching about log2(b) of the b file blocks on average, an improvement over linear search.

• Reading the records in order of the ordering field is quite efficient.
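A block-level binary search might look like the sketch below, assuming each block exposes its smallest and largest ordering-field value. The 1024-block file and `binary_search_blocks` are hypothetical.

```python
def binary_search_blocks(blocks, key):
    """Return (block number or None, number of blocks read)."""
    lo, hi, reads = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        block = blocks[mid]
        reads += 1                       # one block fetched per iteration
        if key < block[0]:
            hi = mid - 1
        elif key > block[-1]:
            lo = mid + 1
        else:
            return (mid if key in block else None), reads
    return None, reads

# Hypothetical ordered file: 1024 blocks of 4 records, keys 0..4095.
blocks = [list(range(i, i + 4)) for i in range(0, 4096, 4)]
print(binary_search_blocks(blocks, 2021))
```

With b = 1024 blocks no search reads more than 11 blocks, compared with roughly 512 on average for a linear scan of the same file.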




Hashing

• Another type of internal search structure

• Provides fast access to records on certain search conditions

• Organisation is usually called a hash file

• Internal Hashing : for internal files

• External Hashing : for disk files

Hashed Files

• Hashing is typically implemented as a hash table through the use of an array of records

– E.g., if the array index range is 0 to M-1, then we have M slots

– The hashing function transforms the hash field value into an integer between 0 and M-1

– Several hashing functions can be used to get the array index

• The problem with most hash functions is that they do not guarantee that distinct values will hash to distinct addresses.

– The number of possible values a hash field can take is usually much larger than the address space (number of available addresses for records)

– A collision occurs when the hash field value of a record that is being inserted hashes to an address that already contains a different record.
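As a small illustration, assume integer hash field values and the hash function h(K) = K mod M, which is one of several possible choices. The key values are made up.

```python
M = 10                                   # slots are numbered 0 .. M-1

def h(key):
    """Map a hash field value to an array index between 0 and M-1."""
    return key % M

keys = [15, 23, 35, 42, 63]

# Group the keys by the slot they hash to.
slots = {}
for key in keys:
    slots.setdefault(h(key), []).append(key)

# Any slot that received more than one distinct key is a collision.
collisions = {slot: ks for slot, ks in slots.items() if len(ks) > 1}
print(collisions)
```

Distinct values such as 15 and 35 land on the same address, so some collision resolution method is needed.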



Hashed Files (cont.)

There are numerous methods for collision resolution, including the

following:

• Open addressing: Proceeding from the occupied position specified by the hash address, the program checks the subsequent positions in order until an unused (empty) position is found.

• Chaining: For this method, various overflow locations are kept, usually by extending the array with a number of overflow positions. In addition, a pointer field is added to each record location. A collision is resolved by placing the new record in an unused overflow location and setting the pointer of the occupied hash address location to the address of that overflow location.

• Multiple hashing: The program applies a second hash function if the first results in a collision. If another collision results, the program uses open addressing or applies a third hash function and then uses open addressing if necessary.
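The first two methods can be sketched as follows, again assuming h(K) = K mod M. Two simplifications are assumptions of this sketch: open addressing wraps around to position 0 at the end of the array, and chaining uses one Python list per slot instead of overflow positions linked by pointer fields.

```python
M = 7

def h(key):
    return key % M

def insert_open_addressing(table, key):
    """Probe h(key), h(key)+1, ... (wrapping around) until an empty slot is found."""
    for step in range(M):
        slot = (h(key) + step) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

def insert_chaining(chains, key):
    """Append the key to the chain of records that share its hash address."""
    chains[h(key)].append(key)

table = [None] * M
probe_slots = [insert_open_addressing(table, k) for k in (10, 17, 24)]

chains = [[] for _ in range(M)]
for k in (10, 17, 24):
    insert_chaining(chains, k)

print(probe_slots, chains[3])
```

All three keys hash to address 3; open addressing spreads them over neighbouring slots, while chaining keeps them together behind the original address.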



Internal hashing data structures. (a) Array of M positions for use in internal hashing. (b) Collision resolution by chaining records.

Hashed Files

• Hashing for disk files is called External Hashing

• The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ..., bucketM-1.

– Typically, a bucket corresponds to one (or a fixed number of) disk block.

• One of the file fields is designated to be the hash key of the file.

• The record with hash key value K is stored in bucket i, where i=h(K), and h is the hashing function.

• Search is very efficient on the hash key.

• Collisions occur when a new record hashes to a bucket that is already full.

– An overflow file is kept for storing such records.

– Overflow records that hash to each bucket can be linked together.
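A static external hash file can be sketched as below, assuming integer hash keys, h(K) = K mod M, and one Python list per bucket plus one overflow list per bucket standing in for the linked overflow records. The class name `HashFile` and its parameters are hypothetical.

```python
class HashFile:
    def __init__(self, m, bucket_capacity):
        self.m = m                                    # fixed number of buckets M
        self.capacity = bucket_capacity               # records per bucket
        self.buckets = [[] for _ in range(m)]
        self.overflow = [[] for _ in range(m)]        # overflow chain per bucket

    def h(self, key):
        return key % self.m

    def insert(self, key, record):
        i = self.h(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, record))
        else:
            # Collision: the bucket is already full, so use the overflow chain.
            self.overflow[i].append((key, record))

    def search(self, key):
        i = self.h(key)
        for k, record in self.buckets[i] + self.overflow[i]:
            if k == key:
                return record
        return None

hash_file = HashFile(m=4, bucket_capacity=2)
for key in (1, 5, 9, 2):
    hash_file.insert(key, f"r{key}")
print(hash_file.search(9), len(hash_file.overflow[1]))
```

A search touches one bucket and, only when that bucket has overflowed, its chain of overflow records.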

Hashed Files (cont.)

• To reduce overflow records, a hash file is typically kept 70-80% full.

• The hash function h should distribute the records uniformly among the buckets

– Otherwise, search time will be increased because

many overflow records will exist.

• Main disadvantages of static external hashing:

– Fixed number of buckets M is a problem if the number of records in the file grows or shrinks.

– Ordered access on the hash key is quite inefficient (requires sorting the records).

Hashed Files – Overflow Handling

Dynamic And Extendible Hashed Files

• Dynamic and Extendible Hashing Techniques

– Hashing techniques are adapted to allow the dynamic growth and shrinking of the number of file records.

– These techniques include the following: dynamic hashing, extendible hashing, and linear hashing.

• Both dynamic and extendible hashing use the binary representation of the hash value h(K) in order to access a directory.

– In dynamic hashing the directory is a binary tree.

– In extendible hashing the directory is an array of size 2^d, where d is called the global depth.

Dynamic And Extendible Hashing (cont.)

• The directories can be stored on disk, and they expand or shrink dynamically.

– Directory entries point to the disk blocks that contain the stored records.

• An insertion in a disk block that is full causes the block to split into two blocks and the records are redistributed among the two blocks.

– The directory is updated appropriately.

• Dynamic and extendible hashing do not require an overflow area.

• Linear hashing does require an overflow area but does not use a directory.

– Blocks are split in linear order as the file expands.
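A compact sketch of extendible hashing with a directory of size 2^d and block splitting is given below. It rests on several assumptions not stated in the slides: keys are distinct non-negative integers, h(K) = K, the directory is addressed by the low-order d bits of h(K) (some presentations use the leading bits instead), and each bucket tracks a local depth. The class and attribute names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # number of hash bits this bucket depends on
        self.keys = []

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity         # records per bucket (disk block)
        self.global_depth = 0
        self.directory = [Bucket(0)]     # always 2**global_depth entries

    def _slot(self, key):
        # Assumed: h(K) = K, addressed by its low-order global_depth bits.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._slot(key)].keys

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory     # double the directory
            self.global_depth += 1
        # Split the full bucket on its next hash bit.
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = new_bucket   # update the directory
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:                       # redistribute the records
            self.directory[self._slot(k)].keys.append(k)
        self.insert(key)                         # retry; may split again

table = ExtendibleHash(capacity=2)
for key in range(8):
    table.insert(key)
print(table.global_depth, len(table.directory))
```

No overflow area is used: a full block is always split, and the directory doubles only when the block being split is already as deep as the directory.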
