SECTIONS 13.1 – 13.3
Sanuja Dabade & Eilbroun Benjamin, CS 257, Dr. T.Y. Lin
SECONDARY STORAGE MANAGEMENT
Presentation Outline
13.1 The Memory Hierarchy
  13.1.1 The Memory Hierarchy
  13.1.2 Transfer of Data Between Levels
  13.1.3 Volatile and Nonvolatile Storage
  13.1.4 Virtual Memory
13.2 Disks
  13.2.1 Mechanics of Disks
  13.2.2 The Disk Controller
  13.2.3 Disk Access Characteristics
Presentation Outline (con’t)
13.3 Accelerating Access to Secondary Storage
  13.3.1 The I/O Model of Computation
  13.3.2 Organizing Data by Cylinders
  13.3.3 Using Multiple Disks
  13.3.4 Mirroring Disks
  13.3.5 Disk Scheduling and the Elevator Algorithm
  13.3.6 Prefetching and Large-Scale Buffering
13.1.1 Memory Hierarchy
Data storage is provided by several components with different capacities.
The cost per byte of storage also varies.
The devices with the smallest capacity offer the fastest access at the highest cost per bit.
Memory Hierarchy Diagram
(From fastest to slowest: cache, main memory, disk used as virtual memory and as a file system, and tertiary storage. Programs and a main-memory DBMS operate at the top of the hierarchy.)
13.1.1 Memory Hierarchy
Cache: the lowest level of the hierarchy
Data items in the cache are copies of certain main-memory locations
Sometimes values in the cache are changed and the corresponding updates to main memory are delayed
The machine looks in the cache for instructions as well as for the data those instructions use
Holds only a limited amount of data
13.1.1 Memory Hierarchy (con't)
In a single-processor computer there is no need to update the data in main memory immediately
With multiple processors, data is written to main memory immediately; this is called write-through
Main Memory
Everything that happens in the computer (instruction execution, data manipulation) works on information resident in main memory
Main memories are random access: any byte can be obtained in the same amount of time
Secondary Storage
Used to store data and programs when they are not being processed
More permanent than main memory: data and programs are retained when the power is turned off
E.g. magnetic (hard) disks
Tertiary Storage
Holds data volumes measured in terabytes
Used for databases much larger than what can be stored on disk
13.1.2 Transfer of Data Between Levels
Data moves between adjacent levels of the hierarchy
At the secondary or tertiary levels, accessing the desired data or finding the desired place to store it takes a long time
A disk is organized into blocks
Entire blocks are moved to and from a main-memory region called a buffer
13.1.2 Transfer of Data Between Levels (cont'd)
A key technique for speeding up database operations is to arrange the data so that when one item on a block is needed, it is likely that other data on the same block will be needed at the same time
The same idea applies to the other hierarchy levels
13.1.3 Volatile and Nonvolatile Storage
A volatile device forgets its data when the power is turned off
A nonvolatile device holds its data even when it is turned off
All secondary and tertiary devices are nonvolatile; main memory is volatile
13.1.4 Virtual Memory
Typical software executes in virtual memory
The address space is typically 32 bits, i.e. 2^32 bytes or 4 GB
Transfer between memory and disk is in units of blocks
13.2.1 Mechanics of Disks
The use of secondary storage is one of the important characteristics of a DBMS
A disk consists of two moving assemblies:
1. Disk assembly 2. Head assembly
The disk assembly consists of one or more platters
Platters rotate around a central spindle
Bits are stored on the upper and lower surfaces of the platters
13.2.1 Mechanics of Disks (con't)
A disk surface is organized into tracks
The tracks at a fixed radius from the center, across all surfaces, form one cylinder
Tracks are organized into sectors: segments of the circle separated by gaps
13.2.2 The Disk Controller
One or more disks are controlled by a disk controller
Disk controllers are capable of:
Controlling the mechanical actuator that moves the head assembly
Selecting a sector from among all those in the cylinder at which the heads are positioned
Transferring bits between the desired sector and main memory
Possibly buffering an entire track
13.2.3 Disk Access Characteristics
Accessing (reading/writing) a block requires three steps:
The disk controller positions the head assembly at the cylinder containing the track on which the block is located. This is the seek time.
The disk controller waits while the first sector of the block moves under the head. This is the rotational latency.
All the sectors and the gaps between them pass under the head while the disk controller reads or writes the data. This is the transfer time.
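The three steps above sum to the block access time. A minimal sketch, assuming the Megatron 747 figures used later in these notes (6.46 ms average seek, 8.33 ms per rotation) plus a hypothetical 256 sectors per track:

```python
def block_access_time(seek_ms, rotation_ms, sectors_per_track, sectors_per_block):
    """Access time = seek time + rotational latency + transfer time.

    Rotational latency averages half a rotation; transfer time is the
    fraction of a rotation occupied by the block's sectors (gaps ignored).
    """
    rotational_latency = rotation_ms / 2
    transfer = rotation_ms * sectors_per_block / sectors_per_track
    return seek_ms + rotational_latency + transfer

# Hypothetical disk: 6.46 ms average seek, 8.33 ms per rotation,
# 256 sectors per track, a block spanning 16 sectors.
t = block_access_time(6.46, 8.33, 256, 16)
```

With these assumed parameters the result is about 11 ms, consistent with the 16K-block figure quoted for the M747 below.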
13.3 Accelerating Access to Secondary Storage
Several approaches for accessing data in secondary storage more efficiently:
Place blocks that are accessed together in the same cylinder.
Divide the data among multiple disks.
Mirror disks.
Use disk-scheduling algorithms.
Prefetch blocks into main memory.
Scheduling latency: the added delay in accessing data caused by a disk-scheduling algorithm.
Throughput: the number of disk accesses per second that the system can accommodate.
13.3.1 The I/O Model of Computation
The number of block accesses (disk I/Os) is a good approximation of an algorithm's running time, and it should be minimized.
Ex 13.3: We want an index on R that identifies the block on which the desired tuple appears, but not where on the block it resides.
For the Megatron 747 (M747) example, it takes 11 ms to read a 16K block.
A standard microprocessor can execute millions of instructions in 11 ms, making the time spent searching the block for the desired tuple negligible.
13.3.2 Organizing Data by Cylinders
If we read all blocks on a single track or cylinder consecutively, we can neglect all but the first seek time and the first rotational latency.
Ex 13.4: We request 1024 blocks of the M747.
If the data is randomly distributed, the average latency is 10.76 ms per block by Ex 13.2, making the total latency about 11 s.
If all blocks are stored consecutively on one cylinder:
6.46 ms + 8.33 ms * 16 = about 140 ms
(1 average seek) + (time per rotation) * (# rotations)
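The comparison above can be checked directly (a minimal sketch using the example's numbers: 10.76 ms average latency per random block, 6.46 ms seek, 8.33 ms per rotation, 16 tracks per cylinder):

```python
def random_read_ms(n_blocks, avg_latency_ms):
    # Randomly placed blocks: each block pays its own average latency.
    return n_blocks * avg_latency_ms

def cylinder_read_ms(seek_ms, rotation_ms, n_tracks):
    # One cylinder: one seek, then one full rotation per track.
    return seek_ms + rotation_ms * n_tracks

random_ms = random_read_ms(1024, 10.76)    # about 11 seconds
cyl_ms = cylinder_read_ms(6.46, 8.33, 16)  # about 140 milliseconds
```

Cylinder placement is thus roughly 80 times faster for this bulk read.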
13.3.3 Using Multiple Disks
With n disks, read/write performance increases by a factor of n.
Striping: distributing a relation across multiple disks in a round-robin pattern:
Disk 1 holds blocks 1, 1+n, 1+2n, ...
Disk 2 holds blocks 2, 2+n, 2+2n, ...
...
Disk n holds blocks n, 2n, 3n, ...
Ex 13.5: We request 1024 blocks with n = 4:
6.46 ms + 8.33 ms * (16/4) = about 39.8 ms
(1 average seek) + (time per rotation) * (# rotations)
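A minimal sketch of the striping pattern and of the example's timing (0-based block numbering for simplicity; the 6.46 ms seek and 8.33 ms rotation figures are the example's):

```python
def disk_for_block(block_index, n_disks):
    """Round-robin striping: block i lives on disk i mod n (0-based)."""
    return block_index % n_disks

# With n = 4 disks, consecutive blocks rotate across disks 0..3.
layout = [disk_for_block(i, 4) for i in range(8)]

# Each disk now reads only 16/4 = 4 of the cylinder's 16 tracks in parallel.
striped_ms = 6.46 + 8.33 * (16 / 4)
```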
13.3.4 Mirroring Disks
Mirroring disks: having two or more disks hold identical copies of the data.
Benefit 1: If n disks are mirrors of each other, the system can survive the crash of up to n-1 of them.
Benefit 2: With n disks, read performance increases by a factor of n.
Performance increases further if, for each read, the controller selects the disk whose head is closest to the desired block.
13.3.5 Disk Scheduling and the Elevator Algorithm
The disk controller runs this algorithm to select which of several pending requests to process first.
Pseudocode:

    requests[]                       // all unprocessed data requests
    upon receiving a new request:
        requests[].add(new request)
    while requests[] is not empty:
        move head to next location
        if head location matches a request in requests[]:
            retrieve the data
            remove the request from requests[]
        if head reaches the end:
            reverse head direction
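The pseudocode above can be made concrete. A minimal sketch that computes only the service order for a fixed set of pending requests (requests arriving mid-sweep, as in the walkthrough that follows, are ignored for simplicity, and no timing model is included):

```python
def elevator_order(head, requests):
    """Service pending cylinder requests in elevator (SCAN) order:
    sweep upward from the head position, then reverse and sweep down."""
    pending = sorted(requests)
    up = [c for c in pending if c >= head]         # serviced on the way up
    down = [c for c in pending if c < head][::-1]  # then on the way back down
    return up + down

# Head at cylinder 30000, four pending requests.
order = elevator_order(30000, [8000, 24000, 56000, 64000])
```

The head sweeps up through 56000 and 64000, reverses, and picks up 24000 and 8000 on the way down.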
13.3.5 Disk Scheduling and the Elevator Problem (con't)
Walkthrough (head starts at cylinder 0; requests for cylinders 8000, 24000 and 56000 are already pending):

    t = 4.3   get data at 8000
    t = 10    request for 16000 arrives
    t = 13.6  get data at 24000
    t = 20    request for 64000 arrives
    t = 26.9  get data at 56000
    t = 30    request for 40000 arrives
    t = 34.2  get data at 64000
    t = 45.5  get data at 40000
    t = 56.8  get data at 16000

    Elevator Algorithm        FIFO Algorithm
    data      time            data      time
     8000      4.3             8000      4.3
    24000     13.6            24000     13.6
    56000     26.9            56000     26.9
    64000     34.2            16000     42.2
    40000     45.5            64000     59.5
    16000     56.8            40000     70.8
13.3.6 Prefetching and Large-Scale Buffering
If, at the application level, we can predict the order in which blocks will be requested, we can load them into main memory before they are needed.
DISK FAILURES
Xiaqing He
ID: 204
Dr. Lin
Content
1) Focus: how to recover from disk crashes. The common term is RAID, "Redundant Arrays of Independent Disks".
2) Several schemes to recover from disk crashes:
Mirroring (RAID level 1); Parity checks (RAID 4); An improvement (RAID 5); RAID 6
1) Mirroring
The simplest scheme for recovering from disk crashes.
How does mirroring work?
-- make two or more copies of the data on different disks
Benefits:
-- the data survives the failure of one disk;
-- the data is spread over several disks, allowing several blocks to be accessed at once
1) Mirroring (con't)
When can data be lost under mirroring?
-- only if a second (mirror) disk crashes while the first (data) disk crash is being repaired.
Probability estimate. Suppose:
- one disk: mean time to failure = 10 years;
- first failure of either of the two disks: mean time = 5 years;
- replacing the failed disk takes 3 hours = 1/2920 year.
Then:
- the probability that the mirror disk fails during the repair = 1/10 * 1/2,920 = 1/29,200;
- the rate of data loss under mirroring = 1/5 * 1/29,200 = 1/146,000 per year.
2) Parity Blocks
Why a change? The disadvantage of mirroring is that it uses so many redundant disks.
What's new? RAID level 4 uses only one redundant disk.
How does this one redundant disk work?
-- modulo-2 sums: the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
Example:
2) Parity Blocks (con't): Example
Data disks:
Disk 1: 11110000
Disk 2: 10101010
Disk 3: 00111000
Redundant disk:
Disk 4: 01100010
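The modulo-2 sum can be checked directly. A minimal sketch over the example's bit strings:

```python
def parity(*disks):
    """Modulo-2 (XOR) sum of the corresponding bits of the data disks."""
    n = len(disks[0])
    return "".join(str(sum(int(d[j]) for d in disks) % 2) for j in range(n))

# The example's three data disks yield the redundant disk's contents.
redundant = parity("11110000", "10101010", "00111000")
```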
2) RAID 4 (con't)
Reading: similar to reading a block from any single disk.
Writing:
1) change the data disk;
2) change the corresponding block of the redundant disk.
Why? The redundant disk holds the parity checks for the corresponding blocks of all the data disks.
2) RAID 4 (con't): writing
For N data disks:
1) the naive way:
- read all N data disks and compute the modulo-2 sum of the corresponding blocks;
- rewrite the redundant disk's block to that modulo-2 sum.
2) the better way:
- take the modulo-2 sum of the old and new versions of the data block being rewritten;
- flip the bits of the redundant disk in the positions where that sum has 1's.
2) RAID 4 (con't)
Data disks:
Disk 1: 11110000
Disk 2: 10101010 -> 01100110 (rewritten)
Disk 3: 00111000
To do:
Modulo-2 sum of the old and new versions of disk 2: 11001100
So we need to flip positions 1, 2, 5, 6 of the redundant disk.
Redundant disk:
Disk 4: 01100010 -> 10101110
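The "better way" update can be sketched directly with the example's values:

```python
def xor_bits(a, b):
    """Bitwise modulo-2 sum of two equal-length bit strings."""
    return "".join(str(int(x) ^ int(y)) for x, y in zip(a, b))

old_data, new_data = "10101010", "01100110"  # disk 2, before and after
delta = xor_bits(old_data, new_data)         # positions that flipped
new_redundant = xor_bits("01100010", delta)  # flip the same positions on disk 4
```

Only the rewritten data block and the redundant block are touched, rather than all N data disks.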
2) RAID 4 (con't): failure recovery
Redundant disk crash:
-- swap in a new disk and recompute its contents from all the data disks.
Data disk crash:
-- swap in a new disk;
-- recompute its data from the other disks, including the redundant disk.
How to recompute? The same rule in either case (which is why an improvement is possible):
-- take the modulo-2 sum of the corresponding bits of all the other disks.
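The recovery rule above, sketched with the example's contents, where data disk 2 has crashed:

```python
def xor_all(blocks):
    """Modulo-2 sum across blocks: recovers any one missing block
    from all the others (data disks plus the redundant disk)."""
    out = blocks[0]
    for b in blocks[1:]:
        out = "".join(str(int(x) ^ int(y)) for x, y in zip(out, b))
    return out

# Disk 2 crashed; XOR the surviving disks 1, 3 and the redundant disk 4.
recovered = xor_all(["11110000", "00111000", "01100010"])
```

The result is disk 2's original contents, 10101010.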
3) An Improvement: RAID 5
Why is an improvement needed?
-- the shortcoming of RAID level 4 is a bottleneck: every update to a data disk must also read and write the one redundant disk.
Principle of RAID level 5 (RAID 5):
-- treat each disk as the redundant disk for some of the blocks.
Why is this feasible? The failure-recovery rule for the redundant disk and for a data disk is the same:
"take the modulo-2 sum of all the corresponding bits of all the other disks"
So there is no need to fix one disk as redundant and the others as data disks.
3) RAID 5 (con't)
How do we decide which blocks of each disk treat that disk as redundant?
-- if there are n+1 disks labeled 0 through n, treat cylinder i of disk j as redundant if j is the remainder when i is divided by n+1.
Example: n+1 = 4 disks.
- the first disk, labeled 0, is redundant for cylinders 0, 4, 8, 12, ...
- the second disk, labeled 1, for cylinders 1, 5, 9, ...
- the third disk, labeled 2, for cylinders 2, 6, 10, ...
- and so on.
Suppose all 4 disks are equally likely to be written. For any one disk, the probability that a given write involves it is
1/4 + 3/4 * 1/3 = 1/2
In general, with m disks: 1/m + (m-1)/m * 1/(m-1) = 2/m
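The cylinder-to-disk assignment rule above can be sketched as:

```python
def redundant_disk_for_cylinder(i, n_plus_1):
    """Under RAID 5, cylinder i's redundant block lives on disk
    i mod (n+1); the other n disks hold data for that cylinder."""
    return i % n_plus_1

# Four disks labeled 0..3: disk 0 is redundant for cylinders 0, 4, 8, ...,
# disk 1 for cylinders 1, 5, 9, ..., and so on.
assignments = [redundant_disk_for_cylinder(i, 4) for i in range(1, 7)]
```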
4) Coping with Multiple Disk Crashes
RAID 6 can deal with any number of disk crashes, given enough redundant disks.
Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
How do we set up the 3x7 matrix? (Why 3 rows? There are 3 redundant disks.)
1) every column holds three values (1's and 0's), and no column is all 0's;
2) each redundant disk's column has a single 1;
3) each data disk's column has at least two 1's.
4) Coping with Multiple Disk Crashes (con't)
Reading: read from the data disks and ignore the redundant disks.
Writing:
change the data disk;
change the corresponding bits of the redundant disks whose rows have a 1 in that data disk's column.
4) Coping with Multiple Disk Crashes (con't)
In this system of 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
Suppose disks a and b have failed:
- find some row r of the 3x7 matrix in which the columns for a and b differ (say a's entry is 0 and b's is 1);
- recompute b as the modulo-2 sum of the corresponding bits of all the disks other than b that have a 1 in row r;
- having restored b, recompute a the same way, now that all the other disks are available.
Example:
4) Coping with Multiple Disk Crashes (con't): example
The 3x7 matrix (disks 1-4 are data disks, disks 5-7 are redundant):

    disk number   1  2  3  4  5  6  7
    row 1         1  1  1  0  1  0  0
    row 2         1  1  0  1  0  1  0
    row 3         1  0  1  1  0  0  1
4) Coping with Multiple Disk Crashes (con't): example
First block of each disk:

    disk  contents
    1)    11110000
    2)    10101010
    3)    00111000
    4)    01000001
    5)    01100010
    6)    00011011
    7)    10001001
4) Coping with Multiple Disk Crashes (con't): example
Two disks crash:

    disk  contents
    1)    11110000
    2)    ????????
    3)    00111000
    4)    01000001
    5)    ????????
    6)    00011011
    7)    10001001
4) Coping with Multiple Disk Crashes (con't): example
In the 3x7 matrix, row 2 gives disks 2 and 5 different values: disk 2's entry is 1 and disk 5's is 0.
So: compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4 and 6 (the other disks with a 1 in row 2);
then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2 and 3 (row 1).
    disk  contents
    1)    11110000
    2)    ???????? => 10101010
    3)    00111000
    4)    01000001
    5)    ???????? => 01100010
    6)    00011011
    7)    10001001
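Recomputing the two crashed disks from the surviving contents can be sketched as follows (block contents and matrix rows taken from the example above):

```python
def xor_all(blocks):
    """Modulo-2 sum across equal-length bit strings."""
    out = blocks[0]
    for b in blocks[1:]:
        out = "".join(str(int(x) ^ int(y)) for x, y in zip(out, b))
    return out

# Row 2 of the 3x7 matrix is 1101010: disk 2 appears (entry 1) and
# disk 5 does not (entry 0), so disk 2 is recoverable from row 2 first.
contents = {1: "11110000", 3: "00111000", 4: "01000001",
            6: "00011011", 7: "10001001"}                       # survivors
contents[2] = xor_all([contents[1], contents[4], contents[6]])  # row 2
contents[5] = xor_all([contents[1], contents[2], contents[3]])  # row 1
```

With the contents listed above, disk 2 recovers to 10101010 and disk 5 to 01100010.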
13.5 Arranging Data on Disk
Meghna Jain, ID 205, CS 257
Prof: Dr. T.Y. Lin
Data elements are represented as records, which are stored in consecutive bytes in the same disk block.
Basic layout technique for storing data:
Fixed-Length Records
Allocation criterion: data should start at a word boundary.
Fixed-length record header:
1. A pointer to the record schema.
2. The length of the record.
3. Timestamps indicating when the record was last modified or last read.
Example
CREATE TABLE employee(
  name CHAR(30) PRIMARY KEY,
  address VARCHAR(255),
  gender CHAR(1),
  birthdate DATE
);
Data should start at a word boundary, and each record contains a header and four fields: name, address, gender and birthdate.
Packing Fixed-Length Records into Blocks:
Records are stored in blocks on the disk and are moved into main memory when we need to update or access them.
A block header is written first, followed by a series of records.
The block header contains the following information:
Links to one or more blocks that are part of a network of blocks.
Information about the role played by this block in such a network.
Information about which relation the tuples in this block belong to.
A "directory" giving the offset of each record in the block.
Timestamp(s) indicating the time of the block's last modification and/or access.
Example
Along with the header we pack as many records as we can into one block, as shown in the figure; the remaining space is unused.
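A minimal sketch of the packing arithmetic, under assumed sizes (a 4096-byte block, a 24-byte block header, and the 297-byte record from the movie-star example later in these notes):

```python
def records_per_block(block_size, header_size, record_size):
    """How many fixed-length records fit after the block header;
    the remainder of the block is unused space."""
    return (block_size - header_size) // record_size

# Hypothetical sizes: 4096-byte block, 24-byte header, 297-byte records.
n = records_per_block(4096, 24, 297)
```

With these assumptions, 13 records fit and 211 bytes are left unused.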
13.6 REPRESENTING BLOCK AND RECORD ADDRESSES
Ramya Karri, CS 257 Section 2, ID: 206
Introduction
Address of a Block and Record
In main memory:
The address of the block is the virtual-memory address of its first byte.
The address of a record within the block is the virtual-memory address of the record's first byte.
In secondary memory: a sequence of bytes describes the location of the block in the overall system, e.g. the device ID for the disk, the cylinder number, etc.
Addresses in Client-Server Systems
Addresses in the address space are represented in two ways:
Physical addresses: byte strings that determine the place within the secondary storage system where the record can be found.
Logical addresses: arbitrary byte strings of some fixed length.
Physical address bits are used to indicate: the host to which the storage is attached, the identifier for the disk, the cylinder number, the track number, and the offset of the beginning of the record.
A map table relates logical addresses to physical addresses:
logical address -> physical address
ADDRESSES IN CLIENT-SERVER SYSTEMS (CONTD..)
Logical and Structured Addresses
What is the purpose of a logical address? It gives more flexibility when we:
move the record around within the block,
move the record to another block,
decide what to do when a record is deleted.
(Figure: a block with a header, an offset table growing from one end, records 1-4, and unused space in the middle.)
Pointer Swizzling
Pointers are common in object-relational database systems, so it is important to learn about their management.
Every data item (block, record, etc.) has two addresses:
its database address: the address on disk;
its memory address, if the item is currently in virtual memory.
POINTER SWIZZLING (CONTD…)
The translation table maps database addresses to memory addresses.
All addressable items in the database have entries in the map table, while only the items currently in memory are mentioned in the translation table:
Dbaddr -> Mem-addr (database address -> memory address)
POINTER SWIZZLING (CONTD…)
A pointer consists of the following two fields:
a bit indicating the type of address (database or memory);
the database or memory address itself.
Example 13.17:
Block 1 has a record with pointers to a second record on the same block and to a record on another block (Block 2).
If Block 1 is copied to memory:
the first pointer, which points within Block 1, can be swizzled so that it points directly to the memory address of the target record;
since Block 2 is not in memory, we cannot swizzle the second pointer.
POINTER SWIZZLING (CONTD…)
Three types of swizzling:
Automatic swizzling: as soon as a block is brought into memory, swizzle all relevant pointers.
Swizzling on demand: only swizzle a pointer if and when it is actually followed.
No swizzling: pointers are never swizzled; items are always accessed via their database addresses.
Programmer Control of Swizzling
Unswizzling:
When a block is moved from memory back to disk, all pointers must be converted back to database (disk) addresses.
The translation table is used again, so it is important to have an efficient data structure for it.
Pinned Records and Blocks
A block in memory is said to be pinned if it cannot safely be written back to disk.
If block B1 has a swizzled pointer to an item in block B2, then B2 is pinned.
To unpin a block:
keep in the translation table the places in memory holding swizzled pointers to items in that block;
unswizzle those pointers (use the translation table to replace the memory addresses with database (disk) addresses).
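The swizzle/unswizzle round trip through the translation table can be sketched as follows (a minimal toy model; the address formats and the `("db", ...)` / `("mem", ...)` tagged-pointer representation are illustrative assumptions, standing in for the type bit described above):

```python
class TranslationTable:
    """Toy translation table: maps database (disk) addresses to memory
    addresses while an item is resident, and back again on eviction."""

    def __init__(self):
        self.db_to_mem = {}

    def load(self, db_addr, mem_addr):
        # Record that the item at db_addr is now resident at mem_addr.
        self.db_to_mem[db_addr] = mem_addr

    def swizzle(self, pointer):
        # A ("db", addr) pointer becomes ("mem", addr) if the target is resident.
        kind, addr = pointer
        if kind == "db" and addr in self.db_to_mem:
            return ("mem", self.db_to_mem[addr])
        return pointer

    def unswizzle(self, pointer):
        # Replace a memory address with its database address on write-back.
        kind, addr = pointer
        if kind == "mem":
            for db, mem in self.db_to_mem.items():
                if mem == addr:
                    return ("db", db)
        return pointer

tt = TranslationTable()
tt.load("disk:block2:rec1", 0x5000)       # hypothetical addresses
p = tt.swizzle(("db", "disk:block2:rec1"))
```

The linear scan in `unswizzle` is exactly why the notes stress an efficient data structure for the translation table; a real system would keep a reverse index.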
VARIABLE LENGTH DATA AND RECORDS
Eswara Satya Pavan Rajesh Pinapala, CS 257, ID: 221
Topics
Records with Variable-Length Fields
Records with Repeating Fields
Variable-Format Records
Records That Do Not Fit in a Block
BLOBs

Example
Fig 1: a movie-star record with four fixed-length fields (name, address, gender, birth date) at byte offsets 0, 30, 286 and 287; the record ends at byte 297.
Records with Variable-Length Fields
An effective way to represent variable-length records is as follows:
fixed-length fields are kept ahead of the variable-length fields;
the record header contains
• the length of the record,
• pointers to the beginning of each variable-length field except the first.
Records with Variable-Length Fields
Figure 2: a movie-star record with name and address implemented as variable-length character strings. The header holds the record length and a pointer to address, followed by the fixed-length fields (birth date, gender) and then the variable-length name and address.
Records with Repeating Fields
A record may contain a variable number of occurrences of a field F.
All occurrences of F are grouped together, and the record header contains a pointer to the first occurrence.
L bytes are devoted to each instance of field F.
To locate an occurrence of F within the record, add to the offset of the group an integer multiple of L (0, L, 2L, 3L, and so on), stopping upon reaching the desired occurrence.
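The offset arithmetic above can be sketched as follows (the starting offset of 60 and the 8-byte pointer size are hypothetical values for illustration):

```python
def occurrence_offset(group_offset, L, i):
    """Byte offset of the i-th occurrence (0-based) of repeating field F,
    given the group's starting offset and L bytes per occurrence."""
    return group_offset + i * L

# Hypothetical layout: movie pointers start at byte 60, 8 bytes each.
offsets = [occurrence_offset(60, 8, i) for i in range(3)]
```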
Records with Repeating Fields
Figure 3: a record with a repeating group of references to movies. The header holds the record length, a pointer to address, and a pointer to the movie pointers; the repeating pointers to movies follow the fixed fields.
Figure 4: storing variable-length fields separately from the record. The record header holds, for each variable-length field, a pointer to it and its length (to name / length of name, to address / length of address), plus a pointer to the movie references and a count of the references.
Records with Repeating Fields
Advantage
Keeping the record itself fixed length allows record to be searched more efficiently, minimizes the overhead in the block headers, and allows records to be moved within or among the blocks with minimum effort.
Disadvantage
Storing variable length components on another block increases the number of disk I/O’s needed to examine all components of a record.
Records with Repeating Fields
A compromise strategy is to allocate a fixed portion of the record for the repeating fields:
if the number of occurrences is smaller than the allocated space, some space is unused;
if it is greater, the extra occurrences are stored in a different location, and a pointer to that location and a count of the additional occurrences are stored in the record.
Variable-Format Records
Records that do not have a fixed schema.
Variable-format records are represented by a sequence of tagged fields.
Each tagged field carries:
• the attribute or field name
• the type of the field
• the length of the field
• the value of the field
Why use tagged fields?
• information-integration applications
• records with a very flexible schema
Variable-Format Records
Fig 5: a record with tagged fields. Two fields are shown: the code for name, the code for string type, length 16, value "Clint Eastwood"; then the code for restaurant owned, the code for string type, length 14, value "Hog's Breath Inn".
Records That Do Not Fit in a Block
When the length of a record is greater than the block size, the record is divided and placed into two or more blocks.
The portion of the record in each block is referred to as a RECORD FRAGMENT.
A record with two or more fragments is called a SPANNED RECORD.
A record that does not cross a block boundary is called an UNSPANNED RECORD.
Spanned records require the following extra header information:
• a bit indicating whether the piece is a fragment;
• bits indicating whether it is the first or last fragment of its record;
• pointers to the next and/or previous fragment of the same record.
Spanned Records
Records That Do Not Fit in a Block
Figure 6: storing spanned records across blocks. Block 1 holds record 1 and fragment 2-a of record 2; block 2 holds fragment 2-b and record 3. Each block has a block header and each record a record header.
BLOBS
Large binary objects are called BLOBs, e.g. audio files, video files.
Storage of BLOBs
Retrieval of BLOBs
Record Modifications
Chapter 13
Section 13.8
Neha Samant, CS 257 (Section II), ID 222
Modification types
Insertion
Deletion
Update
Insertion
Insertion of records without order:
records can be placed in any block with empty space, or in a new block.
Insertion of records in a fixed order:
either space is available in the block, or there is no space and the record must go outside the block;
a structured address or a pointer to the record from outside the block must keep working.
Insertion in Fixed Order
Space available within the block:
use an offset table in the header of each block, with pointers to the location of each record in the block;
the records are slid within the block and the pointers in the offset table are adjusted.
(Figure: a block with a header, an offset table, records 1-4, and unused space.)
Insertion in fixed order
No space available within the block (insert outside the block):
Find space on a "nearby" block.
• If no space is available in the block, look at the following block in the sorted order of blocks.
• If space is available there, move the highest records of the first block to the second and slide the records around on both blocks.
Create an overflow block.
• Records can be stored in the overflow block.
• Each block header has room for a pointer to an overflow block.
• The overflow block can point to a second overflow block, and so on:
Block B -> overflow block for B
Deletion
Recovering space after deletion:
when an offset table is used, the records can be slid around the block so that the unused region is in the center, where it can be reused.
If records cannot be slid, an available-space list can be maintained in the block header:
the list head goes in the block header, and the available regions themselves hold the links of the list.
Deletion
Use of tombstones:
a tombstone is left in place of a deleted record so that pointers to the deleted record do not end up pointing at new records.
The tombstone is permanent until the entire database is reconstructed.
If pointers go to fixed locations from which the record's location is found, the tombstone is put in that fixed location (see examples).
Where a tombstone is placed depends on the nature of the record pointers.
A map table can be used to translate a logical record address to a physical address.
Deletion
Use of tombstones:
if records are to be replaced by tombstones, place the bit that serves as the tombstone at the beginning of the record.
This bit stays at the record location, and the subsequent bytes can be reused for another record.
(Figure: record 1 has been replaced, but its tombstone bit remains; record 2 has no tombstone and can be seen when we follow a pointer to it.)
Update
Fixed-length update: no effect on the storage system, as the record occupies the same space as before the update.
Variable-length update: the new value may be longer or shorter than the old one.
Update
Variable-length update (longer):
if stored on the same block: slide the records, or create an overflow block;
if stored on another block: move records around that block, or create a new block for storing the variable-length fields.
Update
Variable-length update (shorter): same as deletion:
recover the space and consolidate it.
B-TREES & BITMAP INDEXES
14.2
DATABASE SYSTEMS – The Complete Book
Presented by: Maciej Kicinski   Under the supervision of: Dr. T.Y. Lin
B-Trees
Structure
A B-tree is a balanced tree: all paths from the root to a leaf have the same length.
There is a parameter n associated with each B-tree block: each block has space for n search keys and n+1 pointers.
The root may hold as few as one search key, but all other blocks must be at least half full.
Structure
A typical node holds N search keys and N+1 pointers.
In a typical interior node, the pointers point to the nodes at the next level down.
In a typical leaf, the pointers point to the records themselves.
Application
Either the search key of the B-tree is the primary key of the data file, and the data file is sorted by that primary key;
or the data file is sorted by an attribute that is not a key, and this attribute is the search key for the B-tree.
Lookup
At an interior node, choose the correct pointer to follow by comparing the keys to the search value.
At a leaf node, choose the key that matches what you are looking for; its pointer leads to the data.
Insertion
When inserting, choose the correct leaf node in which to put the pointer to the data.
If the node is full, create a new node and split the keys between the two.
Recurse upward: if a parent cannot take a pointer to the new node because it is full, split it as well.
This may end with the creation of a new root node, if the current root was full.
Deletion
Perform a lookup to find the entry, then delete it.
If the node is no longer at least half full, either join it with an adjacent sibling and recursively delete upward, or move a key over from a sibling that has keys to spare and adjust the parent's pointer.
Efficiency
B-trees allow lookup, insertion, and deletion of records using very few disk I/Os.
Each level of a B-tree requires one read; you then follow a pointer to the next level or to the final record.
Efficiency
Three levels are sufficient for most B-trees: with 255 pointers per block, 255^3 is about 16.6 million.
Disk I/Os can be reduced further by keeping upper levels of the B-tree in main memory: keeping the root block (255 pointers) in memory reduces the reads to 2, and keeping the next level's 255 blocks in memory as well reduces them to 1.
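The capacity claim above can be checked directly:

```python
def btree_reach(pointers_per_block, levels):
    """Number of pointers reachable from a B-tree of the given height
    when every block holds the same number of pointers."""
    return pointers_per_block ** levels

# Three levels with 255 pointers per block.
reach = btree_reach(255, 3)
```

This gives 16,581,375, i.e. about 16.6 million records reachable in three reads.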
CONCURRENCY CONTROL
18.1 – 18.2
Chiu Luk, CS 257 Database Systems Principles, Spring 2009
Concurrency Control
Concurrency control in database management systems (DBMS) ensures that database transactions are performed concurrently without the concurrency violating the data integrity of a database.
Executed transactions should follow the ACID rules. The DBMS must guarantee that only serializable (unless Serializability is intentionally relaxed), recoverable schedules are generated.
It also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the related database.
Transaction ACID rules
Atomicity - Either the effects of all or none of its operations remain when a transaction is completed - in other words, to the outside world the transaction appears to be indivisible, atomic.
Consistency - Every transaction must leave the database in a consistent state. Isolation - Transactions cannot interfere with each other. Providing isolation is the main goal of concurrency control.
Durability - Successful transactions must persist through crashes.
Serial and Serializable Schedules
In the field of databases, a schedule is a list of actions, (i.e. reading, writing, aborting, committing), from a set of transactions. In this example, Schedule D is the set of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions as seen by the DBMS. T1 Reads and writes to object X, and then T2 Reads and writes to object Y, and finally T3 Reads and writes to object Z. This is an example of a serial schedule, because the actions of the 3 transactions are not interleaved.
Serial and Serializable Schedules
A schedule that is equivalent to a serial schedule has the serializability property.
In schedule E, the order in which the actions of the transactions are executed is not the same as in D, but in the end, E gives the same result as D.
Serial Schedule: T1 precedes T2 (initially A = B = 25)

T1: Read(A); A := A+100; Write(A);   -- A = 125
T1: Read(B); B := B+100; Write(B);   -- B = 125
T2: Read(A); A := A*2;   Write(A);   -- A = 250
T2: Read(B); B := B*2;   Write(B);   -- B = 250

Final state: A = 250, B = 250
Serial Schedule: T2 precedes T1 (initially A = B = 25)

T2: Read(A); A := A*2;   Write(A);   -- A = 50
T2: Read(B); B := B*2;   Write(B);   -- B = 50
T1: Read(A); A := A+100; Write(A);   -- A = 150
T1: Read(B); B := B+100; Write(B);   -- B = 150

Final state: A = 150, B = 150
A serializable, but not serial, schedule (initially A = B = 25)

T1: Read(A); A := A+100; Write(A);   -- A = 125
T2: Read(A); A := A*2;   Write(A);   -- A = 250
T1: Read(B); B := B+100; Write(B);   -- B = 125
T2: Read(B); B := B*2;   Write(B);   -- B = 250

Final state: A = 250, B = 250 — the same as the serial schedule in which T1 precedes T2.

In shorthand notation: r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B);
A nonserializable schedule (initially A = B = 25)

T1: Read(A); A := A+100; Write(A);   -- A = 125
T2: Read(A); A := A*2;   Write(A);   -- A = 250
T2: Read(B); B := B*2;   Write(B);   -- B = 50
T1: Read(B); B := B+100; Write(B);   -- B = 150

Final state: A = 250, B = 150 — matching neither serial order, so the schedule is not serializable.
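The four schedules above can be checked mechanically. A minimal Python sketch (my own modeling, not from the slides): since each read-compute-write step on one element is executed as a unit here, a schedule can be represented as an ordered list of (transaction, element) pairs.

```python
# Simulate the example schedules: T1 adds 100 to an element, T2 doubles it.

def run(schedule, A=25, B=25):
    """Apply (transaction, element) actions in order; each action is the
    whole read-compute-write step for that transaction on that element."""
    state = {"A": A, "B": B}
    ops = {"T1": lambda v: v + 100, "T2": lambda v: v * 2}
    for txn, elem in schedule:
        state[elem] = ops[txn](state[elem])
    return state["A"], state["B"]

serial_12 = [("T1", "A"), ("T1", "B"), ("T2", "A"), ("T2", "B")]
serial_21 = [("T2", "A"), ("T2", "B"), ("T1", "A"), ("T1", "B")]
serializable = [("T1", "A"), ("T2", "A"), ("T1", "B"), ("T2", "B")]
nonserializable = [("T1", "A"), ("T2", "A"), ("T2", "B"), ("T1", "B")]

print(run(serial_12))        # (250, 250)
print(run(serial_21))        # (150, 150)
print(run(serializable))     # (250, 250) -- same as serial T1, T2
print(run(nonserializable))  # (250, 150) -- matches neither serial order
```

The serializable interleaving ends in the same state as the serial order (T1, T2), while the nonserializable one matches neither serial order.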
schedule that is serializable only because of the detailed behavior of the transactions
T1 T2’
Read(A); A A+100
Write(A);
Read(A);A A1;
Write(A);
Read(B);B B1;
Write(B);
Read(B); B B+100;
Write(B);
regardless of the consistent initial state: the final state will be consistent.
A B25 25
125
125
25
125125 125
Non-Conflicting Actions
Two actions are non-conflicting if, whenever they occur consecutively in a schedule, swapping them does not affect the final state produced by the schedule. Otherwise, they are conflicting.

Conflicting Actions: General Rules
 Two actions of the same transaction conflict: e.g. r1(A) w1(B)
 Two actions on the same database element conflict if at least one of them is a write: r1(A) w2(A); w1(A) w2(A)
Conflict actions
Two or more actions are said to be in conflict if: The actions belong to different transactions. At least one of the actions is a write operation. The actions access the same object (read or write).
The following set of actions is conflicting: T1:R(X), T2:W(X), T3:W(X)
While the following sets of actions are not: T1:R(X), T2:R(X), T3:R(X) T1:R(X), T2:W(Y), T3:R(X)
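The conditions above can be encoded as a small predicate. A hypothetical helper (mine, not from the slides), where an action is a (transaction, kind, element) triple with kind "R" or "W":

```python
# Decide whether two actions of a schedule conflict.

def conflicts(a1, a2):
    t1, kind1, elem1 = a1
    t2, kind2, elem2 = a2
    if t1 != t2:
        # Different transactions: conflict only on the same element,
        # with at least one of the two actions being a write.
        return elem1 == elem2 and ("W" in (kind1, kind2))
    # Two actions of the same transaction always conflict:
    # their order is fixed by the transaction itself.
    return True

assert conflicts(("T1", "R", "X"), ("T2", "W", "X"))      # read/write, same element
assert not conflicts(("T1", "R", "X"), ("T2", "R", "X"))  # two reads never conflict
assert not conflicts(("T1", "R", "X"), ("T2", "W", "Y"))  # different elements
```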
Conflict Serializable
We may take any schedule and make as many non-conflicting swaps as we wish, with the goal of turning the schedule into a serial schedule.
If we can do so, then the original schedule is serializable, because its effect on the database state remains the same as we perform each of the non-conflicting swaps.
Conflict Serializable
 A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or more serial schedules.
 Equivalently, a schedule is conflict-serializable if and only if its precedence graph (serializability graph) is acyclic.
 The example schedule above is conflict-equivalent to the serial schedule <T1,T2>, but not to <T2,T1>.
Conflict equivalent / conflict-serializable
Let Ai and Aj be consecutive non-conflicting actions that belong to different transactions. We can swap Ai and Aj without changing the result.
Two schedules are conflict equivalent if they can be turned one into the other by a sequence of non-conflicting swaps of adjacent actions.
We shall call a schedule conflict-serializable if it is conflict-equivalent to a serial schedule.
Turning a schedule into a serial schedule by swapping adjacent non-conflicting actions (subscript 1 = T1, subscript 2 = T2):

r1(A); w1(A); r2(A); w2(A); r1(B); w1(B); r2(B); w2(B)
 → swap w2(A) and r1(B):
r1(A); w1(A); r2(A); r1(B); w2(A); w1(B); r2(B); w2(B)
 → swap r2(A) and r1(B):
r1(A); w1(A); r1(B); r2(A); w2(A); w1(B); r2(B); w2(B)
 → swap w2(A) and w1(B):
r1(A); w1(A); r1(B); r2(A); w1(B); w2(A); r2(B); w2(B)
 → swap r2(A) and w1(B):
r1(A); w1(A); r1(B); w1(B); r2(A); w2(A); r2(B); w2(B)   -- a serial schedule (T1, T2)
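The acyclic-precedence-graph test mentioned earlier can be sketched directly. A Python sketch (function names and the action encoding are mine): build an edge Ti → Tj whenever an action of Ti precedes and conflicts with an action of Tj, then test for cycles.

```python
# Conflict-serializability via the precedence (serializability) graph.

def precedence_graph(schedule):
    """schedule: list of (txn, kind, elem), kind in {"R", "W"}.
    Edge (Ti, Tj) when an action of Ti precedes and conflicts with one of Tj."""
    edges = set()
    for i, (t1, k1, e1) in enumerate(schedule):
        for t2, k2, e2 in schedule[i + 1:]:
            if t1 != t2 and e1 == e2 and "W" in (k1, k2):
                edges.add((t1, t2))
    return edges

def is_acyclic(edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    nodes = {n for edge in edges for n in edge}
    succ = {n: set() for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].add(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while ready:
        u = ready.pop()
        seen += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return seen == len(nodes)

# The serializable and nonserializable schedules from the earlier slides:
good = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
        ("T1","R","B"), ("T1","W","B"), ("T2","R","B"), ("T2","W","B")]
bad  = [("T1","R","A"), ("T1","W","A"), ("T2","R","A"), ("T2","W","A"),
        ("T2","R","B"), ("T2","W","B"), ("T1","R","B"), ("T1","W","B")]

print(is_acyclic(precedence_graph(good)))  # True  -> conflict-serializable
print(is_acyclic(precedence_graph(bad)))   # False -> cycle T1->T2->T1
```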
References
 Database Systems: The Complete Book (2nd Edition), Hector Garcia-Molina, Jeffrey D. Ullman, Jennifer Widom. Publisher: Prentice Hall.
http://en.wikipedia.org/wiki/Concurrency_control http://www.utdallas.edu/~mxk055100/db07files/serilizable-defs.ppt http://en.wikipedia.org/wiki/Schedule_(computer_science)#Serializable http://www.cs.duke.edu/~shivnath/courses/fall06/Lectures/11_serial.ppt
CONCURRENCY CONTROL
By Donavon Norwood
Ankit Patel
Aniket Mulye
INTRODUCTION
Enforcing serializability by locks Locks Locking scheduler Two phase locking
Locking systems with several lock modes Shared and exclusive locks Compatibility matrices Upgrading/updating locks Incrementing locks
Locks
It works as follows:
 A transaction sends a lock request to the scheduler
 The scheduler checks the lock table
 The scheduler grants or delays requests so as to generate a serializable schedule of actions
Consistency of transactions
Actions and locks must relate to each other: transactions can read and write an element only if they hold a lock on it and have not yet released it.
 Unlocking every locked element is compulsory.
 Legality of schedules
 No two transactions may hold a lock on the same element at the same time; the prior holder must release it first.
Locking scheduler
Grants a lock request only if granting it results in a legal schedule.
Lock table stores the information about current locks on the elements.
The locking scheduler (contd.)
 A legal schedule of consistent transactions that, unfortunately, is not serializable.
Locking schedule (contd.)
The locking scheduler delays requests that would result in an illegal schedule.
Two-phase locking
Guarantees that a legal schedule of consistent transactions is conflict-serializable.
 All lock requests precede all unlock requests.
 The growing phase: acquire locks; no unlocks allowed.
 The shrinking phase: release locks; no new locks allowed.
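The two-phase condition is easy to check mechanically. A minimal Python sketch (my own representation, not from the slides): a transaction's lock/unlock sequence is two-phase exactly when no lock request follows the first unlock.

```python
# Check that one transaction's action sequence obeys two-phase locking.

def is_two_phase(actions):
    """actions: list of ("lock", elem) or ("unlock", elem) for one transaction."""
    unlocked = False
    for kind, _ in actions:
        if kind == "unlock":
            unlocked = True       # the shrinking phase has begun
        elif unlocked:
            return False          # a lock request in the shrinking phase
    return True

assert is_two_phase([("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")])
assert not is_two_phase([("lock", "A"), ("unlock", "A"), ("lock", "B"), ("unlock", "B")])
```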
Working of Two-Phase Locking
 Assures serializability. Two stricter variants of 2PL:
 Strict two-phase locking: a transaction holds all its exclusive locks until commit/abort.
 Rigorous two-phase locking: a transaction holds all its locks until commit/abort.
 For any transaction Ti that does not obey 2PL, it is possible to find a 2PL transaction Tj and a schedule S of Ti and Tj that is not conflict-serializable.
Failure of 2PL.
2PL does not protect against deadlocks.
Locking Systems with Several Lock Modes Locking Scheme
Shared/Read Lock ( For Reading) Exclusive/Write Lock( For Writing)
Compatibility Matrices Upgrading Locks Update Locks Increment Locks
Shared & Exclusive Locks
Consistency of Transactions
  - Cannot write without holding an exclusive lock
  - Cannot read without holding some lock
 This basically works on 2 principles:
  - A read action must be preceded by a shared or an exclusive lock
  - A write action must be preceded by an exclusive lock
 All locks must be released before commit
Shared and exclusive locks (cont.)
Two-phase locking of transactions: all locking must precede all unlocking.
 Legality of Schedules: an element may be locked exclusively by one transaction, or by several in shared mode, but not both.
Compatibility Matrices
Has a row and column for each lock mode. Rows correspond to a lock held on an
element by another transaction Columns correspond to mode of lock
requested. Example :
Lock requested:      S      X
Lock held  S:       YES    NO
Lock held  X:       NO     NO
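The matrix above can be used directly as a lookup table. A Python sketch (names are mine): a requested mode is grantable only if it is compatible with every mode currently held by other transactions.

```python
# The shared/exclusive compatibility matrix as a table.

COMPAT = {  # (mode held by another transaction, mode requested) -> grantable?
    ("S", "S"): True,  ("S", "X"): False,
    ("X", "S"): False, ("X", "X"): False,
}

def grant(held_modes, requested):
    """True when `requested` is compatible with every held mode."""
    return all(COMPAT[(h, requested)] for h in held_modes)

assert grant(["S", "S"], "S")   # many readers may share an element
assert not grant(["S"], "X")    # cannot write while another transaction reads
assert not grant(["X"], "S")    # cannot read while another transaction writes
```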
Upgrading Locks
Suppose a transaction wants to read as well as write:
  - It acquires a shared lock on the element
  - It performs its calculations on the element
  - When it is ready to write, it is granted an exclusive lock
 Transactions that cannot predict their read/write pattern in advance can use lock upgrading.
Upgrading locks (cont.)
Indiscriminate use of upgrading can produce a deadlock.
 Example: both transactions want to upgrade on the same element.
Update locks
Solves the deadlock that arises with the upgrade-lock method.
 A transaction holding an update lock can read but cannot write.
 The update lock can later be converted (upgraded) to an exclusive lock.
 An update lock can be granted on an element only when at most shared locks are held on it.
Update locks (cont.)
An update lock behaves like a shared lock when you request it and like an exclusive lock once you hold it.
Compatibility matrix :
Lock requested:      S      X      U
Lock held  S:       YES    NO    YES
Lock held  X:       NO     NO     NO
Lock held  U:       NO     NO     NO
Increment Locks
Used for incrementing & decrementing stored values.
E.g., transferring money from one bank account to another, or ticket-selling transactions in which the number of available seats is decremented after each sale.
Increment Locks (cont.)
 An increment lock does not enable read or write actions on the element.
 Any number of transactions can hold an increment lock on an element at the same time.
 Shared and exclusive locks cannot be granted while an increment lock is held on the element.
Lock requested:      S      X      I
Lock held  S:       YES    NO     NO
Lock held  X:       NO     NO     NO
Lock held  I:       NO     NO    YES
References
 Database Systems: The Complete Book, Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom — topics 18.3, 18.4
CONCURRENCY CONTROL: 18.4
LOCKING SYSTEMS WITH SEVERAL LOCK MODES
CS257 Spring/2009
Professor: Tsau Lin
Student: Suntorn Sae-Eung
ID: 212
18.4 Locking Systems with Several Lock Modes
In 18.3, a transaction must lock a database element (X) whether it reads or writes it.
 Yet there is no reason why several transactions could not read X at the same time, as long as none writes X.
 This motivates a locking scheme with two modes:
  - Shared/Read lock (for reading)
  - Exclusive/Write lock (for writing)
18.4.1 Shared & Exclusive Locks
Transaction Consistency
  - Cannot write without holding an exclusive lock
  - Cannot read without holding some lock
  - The lock for writing is considered "stronger" than the lock for reading
 This basically works on 2 principles:
  1. A read action must be preceded by a shared or an exclusive lock
  2. A write action must be preceded by an exclusive lock
 All locks must be released before commit
18.4.1 Shared & Exclusive Locks (cont.)
Two-phase locking (2PL) of transactions: for each transaction Ti, all Lock actions come first, then the R/W actions, then the Unlock actions.
 Notation:
  sli(X) — Ti requests a shared lock on DB element X
  xli(X) — Ti requests an exclusive lock on DB element X
  ui(X) — Ti relinquishes whatever lock it holds on X
18.4.1 Shared & Exclusive Locks (cont.)
Legality of Schedules: an element may be locked by one writing transaction, or by several reading transactions in shared mode, but not both.
18.4.2 Compatibility Matrices
 A convenient way to describe lock-management policies.
 Rows correspond to a lock already held on an element by another transaction.
 Columns correspond to the mode of the lock requested.
 Example:
Lock requested:      S      X
Lock held  S:       YES    NO
Lock held  X:       NO     NO
18.4.3 Upgrading Locks
A transaction (T) taking a shared lock is friendly toward other transactions.
 When T wants to read and then write a new value of X:
 1. T takes a shared lock on X.
 2. T performs its operations on X (which may take a long time).
 3. When T is ready to write the new value, it "upgrades" its shared lock to an exclusive lock on X.
18.4.3 Upgrading Locks (cont.)
Observe the example: T1 cannot take an exclusive lock on B until all other locks on B are released.
 Once B is released, T1 retries and succeeds.
18.4.3 Upgrading Locks (cont.)
 Upgrading can easily cause a deadlock: if both transactions want to upgrade on the same element, both will wait forever.
18.4.4 Update locks
A third lock mode that resolves the deadlock problem. Its rules are:
  - Only an "update lock" can be upgraded to a write (exclusive) lock later.
  - An "update lock" may be granted on X even when shared locks are already held on X.
  - Once an "update lock" is held on X, it prevents any additional locks from being granted, and it later changes to a write (exclusive) lock.
 Notation: uli(X)
18.4.4 Update locks (cont.)
Example
18.4.4 Update locks (cont.)
• Compatibility matrix (note that it is asymmetric):

Lock requested:      S      X      U
Lock held  S:       YES    NO    YES
Lock held  X:       NO     NO     NO
Lock held  U:       NO     NO     NO
18.4.5 Increment Locks
A useful lock mode for transactions that increase or decrease a value, e.g. money transfer between two bank accounts.
 If two transactions (T1, T2) add constants to the same database element (X), it doesn't matter which goes first, as long as no reads of X occur in between their increments.
 See the following exhibits.
18.4.5 Increment Locks (cont.)
 Example: starting from A = 5, with T1: INC(A,2) and T2: INC(A,10)
 CASE 1: T1 first — A = 5 → 7, then T2: A = 7 → 17
 CASE 2: T2 first — A = 5 → 15, then T1: A = 15 → 17
 Either order yields A = 17: the increments commute.

18.4.5 Increment Locks (cont.)
 What if the reads interleave? Both T1 and T2 read A = 5; T1 writes A = 7, and T2 writes A = 15.
 The final value is 15 (or 7), not 17 — which is why each increment must execute atomically.
18.4.5 Increment Locks (cont.)
 INC(A, c) — the increment action of writing on database element A; an atomic execution consisting of:
  1. READ(A,t);
  2. t := t+c;
  3. WRITE(A,t);
 Notation:
  ili(X) — action of Ti requesting an increment lock on X
  inci(X) — action of Ti incrementing X by some constant; we don't care about the value of the constant.
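The atomicity requirement can be illustrated in code. A Python sketch (my own, with a single in-process lock standing in for the scheduler's increment lock): INC(A,c) executed as one READ–add–WRITE unit, reproducing the A = 5 example from the earlier slide.

```python
import threading

# INC(A, c) as an atomic READ(A,t); t = t+c; WRITE(A,t) unit.

class Element:
    def __init__(self, value):
        self.value = value
        self._inc_lock = threading.Lock()   # stand-in for the increment lock

    def inc(self, c):
        with self._inc_lock:                # the three steps run as one unit
            t = self.value                  # READ(A,t)
            t = t + c                       # t = t + c
            self.value = t                  # WRITE(A,t)

A = Element(5)
t1 = threading.Thread(target=A.inc, args=(2,))   # T1: INC(A,2)
t2 = threading.Thread(target=A.inc, args=(10,))  # T2: INC(A,10)
t1.start(); t2.start()
t1.join(); t2.join()
print(A.value)  # 17 in either execution order, as in CASE 1 and CASE 2
```

Without the lock, both threads could read A = 5 before either writes, losing one increment — the interleaved-reads anomaly from the previous slide.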
18.4.5 Increment Locks (cont.) Example
18.4.5 Increment Locks (cont.)
• Compatibility matrix:

Lock requested:      S      X      I
Lock held  S:       YES    NO     NO
Lock held  X:       NO     NO     NO
Lock held  I:       NO     NO    YES
References
H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: The Complete Book, 2nd edition, chapters 18.3–18.4, pp. 897–913, Prentice Hall, New Jersey, 2008.
Concurrency Control
Chapter 18
Section 18.5
Presented by Khadke, Suvarna
CS 257 (Section II) Id 213
Overview
Assumed knowledge: locks, two-phase locking, lock modes (shared, exclusive, update)
 A simple scheduler architecture based on the following principles:
  - Insert lock actions into the stream of reads, writes, and other actions
  - Release locks when the transaction manager tells the scheduler that the transaction will commit or abort
Scheduler That Inserts Lock Actions into the transactions request stream
Scheduler That Inserts Lock Actions
 If a transaction is delayed, waiting for a lock, the scheduler performs the following actions:
 Part I: takes the stream of requests generated by the transactions and inserts appropriate lock actions ahead of the database operations (read, write, or update)
 Part II: takes the actions (lock requests or database operations) passed on by Part I and executes each one.
 It determines the transaction (T) the action belongs to and the status of T (delayed or not). If T is not delayed, then:
 1. A database-access action is transmitted to the database and executed.
Scheduler That Inserts Lock Actions
2. If a lock action is received by Part II, it checks the lock table to see whether the lock can be granted:
  i. if granted, the lock table is modified to include the new lock;
  ii. if not granted, the lock table is updated to record the requested lock, and Part II delays transaction T.
 3. When T commits or aborts, Part I is notified by the transaction manager and releases all of T's locks.
The Lock Table
A relation that associates database elements with locking information about that element
Implemented with a hash table using database elements as the hash key
Its size is proportional to the number of locked elements only, not to the size of the entire database
(Figure: a hash-table entry mapping DB element A to the lock information for A)
Lock Table Entry Structure
 Information found in a lock table entry:
 1. Group mode:
  - S: only shared locks are held
  - X: one exclusive lock and no other locks
  - U: one update lock and possibly one or more shared locks
 2. Waiting?: whether some transaction is waiting for a lock on A
 3. A list: the transactions that currently hold locks on A or are waiting for a lock on A
Handling Lock Requests
Suppose transaction T requests a lock on A
If there is no lock table entry for A, then there are no locks on A, so create the entry and grant the lock request
If the lock table entry for A exists, use the group mode to guide the decision about the lock request
Handling Lock Requests (cont.)
 If the group mode is U (update) or X (exclusive):
  - No other lock can be granted
  - Deny the lock request by T
  - Place an entry on the list saying T requests a lock, with Wait? = 'yes'
 If the group mode is S (shared):
  - Another shared or update lock can be granted
  - Grant the request for an S or U lock
  - Create an entry for T on the list with Wait? = 'no'
  - Change the group mode to U if the new lock is an update lock
Handling Unlock Requests
 Now suppose transaction T unlocks A:
  - Delete T's entry on the list for A
  - If T's lock mode is not the same as the group mode, there is no need to change the group mode
  - Otherwise, examine the entire list to determine the new group mode:
   S: the new group mode is S, or no lock at all
   U: the new group mode is S, or no lock at all
   X: no other locks were held, so there is no lock at all
Handling Unlock Requests (cont.)
 If the value of Wait? is 'yes', grant one or more waiting locks using one of the following approaches:
  - First-Come-First-Served: grant the lock to the longest-waiting request; no starvation (no transaction waits forever for a lock)
  - Priority to Shared Locks: grant all waiting S locks, then one U lock; grant an X lock only if nothing else is waiting
  - Priority to Upgrading: if there is a U lock waiting to upgrade to an X lock, grant that first
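The grant logic above can be sketched as a small lock-table entry. A simplified Python sketch (class and field names are mine, not the book's code), covering the group mode and the granted/waiting lists for S, U, and X requests:

```python
# A lock-table entry with a group mode, granting S/U/X requests.

COMPAT = {  # (group mode held, mode requested) -> grantable?
    ("S", "S"): True,  ("S", "U"): True,  ("S", "X"): False,
    ("U", "S"): False, ("U", "U"): False, ("U", "X"): False,
    ("X", "S"): False, ("X", "U"): False, ("X", "X"): False,
}

class LockEntry:
    def __init__(self):
        self.group_mode = None   # None: no locks currently held on this element
        self.holders = []        # (txn, mode) entries with Wait? = 'no'
        self.waiting = []        # (txn, mode) entries with Wait? = 'yes'

    def request(self, txn, mode):
        if self.group_mode is None or COMPAT[(self.group_mode, mode)]:
            self.holders.append((txn, mode))
            # U or X dominates S as the group mode of the entry.
            if self.group_mode is None or mode in ("U", "X"):
                self.group_mode = mode
            return True          # granted
        self.waiting.append((txn, mode))
        return False             # txn must wait

table = {"A": LockEntry()}
assert table["A"].request("T1", "S")      # first lock on A: granted
assert table["A"].request("T2", "U")      # U is compatible with the held S
assert not table["A"].request("T3", "S")  # group mode is now U: must wait
```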
Reference List
GARCIA-MOLINA H., ULLMAN J. D. & WIDOM J., Database Systems: The Complete Book, 2nd Edition, 2008.
Concurrency Control
Managing Hierarchies of Database Elements (18.6)
Presented by
Ronak Shah
(214)
March 9, 2009
Managing Hierarchies of Database Elements
Two problems arise with locks when there is a tree structure to the data:
 1. When the tree structure is a hierarchy of lockable elements: determine how locks are granted for both large elements (relations) and smaller elements (blocks containing tuples, or individual tuples).
 2. When the data itself is organized as a tree (e.g., B-tree indexes): this will be discussed in the next section.
Locks with Multiple Granularity
 A database element can be a relation, a block, or a tuple.
 Different systems use different sizes of database elements as the unit of locking: some require small elements such as tuples or blocks, and others require large elements such as relations.
Example of Multiple Granularity Locks
Consider a database for a bank.
 Choosing relations as database elements means we would have one lock for an entire relation.
 For a relation holding account balances, this kind of lock would be very inflexible and thus provide very little concurrency.
 Why? Because balance transactions require exclusive locks, so only one transaction could run against the whole relation at a time.
 But since each account is independent of the others, we could perform transactions on different accounts simultaneously.
…(contd.)
Thus it makes sense to use blocks as lock elements, so that two accounts on different blocks can be updated simultaneously.
 Another example is that of a document: by a similar argument, it is better to use a large element (the complete document) as the unit of locking in this case.
Warning (Intention) Locks
These are required to manage locks at different granularities.
 In the bank example, if a shared lock is obtained on the relation while there are exclusive locks on individual tuples, unserializable behavior can occur.
 The rules for managing locks on a hierarchy of database elements constitute the warning protocol.
Database Elements Organized in Hierarchy
Rules of the Warning Protocol
 These involve both ordinary (S and X) and warning (IS and IX) locks:
 1. Begin at the root of the hierarchy.
 2. Request the S/X lock if we are at the desired element.
 3. If the desired element is further down the hierarchy, place a warning lock (IS if S, IX if X).
 4. When the warning lock is granted, proceed to the appropriate child node and repeat until the desired node is reached.
Compatibility Matrix for Shared, Exclusive and Intention Locks
        IS     IX     S      X
IS      Yes    Yes    Yes    No
IX      Yes    Yes    No     No
S       Yes    No     Yes    No
X       No     No     No     No
• The above matrix applies only to locks held by other transactions
Group Modes of Intention Locks
 A transaction can hold both S and IX locks on an element at the same time (to read the entire element and also modify sub-elements), if they are requested within the same transaction.
 This combination can be treated as another lock mode, SIX, carrying the restrictions of both locks, i.e. "No" for every requested mode except IS.
 SIX serves as the group mode in this case.
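The intention-lock matrix, extended with the SIX mode just described, can be written as a lookup table. A Python sketch (my tabulation; the SIX row and column follow the "No for all except IS" rule stated above):

```python
# Intention-lock compatibility: COMPAT[held][requested] -> grantable?

COMPAT = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

assert COMPAT["IS"]["IX"]      # intention modes coexist freely
assert not COMPAT["S"]["IX"]   # a whole-element S blocks sub-element writes
assert not COMPAT["SIX"]["S"]  # SIX: "No" for everything except IS
```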
Example
Consider a transaction T1: SELECT * FROM table WHERE attribute1 = 'abc'
 Here, an IS lock is first acquired on the entire relation; then, moving to the individual tuples with attribute1 = 'abc', an S lock is acquired on each of them.
 Consider another transaction T2: UPDATE table SET attribute2 = 'def' WHERE attribute1 = 'ghi'
 T2 requires an IX lock on the relation; since T1's IS lock is compatible, the IX lock is granted.
 On reaching the desired tuple ('ghi'), since no lock is held there, T2 gets an X lock too.
 If T2 were updating the same tuple as T1, it would have to wait until T1 released its S lock.
Phantoms and Handling Insertions Correctly
 This problem arises when transactions create new sub-elements of lockable elements.
 Since we can lock only existing elements, the new elements fail to be locked.
 The problem this creates is explained in the following example.
Example
Consider a transaction T3
SELECT SUM(length) FROM table WHERE attribute1 = 'abc'
 This computes the total length of all tuples having attribute1 = 'abc'.
 Thus, T3 acquires an IS lock on the relation and S locks on the targeted tuples.
Now, if another transaction T4 inserts a new tuple having attribute1 = ‘abc’, the result of T3 becomes incorrect
Example (…contd.)
This is not a concurrency problem since the serial order (T3, T4) is maintained
But if both T3 and T4 were to write an element X, it could lead to unserializable behavior:
  r3(t1); r3(t2); w4(t3); w4(X); w3(L); w3(X)
 Here r3 and w3 are read and write operations by T3, w4 are the write operations by T4, and L is the total length calculated by T3 (t1 + t2).
At the end, we have result of T3 as sum of lengths of t1 and t2 and X has value written by T3
This is not right; if value of X is considered to be that written by T3 then for the schedule to be serializable, the sum of lengths of t1, t2 and t3 should be considered
Example (…contd.)
Else if the sum is total length of t1 and t2 then for the schedule to be serializable, X should have value written by T4
This problem arises since the relation has a phantom tuple (the new inserted tuple), which should have been locked but wasn’t since it didn’t exist at the time locks were taken
The occurrence of phantoms can be avoided if all insertion and deletion transactions are treated as write operations on the whole relation
CONCURRENCY CONTROL
SECTION 18.7THE TREE PROTOCOL
By :Saloni Tamotia (215)
BASICS
 B-Trees
  - tree data structure that keeps data sorted
  - allows searches, insertions, and deletions
  - commonly used in databases and file systems
 Lock
  - enforces limits on access to resources
  - a way of enforcing concurrency control
 Lock Granularity
  - the level and type of information that a lock protects
TREE PROTOCOL
 A kind of graph-based protocol; an alternative to two-phase locking (2PL)
 Database elements are disjoint pieces of data
 Nodes of the tree do NOT form a hierarchy based on containment
 The only way to get to a node is through its parent
 Example: B-Tree
ADVANTAGES OF THE TREE PROTOCOL
 Unlocking takes less time as compared to 2PL
 Freedom from deadlocks
18.7.1 MOTIVATION FOR TREE-BASED LOCKING
 Consider a B-tree index, treating individual nodes as lockable database elements.
 Concurrent use of the B-tree is not possible with a standard set of locks and 2PL.
 Therefore, a protocol is needed that can assure serializability while allowing access to elements at the bottom of the tree even though 2PL is violated.
18.7.1 MOTIVATION FOR TREE-BASED LOCKING (cont.)
 Why is concurrent use of a B-tree not possible with a standard set of locks and 2PL?
  - Every transaction must begin by locking the root node.
  - A 2PL transaction cannot unlock the root until it has acquired all the locks it needs.
18.7.2 ACCESSING TREE-STRUCTURED DATA
 Assumptions:
  - Only one kind of lock
  - Consistent transactions
  - Legal schedules
  - No 2PL requirement on transactions
18.7.2 RULES FOR ACCESSING TREE-STRUCTURED DATA
 Rules:
  - The first lock can be at any node.
  - Subsequent locks may be acquired only if the parent node is currently locked.
  - Nodes may be unlocked at any time.
  - A node may not be relocked, even if the node's parent is still locked.
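The four rules above can be checked for one transaction's lock sequence. A Python sketch (my own representation, not from the slides), given a parent map for the tree:

```python
# Validate one transaction's lock/unlock sequence against the tree protocol.

def follows_tree_protocol(actions, parent):
    """actions: list of ("lock", node) / ("unlock", node).
    parent: dict node -> parent node (the root maps to None)."""
    held = set()
    ever_locked = set()
    first = True
    for kind, node in actions:
        if kind == "lock":
            if node in ever_locked:
                return False            # rule 4: no relocking
            if not first and parent[node] not in held:
                return False            # rule 2: must hold the parent's lock
            held.add(node)
            ever_locked.add(node)
            first = False               # rule 1: only the first lock is free
        else:
            held.discard(node)          # rule 3: unlock at any time
    return True

parent = {"root": None, "L": "root", "R": "root", "LL": "L"}
assert follows_tree_protocol(
    [("lock", "root"), ("lock", "L"), ("unlock", "root"), ("lock", "LL")], parent)
assert not follows_tree_protocol(
    [("lock", "root"), ("lock", "LL")], parent)   # parent L was never locked
```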
18.7.3 WHY THE TREE PROTOCOL WORKS
 The tree protocol implies a serial order on the transactions in a schedule.
 Order of precedence: Ti <S Tj if Ti locks the root before Tj does.
 Then Ti locks every node it has in common with Tj before Tj does.
ORDER OF PRECEDENCE
CONCURRENCY CONTROL
SECTION 18.8Timestamps
By :Rupinder Singh (216)
What is Timestamping?
The scheduler assigns each transaction T a unique number, its timestamp TS(T).
Timestamps must be issued in ascending order, at the time when a transaction first notifies the scheduler that it is beginning.
Timestamp TS(T)
Two methods of generating Timestamps. Use the value of system, clock as the
timestamp. Use a logical counter that is incremented
after a new timestamp has been assigned. Scheduler maintains a table of currently
active transactions and their timestamps irrespective of the method used
Timestamps for database element X and the commit bit
 RT(X): the read time of X — the highest timestamp of any transaction that has read X.
 WT(X): the write time of X — the highest timestamp of any transaction that has written X.
 C(X): the commit bit for X — true if and only if the most recent transaction to write X has already committed.
Physically Unrealizable Behavior
Read too late: a transaction U that started after transaction T, but wrote a value for X before T reads X.
 (Timeline: T starts; U starts; U writes X; T reads X)
Physically Unrealizable Behavior
Write too late: a transaction U that started after T, but read X before T got a chance to write X.
 (Timeline: T starts; U starts; U reads X; T writes X — transaction T tries to write too late)
Dirty Read
It is possible that after T reads the value of X written by U, transaction U will abort.
 (Timeline: U starts; U writes X; T starts; T reads X; U aborts — T performs a dirty read if it reads X at the point shown)
Rules for Timestamp-Based Scheduling
 1. The scheduler receives a request rT(X):
  a) If TS(T) ≥ WT(X), the read is physically realizable.
   1. If C(X) is true, grant the request; if TS(T) > RT(X), set RT(X) := TS(T), otherwise do not change RT(X).
   2. If C(X) is false, delay T until C(X) becomes true or the transaction that wrote X aborts.
  b) If TS(T) < WT(X), the read is physically unrealizable. Roll back T.
Rules for Timestamp-Based Scheduling (cont.)
 2. The scheduler receives a request wT(X):
  a) If TS(T) ≥ RT(X) and TS(T) ≥ WT(X), the write is physically realizable and must be performed:
   1. write the new value for X,
   2. set WT(X) := TS(T), and
   3. set C(X) := false.
  b) If TS(T) ≥ RT(X) but TS(T) < WT(X), the write is physically realizable, but there is already a later value in X:
   a. if C(X) is true, the previous writer of X has committed, so simply ignore the write by T;
   b. if C(X) is false, we must delay T.
  c) If TS(T) < RT(X), the write is physically unrealizable, and T must be rolled back.
Rules for Timestamp-Based Scheduling (cont.)
 3. The scheduler receives a request to commit T: it must find all database elements X written by T and set C(X) := true. Any transactions waiting for X to be committed are then allowed to proceed.
 4. The scheduler receives a request to abort T, or decides to roll back T: any transaction that was waiting on an element X that T wrote must repeat its attempt to read or write.
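The core realizability tests can be sketched compactly. A simplified Python sketch (names are mine; the commit bit C(X) and the delay cases are omitted, keeping only the RT/WT comparisons from rules 1 and 2):

```python
# Timestamp tests for reads and writes against RT(X) and WT(X).

class Element:
    def __init__(self):
        self.rt = 0   # RT(X): highest timestamp that has read X
        self.wt = 0   # WT(X): highest timestamp that has written X

def read(ts, x):
    if ts < x.wt:
        return "rollback"   # read too late: a later transaction already wrote X
    x.rt = max(x.rt, ts)
    return "ok"

def write(ts, x):
    if ts < x.rt:
        return "rollback"   # write too late: a later transaction already read X
    if ts < x.wt:
        return "ignore"     # rule 2b: a later value is already in place
    x.wt = ts
    return "ok"

x = Element()
assert write(ts=2, x=x) == "ok"        # transaction with TS 2 writes X
assert read(ts=1, x=x) == "rollback"   # an older transaction reads too late
assert read(ts=3, x=x) == "ok"         # RT(X) becomes 3
assert write(ts=2, x=x) == "rollback"  # TS 2 < RT(X) = 3: write too late
```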
Multiversion Timestamps
Multiversion schemes keep old versions of data items to increase concurrency.
Each successful write results in the creation of a new version of the data item written.
Use timestamps to label versions. When a read(X) operation is issued, select
an appropriate version of X based on the timestamp of the transaction, and return the value of the selected version.
Timestamps and Locking
Generally, timestamping performs better than locking in situations where:
  - most transactions are read-only;
  - it is rare that concurrent transactions try to read and write the same element.
 In high-conflict situations, locking performs better than timestamping.
CONCURRENCY CONTROL BY VALIDATION (18.9)
By
Swathi Vegesna
217
At a Glance
Introduction Validation based scheduling Validation based Scheduler Expected exceptions Validation rules Example Comparisons Summary
Introduction
What is optimistic concurrency control? It assumes that no unserializable behavior will occur.
 Two optimistic techniques: timestamp-based scheduling and validation-based scheduling (both allow a transaction T to access data without locks).
Validation based scheduling
Scheduler keeps a record of what the active transactions are doing.
Each transaction executes in 3 phases:
 1. Read — reads the elements of its read set RS(T) and computes in its local address space
 2. Validate — compares its read and write sets against those of other transactions
 3. Write — writes the values of its write set WS(T)
Validation based Scheduler
Contains an assumed serial order of transactions.
Maintains three sets: START( ): set of T’s started but not
completed validation. VAL( ): set of T’s validated but not finished
the writing phase. FIN( ): set of T’s that have finished.
Expected exceptions
 1. Suppose there is a transaction U such that:
  - U is in VAL or FIN; that is, U has validated;
  - FIN(U) > START(T); that is, U did not finish before T started;
  - RS(T) ∩ WS(U) ≠ ∅; say it contains database element X.
 2. Suppose there is a transaction U such that:
  - U is in VAL; that is, U has successfully validated;
  - FIN(U) > VAL(T); that is, U did not finish before T entered its validation phase;
  - WS(T) ∩ WS(U) ≠ ∅; say X is in both write sets.
Validation rules
Rule 1: check that RS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T started, i.e. FIN(U) > START(T).
 Rule 2: check that WS(T) ∩ WS(U) = ∅ for any previously validated U that did not finish before T validated, i.e. FIN(U) > VAL(T).
Example
Solution
Validation of U: nothing to check.
Validation of T: RS(T) ∩ WS(U) = {A,B} ∩ {D} = φ; WS(T) ∩ WS(U) = {A,C} ∩ {D} = φ.
Validation of V: RS(V) ∩ WS(T) = {B} ∩ {A,C} = φ; WS(V) ∩ WS(T) = {D,E} ∩ {A,C} = φ; RS(V) ∩ WS(U) = {B} ∩ {D} = φ.
Validation of W: RS(W) ∩ WS(T) = {A,D} ∩ {A,C} = {A}; RS(W) ∩ WS(V) = {A,D} ∩ {D,E} = {D}; WS(W) ∩ WS(V) = {A,C} ∩ {D,E} = φ. (W is not validated.)
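The two validation rules and the example above can be replayed with a small checker. This is an illustrative sketch: the dict-based transaction records and the specific timestamp values are invented for the illustration, not taken from the book.

```python
# Sketch of the two validation-rule checks. Transactions are dicts with a
# read set RS, a write set WS, and START/VAL/FIN timestamps (all invented).

def validate(T, others):
    """Return True if transaction T passes both validation rules
    against the previously validated transactions in `others`."""
    for U in others:
        # Rule 1: U did not finish before T started, so T may have read
        # something U wrote afterwards: require RS(T) ∩ WS(U) = ∅.
        if U["FIN"] > T["START"] and T["RS"] & U["WS"]:
            return False
        # Rule 2: U did not finish before T validated, so the writes of T
        # and U could interleave: require WS(T) ∩ WS(U) = ∅.
        if U["FIN"] > T["VAL"] and T["WS"] & U["WS"]:
            return False
    return True

# The example above: U reads {B}, writes {D}; T reads {A,B}, writes {A,C};
# W reads {A,D}, writes {A,C}.
U = {"RS": {"B"}, "WS": {"D"}, "START": 1, "VAL": 2, "FIN": 10}
T = {"RS": {"A", "B"}, "WS": {"A", "C"}, "START": 3, "VAL": 4, "FIN": 11}
W = {"RS": {"A", "D"}, "WS": {"A", "C"}, "START": 5, "VAL": 6, "FIN": 12}

print(validate(T, [U]))     # True: both intersections with WS(U) are empty
print(validate(W, [U, T]))  # False: rule 1 fails, e.g. RS(W) ∩ WS(T) = {A}
```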
Comparison of Concurrency-Control Mechanisms

Locks
Storage utilization: space in the lock table is proportional to the number of database elements locked.
Delays: delays transactions but avoids rollbacks.

Timestamps
Storage utilization: space is needed for read and write times with every database element, whether or not it is currently accessed.
Delays: does not delay transactions, but causes them to roll back unless interference is low.

Validation
Storage utilization: space is used for timestamps and read/write sets for each currently active transaction, plus a few more transactions that finished after some currently active transaction began.
Delays: does not delay transactions, but causes them to roll back unless interference is low.
Summary
Concurrency control by validation The three phases Validation Rules Comparison
References
Database Systems: The Complete Book
21.1 INTRODUCTION TO INFORMATION INTEGRATION
CS257 Fan Yang
Need for Information Integration
Ideally, all the data in the world could be put in a single database (an ideal database system).
In the real world this is impossible for a single database: databases are created independently, and it is hard to design a database to support future use.
University Database
Registrar: to record student and grade Bursar: to record tuition payments by
students Human Resources Department: to record
employees Other department….
Inconvenient
Record grades for students who pay tuition.
Want to swim in the SJSU aquatic center for free during summer vacation?
(None of the cases above can be achieved by a single existing database.)
Solution: one database
How to integrate
Start over: build one database that contains all the legacy databases, and rewrite all the applications. Result: painful.
Build a layer of abstraction (middleware) on top of all the legacy databases; this layer is often defined by a collection of classes.
BUT…
Heterogeneity Problem
What is the Heterogeneity Problem? Aardvark Automobile Co. has 1000 dealers with 1000 databases. To find a model at another dealer, can we use this command:
SELECT * FROM CARS WHERE MODEL=“A6”;
Type of Heterogeneity
Communication Heterogeneity Query-Language Heterogeneity Schema Heterogeneity Data type difference Value Heterogeneity Semantic Heterogeneity
Conclusion
One database system would be ideal, but is impossible.
Independent databases are inconvenient.
Integrating databases:
1. start over
2. middleware
but either way we face the heterogeneity problem.
CHAPTER 21.2MODES OF INFORMATION
INTEGRATION
ID: 219
Name: Qun Yu
Class: CS257, Spring 2009
Instructor: Dr. T.Y. Lin
Content Index
21.2 Modes of Information Integration
21.2.1 Federated Database Systems
21.2.2 Data Warehouses
21.2.3 Mediators
Federations
The simplest architecture for integrating several DBs
One to one connections between all pairs of DBs
If n DBs talk to each other, n(n-1) wrappers are needed
Good when communications between DBs are limited
Wrapper: software that translates incoming queries and outgoing answers. As a result, it allows information sources to conform to some shared schema.
Wrapper
Federations Diagram
DB1 DB2
DB3 DB4
2 Wrappers
2 Wrappers
2 Wrappers
2 Wrappers
2 Wrappers 2 Wrappers
A federated collection of 4 DBs needs 12 components to translate queries from one to another.
Example
Car dealers want to share their inventory. Each dealer queries the other’s DB to find the needed car.
Dealer-1’s DB relation: NeededCars(model,color,autoTrans)
Dealer-2’s DB relations: Autos(serial, model, color)
Options(serial, option)
Dealer-1’s DB Dealer-2’s DB
wrapper
wrapper
Example…
Dealer 1 queries Dealer 2 for needed cars
for (each tuple (:m, :c, :a) in NeededCars) {
if (:a = TRUE) { /* automatic transmission wanted */
SELECT serial
FROM Autos, Options
WHERE Autos.serial = Options.serial AND Options.option = 'autoTrans'
AND Autos.model = :m AND Autos.color = :c;
}
else { /* automatic transmission not wanted */
SELECT serial
FROM Autos
WHERE Autos.model = :m AND
Autos.color = :c AND
NOT EXISTS (SELECT * FROM Options WHERE serial = Autos.serial
AND option = 'autoTrans');
}
}
Data Warehouse
Sources are translated from their local schema to a global schema and copied to a central DB.
User transparent: user uses Data Warehouse just like an ordinary DB
User is not allowed to update Data Warehouse
Warehouse Diagram
Warehouse
Extractor Extractor
Source 1 Source 2
User query
result
Combiner
Example
Construct a data warehouse from sources DB of 2 car dealers:
Dealer-1’s schema: Cars(serialNo, model, color, autoTrans, cdPlayer, …)
Dealer-2’s schema: Autos(serial, model, color)
Options(serial,option)
Warehouse’s schema: AutoWhse(serialNo,model,color,autoTrans,dealer)
Extractor --- Query to extract data from Dealer-1’s data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serialNo, model, color, autoTrans, 'dealer1'
FROM Cars;
Example
Extractor --- Query to extract data from Dealer-2’s data:
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'yes', 'dealer2'
FROM Autos, Options
WHERE Autos.serial = Options.serial AND
option = 'autoTrans';
INSERT INTO AutosWhse(serialNo, model, color, autoTrans, dealer)
SELECT serial, model, color, 'no', 'dealer2'
FROM Autos
WHERE NOT EXISTS (SELECT * FROM Options
WHERE serial = Autos.serial AND option = 'autoTrans');
Construct Data Warehouse

There are mainly 3 ways of constructing the data in the warehouse:

1) Periodically reconstruct the warehouse from the current data in the sources, once a night or at even longer intervals.

Advantages:

simple algorithms.

Disadvantages:

1) need to shut down the warehouse;

2) data can become out of date.
Construct Data Warehouse
2) Updated periodically based on the changes(i.e. each night) of the sources.
Advantages:
involve smaller amounts of data. (important when warehouse is large and needs to be modified in a short period)
Disadvantages:
1) the process to calculate changes to the warehouse is complex.
2) data can become out of date.
Construct Data Warehouse
3) Changed immediately, in response to each change or a small set of changes at one or more of the sources.
Advantages: data won’t become out of date.
Disadvantages: requires too much communication; therefore, it is generally too expensive.
(Practical only for warehouses whose underlying sources change slowly.)
Mediators
A virtual warehouse, which supports a virtual view or a collection of views that integrates several sources.
A mediator doesn’t store any data. The mediator’s tasks: 1) receive the user’s query, 2) send queries to the wrappers, 3) combine results from the wrappers, 4) send the final result to the user.
A Mediator diagram
Mediator
Wrapper Wrapper
Source 1 Source 2
User query
Query
Query
QueryQuery
Result
Result
Result
Result
Result
Example
Same data sources as in the data warehouse example; the mediator integrates the same two dealers’ sources into a view with schema:
AutoMed(serialNo,model,color,autoTrans,dealer)
When the user has a query:
SELECT serialNo, model FROM AutoMed WHERE color = 'red';
Example
In this simple case, the mediator forwards the same query to each of the two wrappers.
Wrapper1: Cars(serialNo, model, color, autoTrans, cdPlayer, …)SELECT serialNo,model FROM carsWHERE color = ‘red’;
Wrapper2: Autos(serial,model,color); Options(serial,option)SELECT serial, modelFROM AutosWHERE color=‘red’;
The mediator needs to interpret serial as serialNo, and then return the union of these sets of data to the user.
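The mediator's behavior in this example can be sketched in a few lines. The wrapper functions and the sample inventories below are invented for illustration; the point is the rename of Dealer 2's serial to serialNo and the union of the two answer sets.

```python
# Minimal mediator sketch: forward the query to two (simulated) wrappers,
# rename Dealer 2's `serial` to `serialNo`, and union the results.
# The data and function names are invented for illustration.

DEALER1_CARS = [  # Cars(serialNo, model, color, ...)
    {"serialNo": 10, "model": "Gobi", "color": "red"},
]
DEALER2_AUTOS = [  # Autos(serial, model, color)
    {"serial": 20, "model": "Gobi", "color": "red"},
    {"serial": 21, "model": "Gobi", "color": "blue"},
]

def wrapper1(color):
    return [(c["serialNo"], c["model"]) for c in DEALER1_CARS
            if c["color"] == color]

def wrapper2(color):
    # Translation step: serial plays the role of serialNo at the mediator.
    return [(a["serial"], a["model"]) for a in DEALER2_AUTOS
            if a["color"] == color]

def mediator_query(color):
    # Union of the two wrappers' answers, duplicates eliminated.
    return sorted(set(wrapper1(color)) | set(wrapper2(color)))

print(mediator_query("red"))  # [(10, 'Gobi'), (20, 'Gobi')]
```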
Example
There may be different options for how the mediator forwards the user query. For example, the user asks whether there is a car of a specific model and color (e.g. “Gobi”, “blue”).
The mediator can decide whether the 2nd query is needed based on the result of the 1st query: if dealer-1 has the specific car, the mediator doesn’t have to query dealer-2.
THANK YOU!
Reference:
Database Systems: The Complete Book, 2nd Edition, Section 21.2
notes from http://infolab.stanford.edu/~ullman/dscb.html
CHAPTER 21 INFORMATION INTEGRATION
21.3 WRAPPERS IN MEDIATOR-BASED SYSTEMS
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
Class ID: 220
Intro Templates for Query patterns Wrapper Generator Filter
Wrappers in Mediator-based Systems
Wrappers here are more complicated than those in most data warehouse systems. They must be able to accept a variety of queries from the mediator and translate them to the terms of the source, and then communicate the result to the mediator.
How to design a wrapper?
Classify the possible queries that the mediator can ask into templates, which are queries with parameters that represent constants.
Templates for Query Patterns:
Use the notation T => S to express the idea that the template T is turned by the wrapper into the source query S.
Example 1: Dealer 1
Cars (serialNo, model, color, autoTrans, navi,…)
For use by a mediator with schemaAutoMed (serialNo, model, color,
autoTrans, dealer)
We denote the code representing that color by the parameter $c, then the template will be:
SELECT *FROM AutosMedWHERE color = ’$c’;
=>SELECT serialNo, model, color, autoTrans, ’dealer1’FROM CarsWHERE color=’$c’;
(Template T => Source query S)
If we have the option of specifying n attributes, there will be a total of 2^n templates.
Wrapper Generators
The wrapper generator creates a table that holds the various query patterns contained in the templates, and the source queries that are associated with each.
A driver is used in each wrapper; the task of the driver is to:
Accept a query from the mediator. Search the table for a template that
matches the query. The source query is sent to the source,
again using a “plug-in” communication mechanism.
The response is processed by the wrapper.
Filter
Have the wrapper filter results locally in order to support more queries.
Example 2
Suppose the wrapper is designed with a more complicated template, whose queries specify both model and color.
SELECT *
FROM AutosMed
WHERE model = ’$m’ AND color = ’$c’;
=>
SELECT serialNo, model, color, autoTrans, ’dealer1’
FROM Cars
WHERE model = ’$m’ AND color=’$c’;
Now suppose the only template we have binds color, but the wrapper is asked by the mediator to find a “blue Gobi model car.”
Solution:
1. Use the template with $c = 'blue' to find all blue cars and store them in a temporary relation:
TempAutos(serialNo, model, color, autoTrans, dealer)
2. The wrapper then returns to the mediator the desired set of automobiles by executing the local query:
SELECT *
FROM TempAutos
WHERE model = 'Gobi';
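This filtering strategy can be sketched directly. The sample source data and function names below are invented; step 1 plays the role of the $c template, and step 2 is the local filter over the temporary relation.

```python
# Sketch of a filtering wrapper: the only supported template binds color,
# so to answer "blue Gobi" the wrapper first fetches all blue cars into a
# temporary relation, then filters on model locally. Data is invented.

SOURCE_CARS = [  # Cars(serialNo, model, color)
    (1, "Gobi", "blue"), (2, "Aardvark", "blue"), (3, "Gobi", "red"),
]

def color_template(c):
    """The one source query we can issue: SELECT ... WHERE color = $c."""
    return [row for row in SOURCE_CARS if row[2] == c]

def wrapper_query(model, color):
    temp_autos = color_template(color)   # step 1: use the $c template
    return [row for row in temp_autos    # step 2: filter locally on model
            if row[1] == model]

print(wrapper_query("Gobi", "blue"))  # [(1, 'Gobi', 'blue')]
```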
SECTIONS 21.4 – 21.5Sanuja Dabade & Eilbroun Benjamin CS 257 – Dr. TY Lin
INFORMATION INTEGRATION
Presentation Outline
21.4 Capability Based Optimization 21.4.1The Problem of Limited Source
Capabilities 21.4.2 A notation for Describing Source
Capabilities 21.4.3 Capability-Based Query-Plan
Selection 21.4.4 Adding Cost-Based Optimization
21.5 Optimizing Mediator Queries 21.5.1 Simplified Adornment Notation 21.5.2 Obtaining Answers for Subgoals 21.5.3 The Chain Algorithm 21.5.4 Incorporating Union Views at the
Mediator
21.4 Capability Based Optimization Introduction
A typical DBMS estimates the cost of each query plan and picks what it believes to be the best.
A mediator, however, has little knowledge of how long its sources will take to answer a query.
Optimization of mediator queries therefore cannot rely on a cost measure alone to select a query plan.
Optimization by the mediator follows capability-based optimization.
21.4.1 The Problem of Limited Source Capabilities
Many sources have only Web-based interfaces, and Web sources usually allow querying only through a query form.
E.g., the Amazon.com interface allows us to query about books in many different ways, but we cannot ask questions that are too general, such as: SELECT * FROM books;
21.4.1 The Problem of Limited Source Capabilities (con’t)
Reasons why a source may limit the ways in which queries can be asked:
The earliest databases did not use relational DBMSs that support SQL queries.
Indexes on a large database may make certain queries feasible, while others are too expensive to execute.
Security reasons. E.g., a medical database may answer queries about averages, but won’t disclose details of a particular patient’s information.
21.4.2 A Notation for Describing Source Capabilities
For relational data, the legal forms of queries are described by adornments.
Adornments: sequences of codes that represent the requirements for the attributes of the relation, in their standard order:
f (free): the attribute can be specified or not, as we choose.
b (bound): we must specify a value for the attribute, but any value is allowed.
u (unspecified): we are not permitted to specify a value for the attribute.
21.4.2 A notation for Describing Source Capabilities….(cont’d)
c[S] (choice from set S): a value must be specified, and the value must be from the finite set S.
o[S] (optional from set S): we either do not specify a value or we specify a value from the finite set S.
A prime (e.g. f’) specifies that the attribute is not part of the output of the query.
A capabilities specification is a set of adornments. A query must match one of the adornments in its capabilities specification.
21.4.2 A notation for Describing Source Capabilities….(cont’d)
E.g. Dealer 1 is a source of data in the form:Cars (serialNo, model, color, autoTrans, navi)The adornment for this query form is b’uuuu
21.4.3 Capability-Based Query-Plan Selection Given a query at the mediator, a capability
based query optimizer first considers what queries it can ask at the sources to help answer the query
The process is repeated until: Enough queries are asked at the sources to resolve
all the conditions of the mediator query and therefore query is answered. Such a plan is called feasible.
We can construct no more valid forms of source queries, yet still cannot answer the mediator query. It has been an impossible query.
21.4.3 Capability-Based Query-Plan Selection (cont’d) The simplest form of mediator query
where we need to apply the above strategy is join relations
E.g., we have sources for dealer 2: Autos(serial, model, color), Options(serial, option).
Suppose that ubf is the sole adornment for Autos, and Options has two adornments: bu and uc[autoTrans, navi].
Query is – find the serial numbers and colors of Gobi models with a navigation system
21.4.4 Adding Cost-Based Optimization
Mediator’s Query optimizer is not done when the capabilities of the sources are examined
Having found feasible plans, it must choose among them
Making an intelligent, cost based query optimization requires that the mediator knows a great deal about the costs of queries involved
Sources are independent of the mediator, so it is difficult to estimate the cost
21.5 Optimizing Mediator Queries
Chain algorithm: a greedy algorithm that finds a way to answer the query by sending a sequence of requests to its sources.
It will always find a solution, assuming at least one solution exists; the solution may not be optimal.
21.5.1 Simplified Adornment Notation A query at the mediator is limited to b
(bound) and f (free) adornments. We use the following convention for
describing adornments: nameadornments(attributes) where:
name is the name of the relation the number of adornments = the number of
attributes
21.5.2 Obtaining Answers for Subgoals
Rules for subgoals and sources: suppose we have the subgoal
R^{x1x2…xn}(a1, a2, …, an),
and the source adornment for R is y1y2…yn. The adornment on the subgoal matches the adornment at the source if:
If yi is b or c[S], then xi must be b.
If xi is f, then yi must not be output-restricted (primed).
Positions where yi is f, u, or o[S] allow xi to be either b or f.
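This matching rule can be sketched as a function. The string encoding of adornments used here (a trailing apostrophe marking an output-restricted position) is an assumption made for illustration, not the book's notation.

```python
# Sketch of the adornment-matching rule. A source adornment is a list of
# codes like "b", "f", "u", "o[S]", "c[S]", optionally suffixed with "'"
# to mark an output-restricted position; a subgoal adornment uses only
# "b" and "f". (This string encoding is invented for the sketch.)

def matches(subgoal, source):
    """True if the subgoal adornment can be answered by the source."""
    for x, y in zip(subgoal, source):
        code = y.rstrip("'")
        restricted = y.endswith("'")
        # If the source demands a binding (b or c[S]), the subgoal must supply one.
        if code == "b" or code.startswith("c["):
            if x != "b":
                return False
        # A free position in the subgoal must actually be output by the source.
        if x == "f" and restricted:
            return False
    return True

# Dealer 1's Cars source with adornment b'uuuu (serialNo bound, not output):
source = ["b'", "u", "u", "u", "u"]
print(matches(["b", "f", "f", "f", "f"], source))  # True
print(matches(["f", "f", "f", "f", "f"], source))  # False: serialNo must be bound
```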
21.5.3 The Chain Algorithm
Maintains 2 types of information: An adornment for each subgoal. A relation X that is the join of the relations for
all the subgoals that have been resolved. Initially, the adornment for a subgoal is b
iff the mediator query provides a constant binding for the corresponding argument of that subgoal.
Initially, X is a relation over no attributes, containing just an empty tuple.
21.5.3 The Chain Algorithm (con’t) First, initialize adornments of subgoals
and X. Then, repeatedly select a subgoal that
can be resolved. Let Rα(a1, a2, …, an) be the subgoal:
1. Wherever α has a b, we shall find the argument in R is a constant, or a variable in the schema of R. Project X onto its variables that appear in R.
21.5.3 The Chain Algorithm (con’t)2. For each tuple t in the project of X,
issue a query to the source as follows (β is a source adornment).
If a component of β is b, then the corresponding component of α is b, and we can use the corresponding component of t for source query.
If a component of β is f, and the corresponding component of α is b, provide a constant value for source query.
21.5.3 The Chain Algorithm (con’t)
If a component of β is u, then provide no binding for this component in the source query.
If a component of β is o[S], and the corresponding component of α is f, then treat it as if it was a f.
If a component of β is o[S], and the corresponding component of α is b, then treat it as if it was c[S].
3. Every variable among a1, a2, …, an is now bound. For each remaining unresolved subgoal, change its adornment so any position holding one of these variables is b.
21.5.3 The Chain Algorithm (con’t)
4. Replace X with X ⋈ πS(R), where S is all of the variables among a1, a2, …, an.
5. Project out of X all components that correspond to variables that do not appear in the head or in any unresolved subgoal.
If every subgoal is resolved, then X is the answer.
If every subgoal is not resolved, then the algorithm fails.
21.5.3 The Chain Algorithm Example Mediator query:
Q: Answer(c) ← Rbf(1,a) AND Sff(a,b) AND Tff(b,c)
Example data and source adornments:
R(w, x), adornment bf:
  (1, 2)
  (1, 3)
  (1, 4)
S(x, y), adornment c'[2,3,5]f:
  (2, 4)
  (3, 5)
T(y, z), adornment bu:
  (4, 6)
  (5, 7)
  (5, 8)
21.5.3 The Chain Algorithm Example (con’t) Initially, the adornments on the subgoals are
the same as Q, and X contains an empty tuple. S and T cannot be resolved because they each
have ff adornments, but the sources have either a b or c.
R(1,a) can be resolved because its adornments are matched by the source’s adornments.
Send R(w,x) with w=1 to get the tables on the previous page.
21.5.3 The Chain Algorithm Example (con’t)
Project the subgoal’s relation onto its second component, since only the second component of R(1, a) is a variable.
This is joined with X, resulting in X equaling this relation:
a:
  2
  3
  4
Change the adornment on S from ff to bf.
21.5.3 The Chain Algorithm Example (con’t)
Now we resolve Sbf(a, b):
Project X onto a (X is unchanged). Then search S for tuples whose first component matches a value of a in X:
(a, b):
  (2, 4)
  (3, 5)
Join this relation with X, and remove a because it appears neither in the head nor in any unresolved subgoal:
b:
  4
  5
21.5.3 The Chain Algorithm Example (con’t)
Now we resolve Tbf(b, c):
(b, c):
  (4, 6)
  (5, 7)
  (5, 8)
Join this relation with X and project onto the c attribute to get the relation for the head.
The solution is {(6), (7), (8)}.
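The whole worked example can be replayed in a few lines, simulating each source query by filtering the local lists of tuples shown above.

```python
# The chain-algorithm example, replayed in code. Each "source query" is
# simulated by filtering a local list of tuples (the R, S, T data above).

R = [(1, 2), (1, 3), (1, 4)]   # R(w, x), adornment bf
S = [(2, 4), (3, 5)]           # S(x, y), adornment c'[2,3,5]f
T = [(4, 6), (5, 7), (5, 8)]   # T(y, z), adornment bu

# Resolve R(1, a): bind w = 1, project onto a.
X = sorted({x for (w, x) in R if w == 1})             # [2, 3, 4]

# Resolve S(a, b): for each binding of a, query S; keep b.
X = sorted({y for a in X for (x, y) in S if x == a})  # [4, 5]

# Resolve T(b, c): for each binding of b, query T; keep c -- the answer.
answer = sorted({z for b in X for (y, z) in T if y == b})
print(answer)  # [6, 7, 8]
```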
21.5.4 Incorporating Union Views at the Mediator This implementation of the Chain
Algorithm does not consider that several sources can contribute tuples to a relation.
If specific sources have tuples to contribute that other sources may not have, it adds complexity.
To resolve this, we can consult all sources, or make best efforts to return all the answers.
21.5.4 Incorporating Union Views at the Mediator (con’t) Consulting All Sources
We can only resolve a subgoal when each source for its relation has an adornment matched by the current adornment of the subgoal.
Less practical because it makes queries harder to answer and impossible if any source is down.
Best Efforts We need only 1 source with a matching
adornment to resolve a subgoal. Need to modify chain algorithm to revisit each
subgoal when that subgoal has new bound requirements.
LOCAL-AS-VIEW MEDIATORS
Priya Gangaraju(Class Id:203)
Local-as-View Mediators.
In a LAV mediator, the global predicates are not defined as views of the source data.
Instead, for each source, expressions are defined involving the global predicates that describe the tuples that the source is able to produce.
Queries are answered at the mediator by discovering all possible ways to construct the query using the views provided by the sources.
Motivation for LAV Mediators
Sometimes the relationship between what the mediator should provide and what the sources provide is more subtle.
For example, consider the predicate Par(c, p) meaning that p is a parent of c which represents the set of all child parent facts that could ever exist.
The sources will provide information about whatever child-parent facts they know.
Motivation(contd..)
There can be sources which provide child-grandparent facts but no child-parent facts at all.
Such a source could never be used to answer a child-parent query under a GAV mediator.
LAV mediators allow us to say that a certain source provides grandparent facts.
They help discover how and when to use that source in a given query.
Terminology for LAV Mediation. The queries at the mediator and the
queries that describe the source will be single Datalog rules.
A query that is a single Datalog rule is often called a conjunctive query.
The global predicates of the LAV mediator are used as the subgoals of mediator queries.
There are conjunctive queries that define views.
Contd..
Their heads each have a unique view predicate that is the name of a view.
Each view definition has a body consisting of global predicates and is associated with a particular source.
It is assumed that each view can be constructed with an all-free adornment.
Example..
Consider global predicate Par(c, p) meaning that p is a parent of c.
One source produces parent facts. Its view is defined by the conjunctive query:
V1(c, p) ← Par(c, p)
Another source produces some grandparent facts. Its conjunctive query is:
V2(c, g) ← Par(c, p) AND Par(p, g)
Example contd..
The query at the mediator asks for great-grandparent facts that can be obtained from the sources. The mediator query is:
Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
One solution uses the parent predicate (V1) directly three times:
Q(w, z) ← V1(w, x) AND V1(x, y) AND V1(y, z)
Example contd..
Another solution uses V1 (parent facts) and V2 (grandparent facts):
Q(w, z) ← V1(w, x) AND V2(x, z)
or Q(w, z) ← V2(w, y) AND V1(y, z)
Expanding Solutions.
Consider a query Q, a solution S that has a body whose subgoals are views and each view V is defined by a conjunctive query with that view as the head.
The body of V’s conjunctive query can be substituted for a subgoal in S that uses the predicate V to have a body consisting of only global predicates.
Expansion Algorithm
A solution S has a subgoal V(a1, a2, …, an), where the ai’s can be any variables or constants.
The view V can be of the form
V(b1, b2, …, bn) ← B
where B represents the entire body.
V(a1, a2, …, an) can be replaced in solution S by a version of body B that has all the subgoals of B, with variables possibly altered.
Expansion Algorithm contd..
The rules for altering the variables of B are:
1. First identify the local variables of B: variables that appear in the body but not in the head.
2. If any local variable of B appears in S, or has already been used in a previous substitution, replace it by a distinct new variable that appears nowhere in the rule for V or in S.
3. In the body B, replace each bi by ai, for i = 1, 2, …, n.
Example.
Consider the view definitions:
V1(c, p) ← Par(c, p)
V2(c, g) ← Par(c, p) AND Par(p, g)
One of the proposed solutions S is Q(w, z) ← V1(w, x) AND V2(x, z).
The first subgoal, with predicate V1, can be expanded as Par(w, x), as there are no local variables.
Example Contd.
The V2 subgoal has a local variable p, which neither appears in S nor has been used as a local variable in another substitution, so p can be left as it is.
Only x and z are to be substituted for variables c and g.
The solution S now becomes:
Q(w, z) ← Par(w, x) AND Par(x, p) AND Par(p, z)
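The expansion algorithm can be sketched symbolically. The tuple-based encoding of atoms and views below is invented for illustration; the sketch reproduces the expansion of Q(w, z) ← V1(w, x) AND V2(x, z) shown above.

```python
# Symbolic sketch of the expansion algorithm: replace each view subgoal in
# a solution by the view's body, substituting head variables and renaming
# local variables that would clash. Atoms are (predicate, args) tuples.

VIEWS = {
    "V1": (("c", "p"), [("Par", ("c", "p"))]),
    "V2": (("c", "g"), [("Par", ("c", "p")), ("Par", ("p", "g"))]),
}

def expand(solution):
    # Names already in use: the solution's own variables, plus any fresh
    # names introduced by earlier substitutions.
    used = {v for _, args in solution for v in args}
    result = []
    for pred, args in solution:
        head_vars, body = VIEWS[pred]
        sub = dict(zip(head_vars, args))   # rule 3: map each bi to ai
        for _, atom_args in body:
            for v in atom_args:
                if v not in sub:           # rules 1-2: rename clashing locals
                    fresh = v
                    while fresh in used:
                        fresh += "'"
                    sub[v] = fresh
                    used.add(fresh)
        result += [(p, tuple(sub[v] for v in a)) for p, a in body]
    return result

# Expand Q(w, z) <- V1(w, x) AND V2(x, z):
print(expand([("V1", ("w", "x")), ("V2", ("x", "z"))]))
# [('Par', ('w', 'x')), ('Par', ('x', 'p')), ('Par', ('p', 'z'))]
```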
Containment of Conjunctive Queries
A containment mapping from Q to E is a function τ from the variables of Q to the variables and constants of E, such that:
1. If x is the ith argument of the head of Q, then τ(x) is the ith argument of the head of E.
2. If P(x1, x2, …, xn) is a subgoal of Q, then P(τ(x1), τ(x2), …, τ(xn)) is a subgoal of E.
(We add to τ the rule that τ(c) = c for any constant c.)
Example.
Consider two conjunctive queries:
Q1: H(x, y) ← A(x, z) AND B(z, y)
Q2: H(a, b) ← A(a, c) AND B(d, b) AND A(a, d)
When we apply the substitution τ(x) = a, τ(y) = b, τ(z) = d, the head of Q1
becomes H(a, b) which is the head of Q2.
So there is a containment mapping from Q1 to Q2.
Example contd..
The first subgoal of Q1 becomes A(a, d) which is the third subgoal of Q2.
The second subgoal of Q1 becomes the second subgoal of Q2.
There is also a containment mapping from Q2 to Q1 so the two conjunctive queries are equivalent.
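The containment-mapping test is mechanical, so the example can be checked in code. The encoding of conjunctive queries as (head, body) pairs is an assumption made for the sketch.

```python
# Checking the containment mapping τ from Q1 to Q2 mechanically. A
# conjunctive query is encoded as (head_args, body); this representation
# is invented for illustration.

Q1 = (("x", "y"), [("A", ("x", "z")), ("B", ("z", "y"))])
Q2 = (("a", "b"), [("A", ("a", "c")), ("B", ("d", "b")), ("A", ("a", "d"))])

def is_containment_mapping(tau, Q, E):
    (qh, qb), (eh, eb) = Q, E
    # Condition 1: head arguments must map position by position.
    if tuple(tau[v] for v in qh) != tuple(eh):
        return False
    # Condition 2: every subgoal of Q must map onto some subgoal of E.
    return all((p, tuple(tau[v] for v in args)) in eb for p, args in qb)

tau = {"x": "a", "y": "b", "z": "d"}
print(is_containment_mapping(tau, Q1, Q2))  # True
```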
Why the Containment-Mapping Test Works
Suppose there is a containment mapping τ from Q1 to Q2.
When Q2 is applied to the database, we look for substitutions σ for all the variables of Q2.
The substitution for the head becomes a tuple t that is returned by Q2.
If we compose τ and then σ, we have a mapping from the variables of Q1 to tuples of the database that produces the same tuple t for the head of Q1.
Finding Solutions to a Mediator Query
There can be an infinite number of solutions built from the views, using any number of subgoals and variables.
The LMSS theorem limits the search: if a query Q has n subgoals, then any answer produced by any solution is also produced by a solution that has at most n subgoals.
Also, if the conjunctive query that defines a view V has in its body a predicate P that doesn’t appear in the body of the mediator query, then we need not consider any solution that uses V.
Example.
Recall the query
Q1: Q(w, z) ← Par(w, x) AND Par(x, y) AND Par(y, z)
This query has three subgoals, so we
don’t have to look at solutions with more than three subgoals.
Why the LMSS Theorem Holds Suppose we have a query Q with n
subgoals and there is a solution S with more than n subgoals.
The expansion E of S must be contained in Query Q, which means that there is a containment mapping from Q to E.
We remove from S all subgoals whose expansion was not the target of one of Q’s subgoals under the containment mapping.
Contd..
We would have a new conjunctive query S’ with at most n subgoals.
If E’ is the expansion of S’ then, E’ is a subset of Q.
S is a subset of S’ as there is an identity mapping.
Thus S need not be among the solutions to query Q.
The Query Compiler
16.1 Parsing and Preprocessing
Meghna Jain(205)Dr. T. Y. Lin
Presentation Outline
16.1 Parsing and Preprocessing
16.1.1 Syntax Analysis and Parse Tree
16.1.2 A Grammar for Simple Subset of SQL
16.1.3 The Preprocessor
16.1.4 Processing Queries Involving Views
Query compilation is divided into three steps:
1. Parsing: parse the SQL query into a parse tree.
2. Logical query plan: transform the parse tree into an expression tree of relational algebra.
3. Physical query plan: transform the logical query plan into a physical query plan, which specifies:
. the operations performed
. the order of operations
. the algorithm used for each operation
. the way in which stored data is obtained and passed from one operation to another
Parser
Preprocessor
Logical Query plan generator
Query rewrite
Preferred logical query plan
Query
Form a query to a logical query plan
Syntax Analysis and Parse Tree
The parser takes the SQL query and converts it into a parse tree. Nodes of the parse tree:
1. Atoms: lexical elements such as keywords, constants, parentheses, operators, and other schema elements.
2. Syntactic categories: names for subparts that play a similar role in a query, such as <Query>, <Condition>.
Grammar for a Simple Subset of SQL
<Query> ::= <SFW>
<Query> ::= ( <Query> )
<SFW> ::= SELECT <SelList> FROM <FromList> WHERE <Condition>
<SelList> ::= <Attribute>, <SelList>
<SelList> ::= <Attribute>
<FromList> ::= <Relation>, <FromList>
<FromList> ::= <Relation>
<Condition> ::= <Condition> AND <Condition>
<Condition> ::= <Tuple> IN <Query>
<Condition> ::= <Attribute> = <Attribute>
<Condition> ::= <Attribute> LIKE <Pattern>
<Tuple> ::= <Attribute>
Atoms (constants), <syntactic categories> (variables), ::= (can be expressed/defined as)
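To see how the syntactic categories become tree nodes, here is a toy parser for a fragment of this grammar: one SELECT list, one FROM list, and a single equality condition, with no nesting. It is an illustrative sketch, not the book's parser.

```python
# Toy parser for 'SELECT <SelList> FROM <FromList> WHERE <Attr> = <Attr>'.
# Each key in the result corresponds to a syntactic category of the grammar.

def parse_sfw(sql):
    head, cond = sql.split(" WHERE ")           # <SFW> ... WHERE <Condition>
    sel_part, from_part = head.split(" FROM ")  # SELECT <SelList> FROM <FromList>
    lhs, rhs = (s.strip() for s in cond.split("="))
    return {
        "SelList": [a.strip() for a in sel_part[len("SELECT "):].split(",")],
        "FromList": [r.strip() for r in from_part.split(",")],
        "Condition": ("=", lhs, rhs),
    }

tree = parse_sfw("SELECT title FROM StarsIn, MovieStar WHERE starName = name")
print(tree["SelList"])    # ['title']
print(tree["FromList"])   # ['StarsIn', 'MovieStar']
print(tree["Condition"])  # ('=', 'starName', 'name')
```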
Query and Parse Tree
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query: Give titles of movies that have at least one star
born in 1960
SELECT title FROM StarsIn WHERE starName IN (
SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%'
);
Another query equivalent
SELECT title FROM StarsIn, MovieStarWHERE starName = name AND birthdate LIKE '%1960%' ;
Parse Tree<Query>
<SFW>
SELECT <SelList> FROM <FromList> WHERE <Condition>
<Attribute> <RelName> , <FromList> AND
title StarsIn <RelName>
<Condition> <Condition>
<Attribute> = <Attribute> <Attribute> LIKE <Pattern>
starName name birthdate ‘%1960%’
MovieStar <Query>
The Preprocessor
Functions of the Preprocessor:
. If a relation used in the query is a virtual view, then each use of this relation in the from-list must be replaced by a parse tree that describes the view.
. It is also responsible for semantic checking:
1. Checks relation uses: every relation mentioned in the FROM clause must be a relation or a view in the current schema.
2. Checks and resolves attribute uses: every attribute mentioned in the SELECT or WHERE clause must be an attribute of some relation in the current scope.
3. Checks types: all attributes must be of a type appropriate to their uses.
StarsIn(title,year,starName)
MovieStar(name,address,gender,birthdate)
Query: Give titles of movies that have at least one star
born in 1960
SELECT title FROM StarsIn WHERE starName IN (
SELECT name FROM MovieStar WHERE birthdate LIKE '%1960%'
);
Preprocessing Queries Involving Views
When an operand in a query is a virtual view, the preprocessor needs to replace the operand by a piece of parse tree that represents how the view is constructed from base table.
Base Table: Movies(title, year, length, genre, studioName, producerC#)
View definition : CREATE VIEW ParamountMovies AS
SELECT title, year FROM movies
WHERE studioName = 'Paramount';
Example based on view:
SELECT title FROM ParamountMovies WHERE year = 1979;
16.2 ALGEBRAIC LAWS FOR IMPROVING QUERY PLANS
Ramya Karri
ID: 206
Optimizing the Logical Query Plan The translation rules converting a parse tree to a
logical query tree do not always produce the best logical query tree.
It is often possible to optimize the logical query tree by applying relational algebra laws to convert the original tree into a more efficient logical query tree.
Optimizing a logical query tree using relational algebra laws is called heuristic optimization
Relational Algebra Laws
These laws often involve the properties of:
commutativity - operator can be applied to operands independent of order.
E.g. A + B = B + A - The “+” operator is commutative.
associativity - operator is independent of operand grouping.
E.g. A + (B + C) = (A + B) + C - The “+” operator is associative.
Associative and Commutative Operators
The relational algebra operators of cross-product (×), join (⋈), union, and intersection are all associative and commutative.
Commutative
R X S = S X R
R ⋈ S = S ⋈ R
R ∪ S = S ∪ R
R ∩ S = S ∩ R
Associative
(R × S) × T = R × (S × T)
(R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
(R ∪ S) ∪ T = R ∪ (S ∪ T)
(R ∩ S) ∩ T = R ∩ (S ∩ T)
Laws Involving Selection
Complex selections involving AND or OR can be broken into two or more selections: (splitting laws)
σC1 AND C2 (R) = σC1(σC2(R))
σC1 OR C2 (R) = σC1(R) ∪S σC2(R), where ∪S denotes set union
Example: R = {a,a,b,b,b,c}; p1 is satisfied by a and b, p2 by b and c.
σp1 OR p2 (R) = {a,a,b,b,b,c}
σp1(R) = {a,a,b,b,b}
σp2(R) = {b,b,b,c}
σp1(R) ∪S σp2(R) = {a,a,b,b,b,c}
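The splitting laws and the caveat about OR can be checked on the example bag. Here Counter's `|` (maximum of multiplicities) models the union used in the example, while `+` is bag union; this encoding is an illustration, not the book's notation.

```python
from collections import Counter

# The bag R = {a,a,b,b,b,c} from the example; p1 is satisfied by a and b,
# p2 by b and c.
R = ["a", "a", "b", "b", "b", "c"]
p1 = lambda t: t in ("a", "b")
p2 = lambda t: t in ("b", "c")

# Splitting an AND into a cascade of selections preserves the bag:
and_direct = [t for t in R if p1(t) and p2(t)]
inner = [t for t in R if p2(t)]               # σ_p2(R)
and_cascade = [t for t in inner if p1(t)]     # σ_p1(σ_p2(R))
print(and_direct == and_cascade)  # True

# For OR, the two selections must be combined without double-counting:
or_direct = Counter(t for t in R if p1(t) or p2(t))
sel1 = Counter(t for t in R if p1(t))
sel2 = Counter(t for t in R if p2(t))
print(or_direct == (sel1 | sel2))  # True: max of multiplicities
print(or_direct == (sel1 + sel2))  # False: bag union counts the b's twice
```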
Laws Involving Selection (Contd..)
Selection is pushed through both arguments for union:
σC(R ∪ S) = σC(R) ∪ σC(S)
Selection is pushed to the first argument and optionally the second for difference:
σC(R - S) = σC(R) - S
σC(R - S) = σC(R) - σC(S)
Laws Involving Selection (Contd..)
All other operators require selection to be pushed to only one of the arguments. For joins, we may not be able to push the selection to both arguments, since an argument may lack the attributes the selection requires.
σC(R × S) = σC(R) × S
σC(R ∩ S) = σC(R) ∩ S
σC(R ⋈ S) = σC(R) ⋈ S
σC(R ⋈D S) = σC(R) ⋈D S
Laws Involving Selection (Contd..)
Example: Consider relations R(a,b) and S(b,c) and the expression σ(a=1 OR a=3) AND b<c(R ⋈ S).
= σa=1 OR a=3(σb<c(R ⋈ S))
= σa=1 OR a=3(R ⋈ σb<c(S))
= σa=1 OR a=3(R) ⋈ σb<c(S)
Laws Involving Projection
Like selections, it is also possible to push projections down the logical query tree. However, the performance gain is smaller than for selections, because projections only reduce the number of attributes (the size of each tuple) rather than the number of tuples.
Laws Involving Projection
Laws for pushing projections with joins:
πL(R × S) = πL(πM(R) × πN(S))
πL(R ⋈ S) = πL(πM(R) ⋈ πN(S))
πL(R ⋈D S) = πL(πM(R) ⋈D πN(S))
where M and N are the attributes of L coming from R and S respectively, together with any attributes needed by the join condition.
Laws Involving Projection Laws for pushing projections with set operations.
Projection can be performed entirely before union.
πL(R ∪B S) = πL(R) ∪B πL(S)
Projection can be pushed below selection as long as we also keep all attributes needed for the selection (M = L ∪ attr(C)).
πL(σC(R)) = πL(σC(πM(R)))
Laws Involving Join
We have previously seen these important rules about joins:
1. Joins are commutative and associative.
2. Selection can be distributed into joins.
3. Projection can be distributed into joins.
Laws Involving Duplicate Elimination
The duplicate elimination operator (δ) can be pushed through many operators.
Example: if R has two copies of a tuple t and S has one copy of t, then
δ(R ∪ S) has one copy of t, but δ(R) ∪ δ(S) has two copies of t (bag union), so δ cannot be pushed through bag union.
Laws Involving Duplicate Elimination Laws for pushing duplicate elimination
operator (δ):
δ(R × S) = δ(R) × δ(S)
δ(R ⋈ S) = δ(R) ⋈ δ(S)
δ(R ⋈D S) = δ(R) ⋈D δ(S)
δ(σC(R)) = σC(δ(R))
Laws Involving Duplicate Elimination The duplicate elimination operator (δ)
can also be pushed through bag intersection, but not across union, difference, or projection in general.
δ(R ∩ S) = δ(R) ∩ δ(S)
Laws Involving Grouping
The grouping operator (γ) laws depend on the aggregate operators used.
There is one general rule, however, that grouping subsumes duplicate elimination:
δ(γL(R)) = γL(R)
The reason is that γL(R) never produces duplicate tuples, so a subsequent δ is redundant. Note that δ can be pushed inside γ (γL(δ(R)) = γL(R)) only when the aggregates are insensitive to duplicates (MIN and MAX), not when duplicates affect the result (SUM, COUNT, and AVG).
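Both facts can be demonstrated on a tiny bag. A hedged sketch, with invented data and a simplified `gamma` that groups on the first attribute and aggregates the second:

```python
# R(a, b) as a bag of tuples, with one duplicated row
R = [('x', 5), ('x', 5), ('y', 7)]

def gamma(rel, agg):
    """Group on attribute a; aggregate the b-values of each group with agg."""
    groups = {}
    for a, b in rel:
        groups.setdefault(a, []).append(b)
    return sorted((a, agg(bs)) for a, bs in groups.items())

def delta(rel):
    """Duplicate elimination: keep one copy of each tuple."""
    return sorted(set(rel))

g = gamma(R, sum)
assert sorted(set(g)) == g                    # delta(gamma(R)) = gamma(R)
assert gamma(delta(R), min) == gamma(R, min)  # MIN is duplicate-insensitive
assert gamma(delta(R), sum) != gamma(R, sum)  # SUM is not: 5 vs 10 for group 'x'
```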
THE QUERY COMPILER
Section 16.3
DATABASE SYSTEMS – The Complete Book
Presented by: Deepti Kundu
Under the supervision of: Dr. T.Y. Lin
Topics to be covered
From Parse Trees to Logical Query Plans
Conversion to Relational Algebra
Removing Subqueries From Conditions
Improving the Logical Query Plan
Grouping Associative/Commutative Operators
16.3 From Parse to Logical Query Plans ►
Review
Query → Parser → Preprocessor (Section 16.1) → Logical query plan generator → Query rewriter (Section 16.3) → Preferred logical query plan
Two steps to turn a parse tree into a preferred logical query plan
1. Replace the nodes and structures of the parse tree, in appropriate groups, by an operator or operators of relational algebra.
2. Take the relational-algebra expression and turn it into an expression that we expect can be converted to the most efficient physical query plan.
Reference Relations
StarsIn (movieTitle, movieYear, starName) MovieStar (name, address, gender, birthdate)
Conversion to Relational Algebra
If we have a <Query> with a <Condition> that has no subqueries, then we may replace the entire construct – the select-list, from-list, and condition – by a relational-algebra expression.
The relational-algebra expression consists of the following from bottom to top: The products of all the relations mentioned in the
<FromList>, which Is the argument of: A selection σC, where C is the <Condition> expression in
the construct being replaced, which in turn is the argument of:
A projection πL , where L is the list of attributes in the <SelList>
A query: Example
SELECT movieTitle
FROM Starsin, MovieStar
WHERE starName = name AND
birthdate LIKE ‘%1960’;
SELECT movieTitle
FROM StarsIn, MovieStar
WHERE starName = name AND
      birthdate LIKE '%1960';
Translation to an algebraic expression tree: πmovieTitle(σstarName=name AND birthdate LIKE '%1960'(StarsIn × MovieStar))
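The bottom-to-top rule (product of the FROM-list relations, then σC, then πL) can be traced on tiny invented instances of the reference relations. A sketch only: the sample rows are made up, and `LIKE '%1960'` is modeled with a suffix test.

```python
from itertools import product

StarsIn = [  # (movieTitle, movieYear, starName) -- invented sample rows
    ('Alien', 1979, 'Jane Doe'),
    ('Big', 1988, 'Tom Hanks'),
]
MovieStar = [  # (name, address, gender, birthdate) -- invented sample rows
    ('Jane Doe', '12 Elm St', 'F', '1/1/1960'),
    ('Tom Hanks', '34 Oak Rd', 'M', '7/9/1956'),
]

# Bottom to top: product of the FROM-list relations ...
cross = [si + ms for si, ms in product(StarsIn, MovieStar)]
# ... then sigma_C, C = (starName = name AND birthdate LIKE '%1960') ...
selected = [t for t in cross if t[2] == t[3] and t[6].endswith('1960')]
# ... then pi_L, L = (movieTitle)
answer = [t[0] for t in selected]
assert answer == ['Alien']
```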
Removing Subqueries From Conditions
For parse trees with a <Condition> that has a subquery
Intermediate operator – two argument selection It is intermediate in between the syntactic
categories of the parse tree and the relational-algebra operators that apply to relations.
Using a two-argument σ
The tree: πmovieTitle over a two-argument σ; its left child is StarsIn, and its right child is the <Condition>: the <Tuple> (the <Attribute> starName) IN πname(σbirthdate LIKE '%1960'(MovieStar)).
Two-argument selection with condition involving IN
Suppose the two arguments are some relation R and a <Condition> of the form t IN S, where t is a tuple composed of some attributes of R, and S is an uncorrelated subquery.
Steps to be followed:
1. Replace the <Condition> by the tree that is the expression for S (δ is used to remove duplicates).
2. Replace the two-argument selection by a one-argument selection σC, where C equates t to the corresponding attributes of the relation produced for S.
3. Give σC an argument that is the product of R and δ(S).
Two-argument selection with condition involving IN
Before: a two-argument σ with children R and the <Condition> t IN S.
After: σC applied to the product R × δ(S).
The effect
Improving the Logical Query Plan
Algebraic laws to improve logical query plans:
Selections can be pushed down the expression tree as far as they can go.
Similarly, projections can be pushed down the tree, or new projections can be added.
Duplicate eliminations can sometimes be removed, or moved to a more convenient position in the tree.
Certain selections can be combined with a product below to turn the pair of operations into an equijoin.
Grouping Associative/Commutative Operators
An operator that is associative and commutative may be thought of as having any number of operands.
We need to reorder these operands so that the multiway join is executed as a sequence of binary joins; it may be more time-consuming to execute them in the order suggested by the parse tree.
For each portion of the subtree that consists of nodes with the same associative and commutative operator (natural join, union, or intersection), we group those nodes into a single node with many children.
The effect of query rewriting
πmovieTitle(StarsIn ⋈starName=name σbirthdate LIKE '%1960'(MovieStar))
Final step in producing the logical query plan
A cascade of binary unions over R, S, T, U, V, W is regrouped into a single union node with children R, S, T, U, V, W.
An example to summarize
“find movies where the average age of the stars was at most 40 when the movie was made”
SELECT DISTINCT m1.movieTitle, m1.movieYear
FROM StarsIn m1
WHERE m1.movieYear - 40 <= (
    SELECT AVG(birthdate)
    FROM StarsIn m2, MovieStar s
    WHERE m2.starName = s.name AND
          m1.movieTitle = m2.movieTitle AND
          m1.movieYear = m2.movieYear
);
Selections combined with a product to turn the pair of operations into an equijoin…
Condition pushed up the expression tree…
Selections combined…
THE QUERY COMPILER
(16.4)
DATABASE SYSTEMS – The Complete Book
Presented by: Maciej Kicinski
Under the supervision of: Dr. T.Y. Lin
Topics to be covered
Estimating the Cost of Operations
Estimating Sizes of Intermediate Relations
Estimating the Size of a Projection
Estimating the Size of a Selection
Estimating the Size of a Join
Estimating Sizes for Other Operations
16.4 Estimating the Cost of Operations ►
Estimating the Cost of Operations
After getting to the logical query plan, we turn it into a physical plan.
We consider all the possible physical plans and estimate their costs; this evaluation is known as cost-based enumeration.
The plan with the least estimated cost is the one passed to the query-execution engine.
Selections for each physical plan
For each physical plan we select:
An order and grouping for associative-and-commutative operations like joins, unions, and intersections.
An algorithm for each operator in the logical plan, for instance, deciding whether a nested-loop join or a hash join should be used.
Additional operators (scanning, sorting, etc.) that are needed for the physical plan but were not present explicitly in the logical plan.
The way in which arguments are passed from one operator to the next.
Estimating Sizes of Intermediate Relations
Rules for estimating the sizes of intermediate relations should:
1. Give accurate estimates.
2. Be easy to compute.
3. Be logically consistent; that is, the size estimate for an intermediate relation should not depend on how that relation is computed.
Estimating the Size of a Projection
We should treat a classical, duplicate-eliminating projection as a bag projection.
The size of the result can be computed exactly.
There may be a reduction in size (due to eliminated components) or an increase in size (due to new components created as combinations of attributes).
Estimating the Size of a Selection
While performing a selection, we may reduce the number of tuples, but the size of each tuple remains the same.
For S = σA=c(R), where A is an attribute of R and c is a constant, the recommended estimate is:
T(S) = T(R) / V(R, A)
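The estimate T(S) = T(R)/V(R, A) is a one-liner; a minimal sketch with invented parameter values (chosen to match the later selection example, T(R) = 5000, V(R, x) = 100):

```python
def estimate_selection_size(T_R, V_R_A):
    """Estimated tuple count of S = sigma_{A=c}(R): T(R) / V(R, A).
    Assumes values of A are uniformly distributed over V(R, A) choices."""
    return T_R / V_R_A

# 5000 tuples, 100 distinct values of the selected attribute:
assert estimate_selection_size(5000, 100) == 50.0
```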
Estimating Sizes of Other Operations
Union, intersection, difference, duplicate elimination, grouping and aggregation.
CHOOSING AN ORDER FOR JOINS
Chapter 16.6, by: Chiu Luk, ID: 210
Introduction
This section focuses on a critical problem in cost-based optimization: selecting an order for the natural join of three or more relations.
Compared to other binary operations, joins take more time and therefore need effective optimization techniques.
Introduction
Significance of Left and Right Join Arguments
The argument relations in joins determine the cost of the join.
The left argument of the join is:
called the build relation
assumed to be smaller
stored in main memory
The right argument of the join is:
called the probe relation
read a block at a time
its tuples are matched with those of the build relation
The join algorithms that distinguish between the arguments are: one-pass join, nested-loop join, and index join.
Join Trees
Order of arguments is important for joining two relations
Left argument, since stored in main-memory, should be smaller
With two relations only two choices of join tree
With more than two relations, there are n! ways to order the arguments and therefore n! join trees, where n is the no. of relations
Join Trees
The total number of tree shapes T(n) for n relations is given by the recurrence:
T(1) = 1; T(n) = Σ i=1..n−1 T(i) · T(n − i)
giving T(1) = 1, T(2) = 1, T(3) = 2, T(4) = 5, T(5) = 14, etc.
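The recurrence above is easy to check by direct computation (these are the Catalan numbers); a short sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tree_shapes(n):
    """T(n): number of binary join-tree shapes over n leaf relations.
    T(1) = 1; T(n) = sum over left/right splits of T(i) * T(n - i)."""
    if n == 1:
        return 1
    return sum(tree_shapes(i) * tree_shapes(n - i) for i in range(1, n))

assert [tree_shapes(n) for n in range(1, 6)] == [1, 1, 2, 5, 14]
```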
Left-Deep Join Trees
Consider 4 relations. The different ways to join them are as follows:
In fig (a) all the right children are leaves. This is a left-deep tree.
In fig (c) all the left children are leaves. This is a right-deep tree.
Fig (b) is a bushy tree.
Considering only left-deep trees is advantageous when deciding join orders.
Join order selection
For A1 ⋈ A2 ⋈ … ⋈ An, restrict attention to left-deep join trees and use dynamic programming: the best plan is computed for each subset of the relations.
BestPlan(A1, …, An) = min-cost plan among:
BestPlan(A2, …, An) ⋈ A1
BestPlan(A1, A3, …, An) ⋈ A2
…
BestPlan(A1, …, An−1) ⋈ An
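The recursion above can be sketched directly over subsets of relations. This is an illustration only: the cardinalities and the fixed-selectivity cost model (result size = product of sizes divided by 10; plan cost = sum of intermediate result sizes) are invented, not the book's estimates.

```python
from functools import lru_cache

# Hypothetical cardinalities for four relations
CARD = {'A1': 100, 'A2': 50, 'A3': 200, 'A4': 10}

def join_size(left_size, right_size):
    return left_size * right_size // 10   # assumed fixed join selectivity

@lru_cache(maxsize=None)
def best_plan(rels):
    """Best left-deep plan for the frozenset rels: (cost, size, order)."""
    if len(rels) == 1:
        (r,) = rels
        return 0, CARD[r], (r,)
    best = None
    for last in rels:                     # 'last' joins in as the right argument
        c, s, order = best_plan(rels - {last})
        size = join_size(s, CARD[last])
        cand = (c + size, size, order + (last,))
        if best is None or cand[0] < best[0]:
            best = cand
    return best

cost, size, order = best_plan(frozenset(CARD))
assert cost == 10550 and size == 10000    # optimum under this toy model
```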
Dynamic Programming to Select a Join Order and Grouping
Three choices for picking an order for the join of many relations:
Consider all of the orders.
Consider a subset.
Use a heuristic to pick one.
Dynamic programming is used either to consider all orders or a subset: construct a table of costs based on relation sizes, remembering only the minimum-cost entry needed to proceed.
A Greedy Algorithm for Selecting a Join Order
It is expensive to use an exhaustive method like dynamic programming.
A better approach is to use a join-order heuristic for query optimization; a greedy algorithm is an example: make one decision at a time about the order of joins, and never backtrack on a decision once made.
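A greedy join-order heuristic can be sketched as: start with the pair whose join result is smallest, then repeatedly join in the relation that keeps the intermediate result smallest. As with the dynamic-programming sketch, the cardinalities and the divide-by-10 selectivity are invented for illustration.

```python
CARD = {'A1': 100, 'A2': 50, 'A3': 200, 'A4': 10}   # hypothetical sizes

def join_size(x, y):
    return x * y // 10                               # assumed fixed selectivity

def greedy_order(card):
    remaining = dict(card)
    # Pick the cheapest first pair ...
    pair = min(((a, b) for a in remaining for b in remaining if a < b),
               key=lambda ab: join_size(remaining[ab[0]], remaining[ab[1]]))
    order = list(pair)
    size = join_size(remaining.pop(pair[0]), remaining.pop(pair[1]))
    # ... then always extend with the relation giving the smallest next result.
    while remaining:
        nxt = min(remaining, key=lambda r: join_size(size, remaining[r]))
        size = join_size(size, remaining.pop(nxt))
        order.append(nxt)
    return order, size

order, size = greedy_order(CARD)
assert order == ['A2', 'A4', 'A1', 'A3'] and size == 10000
```

Here the greedy choice happens to match the dynamic-programming optimum, but in general greedy can be worse; it just never backtracks.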
COMPLETING THE PHYSICAL-QUERY-PLAN AND CHAPTER 16 SUMMARY (16.7-16.8)
CS257 Spring 2009, Professor Tsau Lin
Students: Suntorn Sae-Eung, Donavon Norwood
Outline
16.7 Completing the Physical-Query-Plan
I. Choosing a Selection Method
II. Choosing a Join Method
III. Pipelining Versus Materialization
IV. Pipelining Unary Operations
V. Pipelining Binary Operations
VI. Notation for Physical Query Plans
VII. Ordering the Physical Operations
16.8 Summary of Chapter 16
Before completing the Physical-Query-Plan
A query has previously been:
parsed and preprocessed (16.1)
converted to a logical query plan (16.3)
had the costs of its operations estimated (16.4)
had costs determined by cost-based plan selection (16.5)
had join costs weighed by choosing an order for joins (16.6)
16.7 Completing the Physical-Query-Plan
Three topics related to turning a logical plan into a complete physical plan:
1. Choosing physical implementations, such as selection and join methods
2. Decisions regarding intermediate results (materialized or pipelined)
3. Notation for physical-query-plan operators
I. Choosing a Selection Method (A)
Algorithms for each selection operator:
1. Can we use a created index on an attribute? If yes, use index-scan; otherwise, use table-scan.
2. After retrieving all tuples satisfying the indexed condition in (1), filter them with the remaining selection conditions.
Choosing a Selection Method(A) (cont.)
Recall: cost of query = # disk I/O's. Costs of various plans for σC(R) are estimated as follows:
1. Cost of the table-scan algorithm:
a) B(R) if R is clustered
b) T(R) if R is not clustered
2. Cost of a plan picking an equality term (e.g., a = 10) with index-scan:
a) B(R) / V(R, a) for a clustering index
b) T(R) / V(R, a) for a nonclustering index
3. Cost of a plan picking an inequality term (e.g., b < 20) with index-scan:
a) B(R) / 3 for a clustering index
b) T(R) / 3 for a nonclustering index
Example
Selection: σx=1 AND y=2 AND z<5(R)
Parameters of R(x, y, z): T(R) = 5000, B(R) = 200, V(R, x) = 100, V(R, y) = 500.
Relation R is clustered.
x and y have nonclustering indexes; only the index on z is clustering.
Example (cont.)
Selection options:
1. Table-scan, then filter on x, y, z. Cost is B(R) = 200, since R is clustered.
2. Use the index on x = 1, filter on y, z. Cost is 50, since T(R) / V(R, x) = 5000/100 = 50 tuples and the index is nonclustering.
3. Use the index on y = 2, filter on x, z. Cost is 10, since T(R) / V(R, y) = 5000/500 = 10 tuples using the nonclustering index.
4. Index-scan on the clustering index with z < 5, filter on x, y. Cost is about B(R)/3 = 67.
Example (cont.)
Costs: option 1 = 200, option 2 = 50, option 3 = 10, option 4 = 67.
The lowest cost is option 3. Therefore, the preferred physical plan:
1. retrieves all tuples with y = 2,
2. then filters on the remaining two conditions (x and z).
II. Choosing a Join Method
Determine the costs associated with each join algorithm:
1. One-pass join and nested-loop join devote enough buffers to the joining.
2. Sort-join is preferred when attributes are pre-sorted, or when two or more joins are on the same attribute, such as (R(a, b) ⋈ S(a, c)) ⋈ T(a, d), where sorting R and S on a produces the result of R ⋈ S sorted on a, usable directly in the next join.
3. Index-join is for a join with a high chance of using an index created on the join attribute, such as R(a, b) ⋈ S(b, c).
4. Hash join is the best choice for unsorted, non-indexed relations that need a multipass join.
Choosing a Join Method (cont.)
III. Pipelining Versus Materialization
Materialization (naïve way): store the (intermediate) result of each operation on disk.
Pipelining (more efficient way): interleave the execution of several operations; the tuples produced by one operation are passed directly to the operation that uses them, with intermediate results held in main-memory buffers.
IV. Pipelining Unary Operations
Unary operations, whether a-tuple-at-a-time or full-relation, are the best candidates for pipelining: selection and projection.
(Figure: tuples of R flow through an input buffer, the unary operation, and an output buffer; up to M − 1 buffers are available.)
Pipelining Unary Operations (cont.)
Pipelined unary operations are implemented by iterators.
V. Pipelining Binary Operations
Binary operations: ∪, ∩, −, ⋈, ×.
The results of binary operations can also be pipelined: use one buffer to pass the result to its consumer, one block at a time.
The extended example shows the tradeoffs and opportunities.
Example
Consider a physical query plan for the expression (R(w, x) ⋈ S(x, y)) ⋈ U(y, z).
Assumptions:
R occupies 5,000 blocks; S and U each occupy 10,000 blocks.
The intermediate result R ⋈ S occupies k blocks for some k.
Both joins will be implemented as hash joins, either one-pass or two-pass depending on k.
There are 101 buffers available.
Example (cont.)
First consider the join R ⋈ S; neither relation fits in the buffers.
We need a two-pass hash join, partitioning R into 100 buckets (the maximum possible) of 50 blocks each.
The second pass of the hash join uses 51 buffers, leaving 50 buffers for joining the result of R ⋈ S with U.
Example (cont.)
Case 1: suppose k ≤ 49; the result of R ⋈ S occupies at most 49 blocks.
Steps:
1. Pipeline R ⋈ S into 49 buffers.
2. Organize them for lookup as a hash table.
3. Use the one remaining buffer to read each block of U in turn.
4. Execute the second join as a one-pass join.
Example (cont.)
The total number of I/O's is 55,000:
45,000 for the two-pass hash join of R and S
10,000 to read U for the one-pass hash join of (R ⋈ S) ⋈ U
Example (cont.)
Case 2: suppose 49 < k ≤ 5,000. We can still pipeline, but we need another strategy: the intermediate result joins with U in a 50-bucket, two-pass hash join. Steps:
1. Before starting on R ⋈ S, hash U into 50 buckets of 200 blocks each.
2. Perform the two-pass hash join of R and S using 51 buffers as in case 1, placing the result's tuples in the 50 remaining buffers to form 50 buckets for the join of R ⋈ S with U.
3. Finally, join R ⋈ S with U, bucket by bucket.
Example (cont.)
The number of disk I/O's is:
20,000 to read U and write its tuples into buckets
45,000 for the two-pass hash join R ⋈ S
k to write out the buckets of R ⋈ S
k + 10,000 to read the buckets of R ⋈ S and U in the final join
The total cost is 75,000 + 2k.
Example (cont.)
Compare the increase in I/O's between case 1 and case 2:
k ≤ 49 (case 1): disk I/O's = 55,000
50 ≤ k ≤ 5,000 (case 2):
k = 50: I/O's = 75,000 + (2×50) = 75,100
k = 51: I/O's = 75,000 + (2×51) = 75,102
k = 52: I/O's = 75,000 + (2×52) = 75,104
Notice: the I/O count jumps discontinuously as k increases from 49 to 50.
Example (cont.)
Case 3: k > 5,000. We cannot perform the two-pass join in the 50 available buffers if the result of R ⋈ S is pipelined. Steps:
1. Compute R ⋈ S using a two-pass join and store the result on disk.
2. Join the result of (1) with U, using another two-pass join.
Example (cont.)
The number of disk I/O's is:
45,000 for the two-pass hash join of R and S
k to store R ⋈ S on disk
30,000 + 3k for the two-pass join of R ⋈ S with U
The total cost is 75,000 + 4k.
Example (cont.)
In summary, the cost of the physical plan as a function of the size k of R ⋈ S is: 55,000 I/O's for k ≤ 49; 75,000 + 2k for 49 < k ≤ 5,000; and 75,000 + 4k for k > 5,000.
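The three cases reduce to a piecewise cost function of k; a sketch that encodes the figures derived above:

```python
def join_cost(k):
    """Estimated disk I/O's for (R JOIN S) JOIN U with B(R)=5,000,
    B(S)=B(U)=10,000, 101 buffers; k = blocks of the intermediate R JOIN S."""
    if k <= 49:          # case 1: pipeline into memory, one-pass join with U
        return 45_000 + 10_000
    elif k <= 5_000:     # case 2: pre-bucketize U, 50-bucket two-pass join
        return 75_000 + 2 * k
    else:                # case 3: materialize R JOIN S, two-pass join with U
        return 75_000 + 4 * k

assert join_cost(49) == 55_000       # case 1 boundary
assert join_cost(50) == 75_100       # the jump from 55,000 to 75,100
assert join_cost(10_000) == 115_000  # case 3 example
```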
VI. Notation for Physical Query Plans
Several types of operators:
1. Operators for leaves
2. (Physical) operators for selection
3. (Physical) sort operators
4. Other relational-algebra operations
In practice, each DBMS uses its own internal notation for physical query plans.
In practice, each DBMS uses its own internal notation for physical query plan.
Notation for Physical Query Plans (cont.)
1. Operators for leaves: a leaf operand is replaced in the LQP tree by:
TableScan(R): read all blocks of R
SortScan(R, L): read R in order according to the attribute list L
IndexScan(R, C): scan the index on attribute A for the condition C, of the form Aθc
IndexScan(R, A): scan the index on attribute R.A; this behaves like TableScan but is more efficient if R is not clustered.
Notation for Physical Query Plans (cont.)
2. (Physical) operators for selection: the logical operator σC(R) is often combined with access methods.
If σC(R) is replaced by Filter(C), and there is no index on R or on an attribute of condition C, use TableScan(R) or SortScan(R, L) to access R.
If condition C is of the form Aθc AND D for some condition D, and there is an index on R.A, then we may use the operator IndexScan(R, Aθc) to access R, and Filter(D) in place of the selection σC(R).
Notation for Physical Query Plans (cont.)
3. (Physical) sort operators: sorting can occur at any point in the physical plan, using the notation SortScan(R, L).
It is common to use an explicit operator Sort(L) to sort a relation that is not stored.
It can be applied at the top of the physical-query-plan tree if the result needs to be sorted for an ORDER BY clause (τ).
Notation for Physical Query Plans (cont.)
4. Other relational-algebra operations: descriptive text and annotations elaborate:
the operation performed, e.g., join or grouping;
necessary parameters, e.g., the theta-join condition or the list of elements in a grouping;
a general strategy for the algorithm, e.g., sort-based, hash-based, or index-based;
a decision about the number of passes to be used, e.g., one-pass, two-pass, or multipass;
an anticipated number of buffers the operation will require.
Notation for Physical Query Plans (cont.)
Example of a physical query plan: the plan of Example 16.36 for the case k > 5,000 uses TableScan, two-pass hash joins, materialization (shown with a double line), and a Store operator.
Notation for Physical Query Plans (cont.)
Another example: the plan of Example 16.36 for the case k ≤ 49 uses two TableScans, a two-pass hash join pipelined into a one-pass join (with different buffer needs), and a Store operator.
Notation for Physical Query Plans (cont.)
The physical query plan of Example 16.35 uses the index on the condition y = 2 first, then filters on the remaining conditions.
VII. Ordering of Physical Operations
The PQP is represented as a tree structure with an implied order of operations.
Still, the order of evaluation of interior nodes may not always be clear: iterators are used in a pipelined manner, and because the execution of various nodes overlaps in time, a strict "ordering" may not make sense.
Ordering of Physical Operations (cont.)
Three rules summarize the ordering of events in a PQP tree:
1. Break the tree into subtrees at each edge that represents materialization. Execute one subtree at a time.
2. Order the execution of the subtrees bottom-up and left-to-right.
3. All nodes of each subtree are executed simultaneously.
Summary of Chapter 16
In this part of the presentation I will talk about the main topics of Chapter 16.
COMPILATION OF QUERIES
Compilation means turning a query into a physical query plan, which can be implemented by the query engine.
Steps of query compilation: parsing, semantic checking, selection of the preferred logical query plan, and generation of the best physical plan.
THE PARSER
The first step of SQL query processing: it generates a parse tree whose nodes correspond to SQL constructs, similar to the parser of a programming-language compiler.
VIEW EXPANSION
A very critical part of query compilation: it expands the view references in the query tree into the actual view definitions, providing opportunities for query optimization.
SEMANTIC CHECKING
Checks the semantics of a SQL query by examining the parse tree: attributes, relation names, and types. It also resolves attribute references.
CONVERSION TO A LOGICAL QUERY PLAN
Converts a semantically checked parse tree to an algebraic expression. The conversion is straightforward, but subqueries need special handling; the two-argument selection approach can be used.
ALGEBRAIC TRANSFORMATION
There are many ways to transform a logical query plan using algebraic transformations. The laws used include: commutative and associative laws, laws involving selection (pushing selections), laws involving projection, laws about joins and products, laws involving duplicate elimination, and laws involving grouping and aggregation.
ESTIMATING SIZES OF RELATIONS
True running time is taken into consideration when selecting the best logical plan.
Two factors affect the size estimates the most: the sizes of relations (number of tuples), and the number of distinct values for each attribute of each relation.
Histograms are used by some systems.
COST BASED OPTIMIZING
The best physical query plan is the least costly plan. Factors that decide the cost of a query plan: the order and grouping of operations like joins, unions, and intersections; the join algorithms used (e.g., nested-loop vs. hash join); scanning and sorting operations; and storing intermediate results.
PLAN ENUMERATION STRATEGIES
Common approaches for searching the space of physical plans:
Dynamic programming: tabulating the best plan for each subexpression
Selinger-style programming: also keeping the sort order of results as part of the table
Greedy approaches: making a series of locally optimal decisions
Branch-and-bound: first finding a plan by heuristics, then considering only plans that cost less
LEFT-DEEP JOIN TREES
Left-deep join trees are binary trees with a single spine down the left edge and with leaves as right children.
This strategy reduces the number of plans to be considered for the best physical plan: restrict the search to left-deep join trees when picking a grouping and order for the join of several relations.
PHYSICAL PLANS FOR SELECTION
A selection can be broken into an index-scan of the relation, followed by a filter operation.
The filter then examines the tuples retrieved by the index-scan and passes only those that meet the remaining portions of the selection condition.
PIPELINING VERSUS MATERIALIZING
The flow of data between operators can be controlled to implement pipelining: an operator consumes the result of another operator directly, passed through main memory.
Alternatively, intermediate results can be written out of main memory to save space for other operators; this technique is materialization.
Both pipelining and materialization should be considered by the physical-query-plan generator.
Questions & Answers
Thank you for your attention
Reference
[1] H. Garcia-Molina, J. Ullman, and J. Widom, “Database System: The Complete Book,” second edition: p.897-913, Prentice Hall, New Jersey, 2008
Query Execution
Chapter 15
Section 15.1
Presented by Khadke, Suvarna
CS 257 (Section II) Id 213
Agenda
Query processor and major parts of the query processor
Physical-query-plan operators
Scanning tables
Basic approaches to locate the tuples of a relation R
Sorting while scanning tables
Computation model for physical operators
I/O cost for scan operators
Iterators
What is a Query Processor
The group of components of a DBMS that converts user queries and data-modification commands into a sequence of database operations, and executes those operations.
It must supply detail regarding how the query is to be executed.
Major parts of the query processor
Query execution: the algorithms that manipulate the data of the database.
The focus is on the operations of extended relational algebra.
Outline of Query Compilation
Query compilation:
Parsing: a parse tree for the query is constructed.
Query rewrite: the parse tree is converted to an initial query plan and transformed into a logical query plan.
Physical plan generation: the logical query plan is converted into a physical query plan by selecting algorithms and an order of execution for the operators.
Physical-Query-Plan Operators
Physical operators are implementations of the operators of relational algebra.
They can also implement operations outside relational algebra, like "scan", which scans a table, bringing each tuple of some relation into main memory.
Scanning Tables
One of the basic things we can do in a physical query plan is to read the entire contents of a relation R.
A variation of this operator involves a simple predicate: read only those tuples of R that satisfy the predicate.
Scanning Tables
Basic approaches to locate the tuples of a relation R:
Table-scan: relation R is stored in secondary memory with its tuples arranged in blocks; it is possible to get the blocks one by one.
Index-scan: if there is an index on any attribute of R, we can use this index to get all the tuples of R.
Sorting While Scanning Tables
There are a number of reasons to sort a relation:
The query could include an ORDER BY clause, requiring that a relation be sorted.
Algorithms implementing relational-algebra operations may require one or both arguments to be sorted.
The physical-query-plan operator sort-scan takes a relation R and the attributes on which the sort is to be made, and produces R in that sorted order.
Computation Model for Physical Operators
Physical-plan operators must be chosen wisely; this is essential to a good query processor.
The "cost" of each operator is estimated by the number of disk I/O's for the operation.
By convention, the cost of writing the final answer back to disk is not counted, since it depends only on the size of the answer and not on how the query is computed.
Parameters for Measuring Costs
Parameters that affect the performance of a query:
The buffer space available in main memory at the time the query executes
The size of the input and the size of the output generated
The sizes of blocks on disk and in main memory
Parameters for Measuring Costs
B: the number of blocks needed to hold all tuples of relation R; also denoted B(R).
T: the number of tuples in relation R; also denoted T(R).
V(R, a): the number of distinct values that appear in column a of relation R.
I/O Cost for Scan Operators
If relation R is clustered, then the number of disk I/O's for the table-scan operator is approximately B (that is, B(R)).
If relation R is not clustered, then the number of required disk I/O's is generally much higher.
An index on a relation R occupies many fewer than B(R) blocks; a scan of the entire relation R, which takes at least B disk I/O's, will therefore require more I/O's than reading the entire index.
Iterators for Implementation of Physical Operators
Many physical operators can be implemented as iterators. The three methods forming the iterator for an operation are:
1. Open(): starts the process of getting tuples and initializes any data structures needed to perform the operation.
2. GetNext(): returns the next tuple in the result; if there are no more tuples to return, GetNext returns the special value NotFound.
3. Close(): ends the iteration after all tuples have been consumed; it calls Close on any arguments of the operator.
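The Open/GetNext/Close interface can be sketched directly. A minimal illustration, not the book's code: a leaf scan over an in-memory list stands in for reading R, and a selection iterator pulls tuples from its child on demand, which is exactly how pipelining works.

```python
NotFound = object()   # sentinel returned when an iterator is exhausted

class TableScan:
    """Leaf iterator; an in-memory list stands in for relation R."""
    def __init__(self, rows):
        self.rows = rows
    def Open(self):
        self.pos = 0                     # initialize scan state
    def GetNext(self):
        if self.pos >= len(self.rows):
            return NotFound
        t = self.rows[self.pos]
        self.pos += 1
        return t
    def Close(self):
        pass

class Selection:
    """sigma_pred as an iterator: pulls tuples from its child on demand."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def Open(self):
        self.child.Open()
    def GetNext(self):
        while True:                      # skip tuples failing the predicate
            t = self.child.GetNext()
            if t is NotFound or self.pred(t):
                return t
    def Close(self):
        self.child.Close()

# Pipeline sigma_{t > 2}(R) with R = [1, 5, 3, 8]
it = Selection(TableScan([1, 5, 3, 8]), lambda t: t > 2)
it.Open()
out = []
while (t := it.GetNext()) is not NotFound:
    out.append(t)
it.Close()
assert out == [5, 3, 8]
```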
Reference
H. Garcia-Molina, J. D. Ullman, and J. Widom, Database Systems: The Complete Book, 2nd Edition, 2008.
Query Execution
One-Pass Algorithms for Database Operations (15.2)
Presented by
Ronak Shah
(214)
April 22, 2009
Introduction
The choice of an algorithm for each operator is an essential part of transforming a logical query plan into a physical query plan.
Main classes of algorithms: sorting-based, hash-based, and index-based methods.
Division based on degree of difficulty and cost: one-pass, two-pass, and three-or-more-pass algorithms.
481
One-Pass Algorithm Methods
Tuple-at-a-time, unary operations (selection and projection)
Full-relation, unary operations
Full-relation, binary operations (set and bag versions of union)
482
One-Pass Algorithms for Tuple-at-a-Time Operations
Tuple-at-a-time operations are selection and projection:
read the blocks of R one at a time into an input buffer
perform the operation on each tuple
move the selected or projected tuples to the output buffer
The disk I/O requirement for this process depends only on how the argument relation R is provided: if R is initially on disk, the cost is whatever it takes to perform a table-scan or index-scan of R.
483
A selection or projection being performed on a relation R484
One-Pass Algorithms for Unary, Full-Relation Operations
Duplicate Elimination
To eliminate duplicates, we can read each block of R one at a time, but for each tuple we need to decide whether:
1. It is the first time we have seen this tuple, in which case we copy it to the output, or
2. We have seen the tuple before, in which case we must not output this tuple.
One memory buffer holds one block of R's tuples, and the remaining M - 1 buffers can be used to hold a single copy of every tuple.
Managing memory for a one-pass duplicate-elimination
Duplicate Elimination
When a new tuple from R is considered, we compare it with all tuples seen so far.
If it is not equal to any of them, we copy it to the output and add it to the in-memory list of tuples we have seen.
If there are n tuples in main memory, each new tuple takes processor time proportional to n, so the complete operation takes processor time proportional to n².
To do better, we need a main-memory structure that supports each of these operations efficiently:
Add a new tuple, and
Tell whether a given tuple is already there.
Duplicate Elimination (contd.)
The structures that can be used for such a main-memory search structure are:
Hash table
Balanced binary search tree
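The one-pass δ(R) described above can be sketched as follows. This is an illustrative sketch assuming the distinct tuples all fit in memory; a Python set stands in for the hash table, and the block structure is simplified.

```python
# Minimal sketch of one-pass duplicate elimination δ(R).
# A set plays the role of the in-memory hash table over M-1 buffers.
def one_pass_dedup(blocks):
    seen = set()            # holds a single copy of every tuple seen so far
    output = []
    for block in blocks:    # read R one block at a time
        for t in block:
            if t not in seen:      # first time we see t: copy it to the output
                seen.add(t)
                output.append(t)
            # otherwise: seen before, so do not output it again
    return output

print(one_pass_dedup([[(2,), (5,)], [(2,), (1,)], [(5,)]]))  # [(2,), (5,), (1,)]
```

With a hash table (or a set, as here), each membership test is expected constant time, avoiding the n² behavior of comparing against a plain list.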
One-Pass Algorithms for Unary, Full-Relation Operations
Grouping
The grouping operation gives us zero or more grouping attributes and presumably one or more aggregated attributes.
If we create in main memory one entry for each group, then we can scan the tuples of R, one block at a time.
The entry for a group consists of values for the grouping attributes and an accumulated value or values for each aggregation.
Grouping
The accumulated value is:
For MIN(a) or MAX(a), record the minimum or maximum value, respectively.
For any COUNT aggregation, add 1 for each tuple of the group.
For SUM(a), add the value of attribute a to the accumulated sum for its group.
AVG(a) is a harder case. We must maintain two accumulations: the count of tuples in the group and the sum of the a-values of these tuples. Each is computed as we would for a COUNT and a SUM aggregation, respectively. After all tuples of R are seen, take the quotient of sum and count to obtain the average.
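The AVG case above can be sketched directly: keep one (count, sum) pair per group and divide at the end. This is an illustrative sketch; the function name and the tuple-as-sequence representation are assumptions, and MIN/MAX/COUNT/SUM would be handled analogously.

```python
# Sketch of one-pass grouping γ with an AVG aggregation: one in-memory
# entry per group holding the two accumulations AVG needs.
def one_pass_group_avg(tuples, key_idx, val_idx):
    acc = {}  # group key -> [count, sum]
    for t in tuples:
        c_s = acc.setdefault(t[key_idx], [0, 0])
        c_s[0] += 1            # COUNT: add 1 for each tuple of the group
        c_s[1] += t[val_idx]   # SUM: add the a-value to the running sum
    # after all tuples of R are seen, take quotient of sum and count
    return {k: s / c for k, (c, s) in acc.items()}

print(one_pass_group_avg([("a", 2), ("b", 4), ("a", 4)], 0, 1))
# {'a': 3.0, 'b': 4.0}
```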
One-Pass Algorithms for Binary Operations
Binary operations include:
Union
Intersection
Difference
Product
Join
Set Union
Read S into M - 1 buffers of main memory and build a search structure where the search key is the entire tuple.
All these tuples are also copied to the output.
Read each block of R into the Mth buffer, one at a time.
For each tuple t of R, see if t is in S; if not, copy t to the output. If t is also in S, skip t.
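The steps above can be sketched as follows. This is a minimal illustration assuming S fits in memory; a set stands in for the M-1 buffers plus search structure, and block structure is kept only for R.

```python
# Sketch of one-pass set union: S is held in memory, R streams through.
def one_pass_set_union(r_blocks, s_tuples):
    in_memory = set(s_tuples)        # search structure keyed on the whole tuple
    output = list(in_memory)         # every tuple of S goes to the output
    for block in r_blocks:           # read R into the "Mth buffer", block by block
        for t in block:
            if t not in in_memory:   # output t only if it is not in S
                output.append(t)
    return output

result = one_pass_set_union([[(1,), (2,)], [(3,)]], [(2,), (4,)])
print(sorted(result))  # [(1,), (2,), (3,), (4,)]
```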
Set Intersection
Read S into M - 1 buffers and build a search structure with full tuples as the search key.
Read each block of R, and for each tuple t of R, see if t is also in S. If so, copy t to the output, and if not, ignore t.
Set Difference
Read S into M - 1 buffers and build a search structure with full tuples as the search key.
To compute the set difference R − S, read each block of R and examine each tuple t on that block. If t is in S, then ignore t; if it is not in S, copy t to the output.
To compute S − R, read the blocks of R and examine each tuple t in turn. If t is in S, then delete t from the copy of S in main memory; if t is not in S, do nothing.
After considering each tuple of R, copy to the output those tuples of S that remain.
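Both directions can be sketched in a few lines. This is an illustrative sketch under the same assumption as above: S fits in memory and a set stands in for the search structure.

```python
# Sketch of one-pass set difference, both directions: S is in memory, R streams.
def set_diff_r_minus_s(r_blocks, s_tuples):
    s = set(s_tuples)
    # output tuples of R that do not appear in S
    return [t for block in r_blocks for t in block if t not in s]

def set_diff_s_minus_r(r_blocks, s_tuples):
    s = set(s_tuples)
    for block in r_blocks:
        for t in block:
            s.discard(t)        # delete t from the in-memory copy of S
    return list(s)              # tuples of S that remain after considering all of R

r, s = [[(1,), (2,)], [(3,)]], [(2,), (4,)]
print(set_diff_r_minus_s(r, s))          # [(1,), (3,)]
print(sorted(set_diff_s_minus_r(r, s)))  # [(4,)]
```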
Bag Intersection
Read S into M - 1 buffers.
Multiple copies of a tuple t are not stored individually. Rather, store one copy of t and associate with it a count equal to the number of times t occurs.
Next, read each block of R, and for each tuple t of R see whether t occurs in S. If not, ignore t; it cannot appear in the intersection. If t appears in S, and the count associated with t is still positive, then output t and decrement the count by 1. If t appears in S, but its count has reached 0, then do not output t; we have already produced as many copies of t in the output as there were copies in S.
Bag Difference
To compute the bag difference S − R, read the tuples of S into main memory and count the number of occurrences of each distinct tuple.
Then read R; check each tuple t to see whether t occurs in S, and if so, decrement its associated count. At the end, copy to the output each tuple in main memory whose count is positive, copying it as many times as that count.
To compute the bag difference R − S, read the tuples of S into main memory and count the number of occurrences of each distinct tuple.
Bag Difference (contd.)
Think of a tuple t with a count of c as c reasons not to copy t to the output as we read the tuples of R.
Read a tuple t of R and check whether t occurs in S. If not, copy t to the output. If t does occur in S, look at the current count c associated with t. If c = 0, copy t to the output. If c > 0, do not copy t to the output, but decrement c by 1.
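The "c reasons not to copy t" idea maps directly onto a counter. A minimal sketch, using integers as stand-ins for tuples and Python's Counter as the in-memory count structure:

```python
# Sketch of one-pass bag difference R - S using per-tuple counts:
# each copy of t in S is one "reason" not to emit a copy of t from R.
from collections import Counter

def bag_diff_r_minus_s(r_tuples, s_tuples):
    counts = Counter(s_tuples)   # occurrences of each distinct tuple of S
    output = []
    for t in r_tuples:
        if counts[t] > 0:
            counts[t] -= 1       # consume one reason not to output t
        else:
            output.append(t)     # no copies of t left in S: output it
    return output

print(bag_diff_r_minus_s([1, 1, 2, 3, 3], [1, 3]))  # [1, 2, 3]
```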
Product
Read S into M - 1 buffers of main memory.
Then read each block of R, and for each tuple t of R concatenate t with each tuple of S in main memory.
Output each concatenated tuple as it is formed.
This algorithm may take a considerable amount of processor time per tuple of R, because each such tuple must be matched with M - 1 blocks full of tuples. However, the output size is also large, and the time per output tuple is small.
Natural Join
Convention: R(X, Y) is being joined with S(Y, Z), where Y represents all the attributes that R and S have in common, X is all attributes of R that are not in the schema of S, and Z is all attributes of S that are not in the schema of R. Assume that S is the smaller relation.
To compute the natural join, do the following:
1. Read all tuples of S and form them into a main-memory search structure. A hash table or balanced tree are good examples of such structures. Use M - 1 blocks of memory for this purpose.
Natural Join (contd.)
2. Read each block of R into the one remaining main-memory buffer. For each tuple t of R, find the tuples of S that agree with t on all attributes of Y, using the search structure. For each matching tuple of S, form a tuple by joining it with t, and move the resulting tuple to the output.
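The two steps above can be sketched with a hash table keyed on Y. This is an illustrative sketch: tuples are represented as dicts purely for readability, and the function and parameter names are assumptions.

```python
# Sketch of the one-pass natural join R(X,Y) |><| S(Y,Z):
# load S into an in-memory hash table keyed on Y, then stream and probe R.
from collections import defaultdict

def one_pass_join(r_tuples, s_tuples, y_attrs):
    index = defaultdict(list)                       # search structure over S
    for s in s_tuples:
        index[tuple(s[a] for a in y_attrs)].append(s)
    out = []
    for r in r_tuples:                              # stream R one tuple at a time
        for s in index[tuple(r[a] for a in y_attrs)]:
            out.append({**r, **s})                  # join matching tuples
    return out

R = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]
S = [{"y": "a", "z": 10}, {"y": "a", "z": 11}]
print(one_pass_join(R, S, ["y"]))
# [{'x': 1, 'y': 'a', 'z': 10}, {'x': 1, 'y': 'a', 'z': 11}]
```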
QUERY EXECUTION
15.3 Nested-Loop Joins
By: Saloni Tamotia (215)
Introduction to Nested-Loop Joins
Used for relations of any size; it is not necessary that either relation fits in main memory.
Uses a "one-and-a-half" pass method: in each variation, one argument is read just once, while the other argument is read repeatedly.
Two kinds:
Tuple-Based Nested-Loop Join
Block-Based Nested-Loop Join
ADVANTAGES OF NESTED-LOOP JOIN
Fits in the iterator framework.
Allows us to avoid storing an intermediate relation on disk.
Tuple-Based Nested-Loop Join
Simplest variation of the nested-loop join
Loop ranges over individual tuples
Tuple-Based Nested-Loop Join
Algorithm to compute the join of R(X,Y) and S(Y,Z):

FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make tuple t THEN
            output t

R and S are two relations with r and s as tuples.
Carelessness in buffering of blocks can cause this algorithm to use T(R)T(S) disk I/O's.
IMPROVEMENT & MODIFICATION
To decrease the cost:
Method 1: Use an algorithm for index-based joins.
We find the tuples of R that match a given tuple of S, so we need not read the entire relation R.
Method 2: Use an algorithm for block-based joins.
Tuples of R and S are accessed by block, and enough memory is used to store blocks in order to reduce the number of disk I/O's.
Block-Based Nested-Loop Join Algorithm
Access to the arguments is organized by block, so while reading tuples of the inner relation we use fewer disk I/O's.
Use as much main memory as possible to store tuples of the relation of the outer loop; this allows us to join each tuple of the inner relation with as many tuples as possible.
Block-Based Nested-Loop Join Algorithm
ALGORITHM:

FOR each chunk of M-1 blocks of S DO
    FOR each block b of R DO
        FOR each tuple t of b DO
            find the tuples of S in memory that join with t;
            output the join of t with each of these tuples
Block-Based Nested-Loop Join Algorithm
Assumptions: B(S) ≤ B(R) and B(S) > M
This means that neither relation fits entirely in main memory.
Analysis of Nested-Loop Join
Number of disk I/O's:
⌈B(S)/(M-1)⌉ × (M - 1 + B(R))
or
B(S) + B(S)B(R)/(M-1), or approximately B(S)B(R)/M
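The cost formula is easy to sanity-check numerically. The sizes below (B(S)=100, B(R)=1000, M=11) are illustrative, not from the slides:

```python
# Arithmetic check of the block-based nested-loop join cost formula.
import math

def nlj_cost(b_s, b_r, m):
    chunks = math.ceil(b_s / (m - 1))      # outer-loop iterations
    return chunks * ((m - 1) + b_r)        # read one chunk of S, then all of R

print(nlj_cost(100, 1000, 11))   # 10 * (10 + 1000) = 10100
print(100 * 1000 // 10)          # approximation B(S)*B(R)/(M-1) = 10000
```

The exact cost (10100) and the approximation (10000) agree closely, since the B(S) term is small compared with B(S)B(R)/(M-1).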
TWO-PASS ALGORITHMS BASED ON SORTING
SECTION 15.4
Rupinder Singh
Two-Pass Algorithms Based on Sorting
Two-pass algorithms: data from the operand relations is read into main memory, processed in some way, written out to disk again, and then reread from disk to complete the operation.
Basic idea
Step 1: Read M blocks of R into main memory.
Step 2: Sort these M blocks in main memory, using an efficient, main-memory sorting algorithm, so we expect that the time to sort will not exceed the disk I/O time for step (1).
Step 3: Write the sorted list into M blocks of disk.
Duplicate Elimination Using Sorting δ(R)
First we sort the tuples of R into sorted sublists.
Then we use the available main memory to hold one block from each sorted sublist.
Then we repeatedly copy the least unconsidered tuple to the output and ignore all tuples identical to it.
The total cost of this algorithm is 3B(R) disk I/O's.
This algorithm requires only √B(R) blocks of main memory, rather than the B(R) blocks of the one-pass algorithm.
Example
Suppose that tuples are integers, and only two tuples fit on a block. Also, M = 3 and the relation R consists of 17 tuples:
2,5,2,1,2,2,4,5,4,3,4,2,1,5,2,1,3
After the first pass:

Sublist  Elements
R1       1,2,2,2,2,5
R2       2,3,4,4,4,5
R3       1,1,2,3,5
Example
Second pass:

Sublist  In memory  Waiting on disk
R1       1,2        2,2,2,5
R2       2,3        4,4,4,5
R3       1,1        2,3,5

After processing tuple 1 (Output: 1):

Sublist  In memory  Waiting on disk
R1       2          2,2,2,5
R2       2,3        4,4,4,5
R3       2,3        5

Continue the same process with the next tuple.
Grouping and Aggregation Using Sorting γ(R)
The two-pass algorithm for grouping and aggregation is quite similar to the previous algorithm.
Step 1: Read the tuples of R into memory, M blocks at a time. Sort each group of M blocks, using the grouping attributes of L as the sort key. Write each sorted sublist to disk.
Step 2: Use one main-memory buffer for each sublist, and initially load the first block of each sublist into its buffer.
Step 3: Repeatedly find the least value of the sort key (grouping attributes) present among the first available tuples in the buffers.
This algorithm takes 3B(R) disk I/O's, and will work as long as B(R) ≤ M².
A Sort-Based Union Algorithm
For bag union, the one-pass algorithm is used. For set union:
Step 1: Repeatedly bring M blocks of R into main memory, sort their tuples, and write the resulting sorted sublist back to disk.
Step 2: Do the same for S, to create sorted sublists for relation S.
Step 3: Use one main-memory buffer for each sublist of R and S. Initialize each with the first block from the corresponding sublist.
Step 4: Repeatedly find the first remaining tuple t among all the buffers. Copy t to the output, and remove from the buffers all copies of t (if R and S are sets, there should be at most two copies).
This algorithm takes 3(B(R)+B(S)) disk I/O's, and will work as long as B(R)+B(S) ≤ M².
Sort-Based Intersection and Difference
For both the set and bag versions, the algorithm is the same as that of set union, except for the way we handle the copies of a tuple t at the fronts of the sorted sublists.
For set intersection, output t if it appears in both R and S.
For bag intersection, output t the minimum of the number of times it appears in R and in S.
For set difference R − S, output t if and only if it appears in R but not in S.
For bag difference R − S, output t the number of times it appears in R minus the number of times it appears in S.
A Simple Sort-Based Join Algorithm
When taking a join, the number of tuples from the two relations that share a common value of the join attribute(s), and therefore need to be in main memory simultaneously, can exceed what fits in memory.
To avoid this situation, we can try to reduce main-memory use for other aspects of the algorithm, and thus make available a large number of buffers to hold the tuples with a given join-attribute value.
A Simple Sort-Based Join Algorithm
Given relations R(X, Y) and S(Y, Z) to join, and given M blocks of main memory for buffers:
Step 1: Sort R and S, using a two-phase, multiway merge sort, with Y as the sort key.
Step 2: Merge the sorted R and S. The following steps are done repeatedly:
Find the least value y of the join attributes Y that is currently at the front of the blocks for R and S.
If y does not appear at the front of the other relation, then remove the tuple(s) with sort key y.
Otherwise, identify all the tuples from both relations having sort key y. Output all the tuples that can be formed by joining tuples from R and S with the common Y-value y.
If either relation has no more unconsidered tuples in main memory, reload the buffer for that relation.
A Simple Sort-Based Join Algorithm
The simple sort-join uses 5(B(R) + B(S)) disk I/O's.
It requires B(R) ≤ M² and B(S) ≤ M² to work.
A More Efficient Sort-Based Join
If we do not have to worry about very large numbers of tuples with a common value for the join attribute(s), then we can save two disk I/O's per block by combining the second phase of the sorts with the join itself.
To compute R(X, Y) ⋈ S(Y, Z) using M main-memory buffers:
Create sorted sublists of size M, using Y as the sort key, for both R and S.
Bring the first block of each sublist into a buffer.
Repeatedly find the least Y-value y among the first available tuples of all the sublists. Identify all the tuples of both relations that have Y-value y. Output the join of all tuples from R with all tuples from S that share this common Y-value.
A More Efficient Sort-Based Join
The number of disk I/O's is 3(B(R) + B(S)).
It requires B(R) + B(S) ≤ M² to work.
Summary of Sort-Based Algorithms

Operator              Approximate M required   Disk I/O
γ, δ                  √B(R)                    3B(R)
∪, ∩, −               √(B(R) + B(S))           3(B(R) + B(S))
⋈ (simple)            √(max(B(R), B(S)))       5(B(R) + B(S))
⋈ (more efficient)    √(B(R) + B(S))           3(B(R) + B(S))
QUERY EXECUTION
15.5 TWO-PASS ALGORITHMS BASED ON HASHING
By Swathi Vegesna
At a glimpse
Introduction
Partitioning Relations by Hashing
Algorithm for Duplicate Elimination
Grouping and Aggregation
Union, Intersection, and Difference
Hash-Join Algorithm
Sort-based vs. Hash-based
Summary
Introduction
Hashing is used when the data is too big to store in main-memory buffers.
Hash all the tuples of the argument(s) using an appropriate hash key.
For all the common operations, there is a way to select the hash key so that all the tuples that need to be considered together when we perform the operation have the same hash value.
This reduces the size of the operand(s) by a factor equal to the number of buckets.
Partitioning Relations by Hashing
Algorithm:

initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO BEGIN
    read block b into the Mth buffer;
    FOR each tuple t in b DO BEGIN
        IF the buffer for bucket h(t) has no room for t THEN BEGIN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        END;
        copy t to the buffer for bucket h(t);
    END;
END;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
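A stripped-down sketch of this partitioning pass, with in-memory lists standing in for the disk-resident buckets (a real system would spill each bucket's buffer to disk when full):

```python
# Sketch of hash partitioning: route each tuple of R to one of M-1 buckets
# by a hash on the whole tuple. Lists stand in for on-disk buckets.
def partition(r_tuples, num_buckets):
    buckets = [[] for _ in range(num_buckets)]    # the M-1 buckets
    for t in r_tuples:
        buckets[hash(t) % num_buckets].append(t)  # route t by its hash value
    return buckets

buckets = partition([(i,) for i in range(20)], 4)
print([len(b) for b in buckets])   # 20 tuples spread across 4 buckets
```

The key property the two-pass algorithms rely on is visible here: equal tuples always hash to the same bucket, so each bucket can then be processed independently by a one-pass algorithm.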
Duplicate Elimination
For the operation δ(R), hash R to M-1 buckets.
(Note that two copies of the same tuple t will hash to the same bucket.)
Do duplicate elimination on each bucket Ri independently, using the one-pass algorithm.
The result is the union of δ(Ri), where Ri is the portion of R that hashes to the ith bucket.
Requirements
Number of disk I/O's: 3B(R).
The two-pass, hash-based algorithm will work only if B(R) ≤ M(M-1), i.e., roughly B(R) ≤ M².
In order for this to work, we need:
the hash function h to distribute the tuples evenly among the buckets, and
each bucket Ri to fit in main memory (to allow the one-pass algorithm).
Grouping and Aggregation
Hash all the tuples of relation R to M-1 buckets, using a hash function that depends only on the grouping attributes. (Note: all tuples in the same group end up in the same bucket.)
Use the one-pass algorithm to process each bucket independently.
Uses 3B(R) disk I/O's and requires B(R) ≤ M².
Union, Intersection, and Difference
For a binary operation we use the same hash function to hash the tuples of both arguments.
For R ∪ S we hash both R and S to M-1 buckets each.
For R ∩ S we hash both R and S, using 2(M-1) buckets in total.
For R − S we hash both R and S, using 2(M-1) buckets in total.
Requires 3(B(R)+B(S)) disk I/O's.
The two-pass hash-based algorithm requires min(B(R), B(S)) ≤ M².
Hash-Join Algorithm
Use the same hash function for both relations; the hash function should depend only on the join attributes.
Hash R to M-1 buckets R1, R2, ..., RM-1.
Hash S to M-1 buckets S1, S2, ..., SM-1.
Do a one-pass join of Ri and Si, for each i.
3(B(R) + B(S)) disk I/O's; requires min(B(R), B(S)) ≤ M².
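The three steps above can be sketched end to end. This is an illustrative sketch: tuples are (y, payload) pairs and the bucket count is small, both choices made only for the example.

```python
# Sketch of the two-pass hash join: partition both relations on the join
# attribute with the same hash function, then one-pass join each bucket pair.
from collections import defaultdict

def hash_join(r_tuples, s_tuples, num_buckets=3):
    r_b = [[] for _ in range(num_buckets)]
    s_b = [[] for _ in range(num_buckets)]
    for y, x in r_tuples:
        r_b[hash(y) % num_buckets].append((y, x))   # hash R on join attribute Y
    for y, z in s_tuples:
        s_b[hash(y) % num_buckets].append((y, z))   # same hash function for S
    out = []
    for ri, si in zip(r_b, s_b):                    # one-pass join per bucket pair
        index = defaultdict(list)
        for y, z in si:
            index[y].append(z)
        for y, x in ri:
            for z in index[y]:
                out.append((x, y, z))
    return out

print(sorted(hash_join([("a", 1), ("b", 2)], [("a", 10), ("c", 30)])))
# [(1, 'a', 10)]
```

Only joining tuples can end up in matching bucket pairs, which is why each pair can be handled in isolation with the one-pass join.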
Sort-based vs. Hash-based
For binary operations, hash-based algorithms limit the memory requirement to the minimum of the argument sizes, not their sum.
Sort-based algorithms can produce output in sorted order, which can be helpful later in the query plan.
Hash-based algorithms depend on the buckets being of nearly equal size.
Sort-based algorithms can experience reduced rotational latency or seek time.
Summary
Partitioning Relations by Hashing
Algorithm for Duplicate Elimination
Grouping and Aggregation
Union, Intersection, and Difference
Hash-Join Algorithm
Sort-based vs. Hash-based
Index-Based Algorithms
Chapter 15, Section 15.6
Presented by Fan Yang
CS 257, Class ID 218
Clustering and Nonclustering Indexes
Clustered relation: the tuples are packed into roughly as few blocks as can possibly hold those tuples.
Clustering index: an index such that all the tuples with a fixed value for the search key of this index appear on roughly as few blocks as can hold them.
Clustering and Nonclustering Indexes
A relation that isn't clustered cannot have a clustering index.
A clustered relation can have nonclustering indexes.
Index-Based Selection
For a selection σC(R), suppose C is of the form a=v, where a is an attribute
For clustering index R.a: the number of disk I/O’s will be B(R)/V(R,a)
Index-Based Selection
The actual number may be higher because:
1. the index is not kept entirely in main memory,
2. the tuples may spread over more blocks, and
3. the tuples may not be packed as tightly as possible into blocks.
Example
B(R) = 1000, T(R) = 20,000. Number of I/O's required:
1. clustered, no index: 1000
2. not clustered, no index: 20,000
3. if V(R,a) = 100 and the index is clustering: 10
4. if V(R,a) = 10 and the index is nonclustering: 2,000
Joining by Using an Index
Natural join R(X, Y) ⋈ S(Y, Z)
Number of I/O's to get R:
Clustered: B(R)
Not clustered: T(R)
Number of I/O's to get the matching tuples of S, over all tuples t of R:
Clustering index on Y: T(R)B(S)/V(S,Y)
Nonclustering index on Y: T(R)T(S)/V(S,Y)
Example
R(X,Y) occupies 1000 blocks; S(Y,Z) occupies 500 blocks.
Assume 10 tuples in each block, so T(R) = 10,000 and T(S) = 5,000; V(S,Y) = 100.
If R is clustered, and there is a clustering index on Y for S:
the number of I/O's for R is 1000;
the number of I/O's for S is T(R)B(S)/V(S,Y) = 10,000 × 500/100 = 50,000.
Joins Using a Sorted Index
Natural join R(X, Y) ⋈ S(Y, Z) with an index on Y for either R or S.
Extreme case: the zig-zag join.
Example: relations R(X,Y) and S(Y,Z) with an index on Y for both relations.
Search keys (Y-values) for R: 1,3,4,4,5,6
Search keys (Y-values) for S: 2,2,4,6,7,8
CHAPTER 15.7 BUFFER MANAGEMENT
ID: 219, Name: Qun Yu
Class: CS257 Spring 2009, Instructor: Dr. T.Y. Lin
What does a buffer manager do?
Assume there are M main-memory buffers needed by the operators on relations to store needed data.
In practice:
1) buffers are rarely allocated in advance, and
2) the value of M may vary depending on system conditions.
Therefore, a buffer manager is used to allow processes to get the memory they need, while minimizing delay and unsatisfiable requests.
The role of the buffer manager
Figure 1: The role of the buffer manager: it responds to requests for main-memory access to disk blocks, mediating reads and writes between the requesters and the buffers.
15.7.1 Buffer Management Architecture
Two broad architectures for a buffer manager:
1) The buffer manager controls main memory directly (typical of relational DBMS's).
2) The buffer manager allocates buffers in virtual memory, allowing the OS to decide how to use them (typical of "main-memory" DBMS's and "object-oriented" DBMS's).
Buffer Pool
Key setting for the buffer manager to be efficient:
The buffer manager should limit the number of buffers in use so that they fit in the available main memory, i.e., don't exceed the available space.
The number of buffers is a parameter set when the DBMS is initialized.
No matter which architecture of buffering is used, we simply assume that there is a fixed-size buffer pool, a set of buffers available to queries and other database actions.
Data must be in RAM for the DBMS to operate on it; the buffer manager hides the fact that not all data is in RAM.
Figure: The buffer pool. Page requests from higher levels are satisfied from frames in main memory; disk pages are read from the DB on disk into free frames, and the choice of frame to reuse is dictated by the replacement policy.
15.7.2 Buffer Management Strategies
Buffer-replacement strategies:
When a buffer is needed for a newly requested block and the buffer pool is full, which block should be thrown out of the buffer pool?
Buffer-replacement strategy -- LRU
Least-Recently Used (LRU):
Throw out the block that has not been read or written for the longest time.
• Requires more maintenance but is effective: the access-time table must be updated on every access.
• The intuition is that blocks not used recently are less likely to be accessed soon than other blocks.
Buffer-replacement strategy -- FIFO
First-In-First-Out (FIFO):
The buffer that has been occupied the longest by the same block is emptied and used for the new block.
• Requires less maintenance (only the loading time is kept) but can make more mistakes.
• The oldest block isn't necessarily less likely to be accessed. Example: the root block of a B-tree index.
Buffer-replacement strategy – “Clock”
The “Clock” Algorithm (“Second Chance”)
Think of the buffers as arranged in a circle, as shown in Figure 3.
Flags 0 and 1:
buffers with a 0 flag are ok to send their contents back to disk, i.e., ok to be replaced;
buffers with a 1 flag are not ok to be replaced.
Buffer-replacement strategy – “Clock”
Figure 3: the clock algorithm. The eight buffers are arranged in a circle with flags 0 and 1. A "hand" sweeps clockwise from its start point searching for a buffer with a 0 flag; that buffer is the one replaced. As the hand passes a buffer with a 1 flag, it sets the flag to 0; by the next time the hand reaches that buffer, if its contents have not been accessed in the meantime (flag still 0), the buffer will be replaced. That's the “second chance”.
Buffer-replacement strategy -- Clock
A buffer's flag is set to 1 when:
a block is read into the buffer, or
the contents of the buffer are accessed.
A buffer's flag is set to 0 when:
the buffer manager needs a buffer for a new block: it looks for the first 0 it can find, rotating clockwise, and as it passes 1's it sets them to 0.
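The clock sweep described above fits in a few lines. This is an illustrative sketch: the class and method names are assumptions, and page contents are omitted to keep only the flag/hand mechanics.

```python
# Sketch of "second chance" clock replacement: buffers sit in a circle with
# a flag bit; the hand clears 1-flags as it passes until it finds a 0.
class Clock:
    def __init__(self, n):
        self.flags = [0] * n   # one flag bit per buffer
        self.hand = 0

    def access(self, i):
        self.flags[i] = 1      # set the flag when a buffer's contents are used

    def choose_victim(self):
        while True:
            if self.flags[self.hand] == 0:          # 0 flag: ok to replace
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.flags)
                return victim
            self.flags[self.hand] = 0               # give it a second chance
            self.hand = (self.hand + 1) % len(self.flags)

clk = Clock(4)
clk.access(0); clk.access(1)
print(clk.choose_victim())  # 2: the first 0-flag buffer, after clearing flags 0 and 1
```

Buffers 0 and 1 survive this sweep (their flags drop to 0), so they will be evicted on the next sweep only if they are not accessed again in the meantime.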
System Control helps the Buffer-replacement strategy
System Control:
The query processor or other components of a DBMS can give advice to the buffer manager in order to avoid some of the mistakes that would occur with a strict policy such as LRU, FIFO, or Clock.
For example:
A “pinned” block is one that can't be moved to disk without first modifying certain other blocks that point to it.
With FIFO, pinning can be used to force the root of a B-tree to remain in memory at all times.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Problem:
A physical operator expects a certain number of buffers M for execution.
However, the buffer manager may not be able to guarantee that these M buffers are available.
15.7.3 The Relationship Between Physical Operator Selection and Buffer Management
Questions:
Can the algorithm adapt to changes of M, the number of main-memory buffers available?
When fewer than M buffers are available, and some blocks have to be kept on disk instead of in memory, how does the buffer-replacement strategy impact performance (i.e., the number of additional I/O's)?
Example

FOR each chunk of M-1 blocks of S DO BEGIN
    read these blocks into main-memory buffers;
    organize their tuples into a search structure whose
        search key is the common attributes of R and S;
    FOR each block b of R DO BEGIN
        read b into main memory;
        FOR each tuple t of b DO BEGIN
            find the tuples of S in main memory that
                join with t;
            output the join of t with each of these tuples;
        END;
    END;
END;

Figure 15.8: The nested-loop join algorithm
Example
The number of outer-loop iterations depends on the average number of buffers available at each iteration.
The outer loop uses M-1 buffers, and 1 buffer is reserved for a block of R, the relation of the inner loop.
If we pin the M-1 blocks we use for S on one iteration of the outer loop, we shall not lose their buffers during the round.
Also, more buffers may become available, and then we could keep more than one block of R in memory.
Will these extra buffers improve the running time?
Example
CASE 1: NO
Buffer-replacement strategy: LRU; buffers for R: k.
We read each block of R in order into the buffers. By the end of an iteration of the outer loop, the last k blocks of R are in the buffers. However, the next iteration starts from the beginning of R again, so all k buffers for R will need to be replaced.
Example
CASE 2: YES
Buffer-replacement strategy: LRU; buffers for R: k.
We read the blocks of R in an order that alternates: first-to-last, then last-to-first. In this way, we save k disk I/O's on each iteration of the outer loop except the first.
Other Algorithms and M Buffers
Other algorithms are also affected by M and the buffer-replacement strategy.
Sort-based algorithms: if M shrinks, we can change the size of a sublist. An unexpected result: there may be too many sublists to allocate each sublist a buffer.
Hash-based algorithms: if M shrinks, we can reduce the number of buckets, as long as the buckets still fit in M buffers.
CHAPTER 15 QUERY EXECUTION
15.8 ALGORITHMS USING MORE THAN TWO PASSES
Presented by: Kai Zhu
Professor: Dr. T.Y. Lin
Class ID: 220
Intro
Why we use more than two passes
Multi-pass Sort-based Algorithms
Conclusion
Reason for using more than two passes:
Two passes are usually enough; however, for the largest relations, we use as many passes as necessary.
Multi-pass Sort-based Algorithms
Suppose we have M main-memory buffers available to sort a relation R, which we assume is stored clustered.
Then we do the following:
BASIS: If R fits in M blocks (i.e., B(R) ≤ M):
1. Read R into main memory.
2. Sort it using any main-memory sorting algorithm.
3. Write the sorted relation to disk.
INDUCTION: If R does not fit into main memory:
1. Partition the blocks holding R into M groups, which we shall call R1, R2, ..., RM.
2. Recursively sort Ri for each i = 1, 2, ..., M.
3. Merge the M sorted sublists.
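The basis/induction structure above can be sketched with in-memory lists standing in for disk-resident blocks. This is an illustration only: the function name, the block-size parameter, and the use of integers as tuples are all assumptions for the example.

```python
# Sketch of the recursive multi-pass sort: if R fits in M blocks, sort it in
# memory (BASIS); otherwise split into M pieces, recursively sort each, and
# merge the M sorted sublists (INDUCTION).
import heapq

def multipass_sort(tuples, m, block_size=2):
    if len(tuples) <= m * block_size:        # BASIS: B(R) <= M
        return sorted(tuples)                # any main-memory sort
    size = -(-len(tuples) // m)              # INDUCTION: partition into M groups
    groups = [tuples[i:i + size] for i in range(0, len(tuples), size)]
    sublists = [multipass_sort(g, m, block_size) for g in groups]
    return list(heapq.merge(*sublists))      # merge the M sorted sublists

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
print(multipass_sort(data, m=2))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With m=2 and two tuples per block, the ten-element input is too big for one pass, so the function recurses one extra level before merging, mirroring a three-pass sort.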
If we are not merely sorting R, but performing a unary operation such as δ or γ on R, we can modify the above so that at the final merge we perform the operation on the tuples at the front of the sorted sublists. That is:
For a δ, output one copy of each distinct tuple, and skip over copies of the tuple.
For a γ, sort on the grouping attributes only, and combine the tuples with a given value of these grouping attributes.
Conclusion
The two-pass algorithms based on sorting or hashing have natural recursive analogs that take three or more passes and will work for larger amounts of data.