fall 2005ceng 351 data management and file structures1 fundamental file structure concepts

43
Fall 2005 CENG 351 Data Management and File Struc tures 1 Fundamental File Structure Concepts

Upload: conrad-nash

Post on 31-Dec-2015

230 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 1

Fundamental File Structure Concepts

Page 2: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 2

Outline

• File operations

• Field and record organization

• Sequential search and direct access

• Sequential Files

• Sorted Sequential Files

• Co-sequential processing

Page 3: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 3

Files and OS

• The files under an operating systems are maintained by a header.

• Header information may include all the information necessary to maintain the file, such as type, size, disk addresses of the first and last blocks or records, time created, time last updated, protection data, and other information based on the file type and OS.

Page 4: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 4

File types

A file can be seen as

• A stream of bytes (no structure),– Data semantics is lost: there is no way to get

it apart again.

or

• A collection of records with fields (more to come)

Page 5: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 5

Field and Record Organization

DefinitionsRecord: a collection of related fields.Field: the smallest logically meaningful unit of

information in a file.Key: a subset of the fields in a record used to

identify (uniquely) the record.

e.g. In the example file of books:– Each line corresponds to a record.– Fields in each record: ISBN, Author, Title

Page 6: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 6

Record Keys

• Primary key: a key that uniquely identifies a record.

• Secondary key: other keys that may be used for search– Author name

– Book title

– Author name + book title

• Note that in general not every field is a key (keys correspond to fields, or a combination of fields, that may be used in a search).

Page 7: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 7

Field Structures• Fixed-length fields

87359Carroll Alice in wonderland38180Folk File Structures

• Begin each field with a length indicator058735907Carroll19Alice in wonderland053818004Folk15File Structures

• Place a delimiter at the end of each field87359|Carroll|Alice in wonderland|38180|Folk|File Structures|

• Store field as keyword = value ISBN=87359|AU=Carroll|TI=Alice in wonderland|ISBN=38180|AU=Folk|TI=File Structures|

Page 8: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 8

Type Advantages Disadvantages

Fixed Easy to read/write Waste space with padding

Length-based Easy to jump ahead to the end of the field

Long fields require more than 1 byte

Delimited May waste less space than with length-based

Have to check every byte of field against the delimiter

Keyword Fields are self describing. Allows for missing fields

Waste space with keywords

Page 9: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 9

Record Structures

1. Fixed-length records.

2. Fixed number of fields.

3. Begin each record with a length indicator.

4. Use an index to keep track of addresses.

5. Place a delimiter at the end of the record.

Read sections 4.1.5, 4.1.6, 4.2, 4.4 for implementations of different record structures

Page 10: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 10

Fixed-length recordsTwo ways of making fixed-length records:1. Fixed-length records with fixed-length

fields.

2. Fixed-length records with variable-length fields.

87359 Carroll Alice in wonderland

03818 Folk File Structures

87359|Carroll|Alice in wonderland| unused

38180|Folk|File Structures| unused

Page 11: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 11

Variable-length records• Fixed number of fields:

• Record beginning with length indicator:

• Use an index file to keep track of record addresses:– The index file keeps the byte offset for each record; this allows us

to search the index (which have fixed length records) in order to discover the beginning of the record.

• Placing a delimiter: e.g. end-of-line char

87359|Carroll|Alice in wonderland|38180|Folk|File Structures| ...

3387359|Carroll|Alice in wonderland|2638180|Folk|File Structures| ..

Page 12: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 12

Type Advantages Disadvantages

Fixed length record

Easy to jump to the i-th record

Waste space with padding

Variable-length record

Saves space when record sizes are diverse

Cannot jump to the i-th record, unless through an index file or sequential advance

Page 13: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 13

File Organization

• Four basic types of organization:

1. Sequential

2. Indexed

3. Direct

4. Multi-key• In all cases we view a file as a sequence of

records. • A record is a list of fields. Each field has a data

type.

today

Page 14: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 14

File Operations

• Typical Operations:– Retrieve a record – Insert a record– Delete a record– Modify a field of a record– Sort– Merge– Match

Page 15: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 15

Sequential files

• Records are stored contiguously on the storage device.

• Sequential files are read from beginning to end.• Some operations are very efficient on sequential

files (e.g. finding averages)• Organization of records:

1. Unordered sequential files (pile files)

2. Sorted sequential files (records are ordered by some field or fields)

Page 16: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 16

Pile Files

• A pile file is a succession of records, simply placed one after another with no additional structure.

• Records may vary in length.

• Typical Request:– Print all records with a given field value

• e.g. print all books by Folk.

– We must examine each record in the file, in order, starting from the first record.

Page 17: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 17

Timing computations for sequential files

• The smallest data unit that can be read from the disk is a block.

• Thus, knowing the disk address of the block containing the record, time to fetch a record is sum of seek(s), rotational latency(r) and the block transfer time(btt). In fast sequential reading once the starting block is reached, for the remaining of the blocks the rotational latency is assumed zero. This is possible if the blocks are arranged with correct staggering on the disk, on the same or consecutive cylinders.

• For a large file, with very large blocks (b), s and r can be ignored entirely.

Page 18: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 18

Searching Sequential Files• To look-up a record, given the value of one or more of its

fields, we must search the whole file.• In general, (b is the total number of blocks in file):

– At least 1 block is accessed– At most b blocks are accessed.– Thus, the average number of blocks to be read for a randomly

chosen target block to be reached will be:

• (1/b)*ib =(1/b)(b*(b+1)/2)=~b/2. • Thus, time to find and read a record in a pile file is

approximately : TF = (b/2) * btt

Time to fetch one record

Page 19: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 19

Exhaustive Reading of the File

• Time it takes to read and process all records sequentially

TX = b * btt(approximately twice the time to fetch one record)

• Fetching the logical next record is the same as TF

above. Why?– At average, half the records have to be read to locate

the next target record.

Page 20: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 20

Exhaustive Reading of the File

• Note that effective block transfer time (ebt) includes a gap and the block transfer time.– Note that ebt>btt, but we will ignore

thedifference…

Page 21: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 21

Exhaustive Reading of the File

• Random reading of b blocks, where s and r cannot be ignored, as they apply to all independent block reads:

Tx=b(s+r+btt)

Page 22: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 22

Inserting a new record• Just place the new record at the end of the

file (assuming that we don’t worry about duplicates)

– Read the last block– Put the record at the end.

=> TI = s + r + btt + (2r – btt ) + btt

=> TI = s+3r +btt

Time to wait to re-write

Page 23: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 23

Updating a record

• For fixed length records:

TU (fixed length) = TF + 2r

• For variable length records update is treated as a combination of delete and insert.

TU (variable length) = TD + TI

Page 24: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 24

Deleting Records

• What happens if records are deleted? – There is no basic operation that

allows us to “remove part of a file” easily.

– We have to use combination of operations

Page 25: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 25

Strategies for record deletion

1. Record deletion and Storage compaction:– Deletion can be done by “marking” a record

as deleted. • e.g. Place ‘*’ at the beginning of the record

– The space for the record is not released, but the file processing program must include logic that checks if record is deleted or not.

– After a lot of records have been deleted, a special program is used to compact or squeeze the file – this is called Storage Compaction.

Page 26: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 26

Strategies for record deletion (cont.)

2. Deleting fixed-length records and reclaiming space dynamically.

• How to use the space freed by deleted records for storing records that are added later?

– Use an “AVAIL LIST”, a linked list of available records.

– A header record (at the beginning of the file) stores the beginning of the AVAIL LIST

– When a record is deleted, it is marked as deleted and inserted into AVAIL LIST

Page 27: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 27

Strategies for record deletion (cont.)

3. Deleting variable length records– Use AVAIL LIST as before, but take care of

the variable-length difficulties.– The records in AVAIL LIST must store its

size as a field. Exact byte offset must be used.– Addition of records must find a large enough

record in AVAIL LIST

Page 28: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 28

Placement Strategies• There are several placement strategies for

selecting a record in AVAIL LIST when adding a new record:

1. First-fit Strategy– AVAIL LIST is not sorted by size; the first record

large enough is used for the new record.

2. Best-fit Strategy– List is sorted by size. The smallest record large

enough is used.

3. Worst-fit strategy– List is sorted by decreasing order of size; largest

record is used; unused space is placed in AVAIL LIST again.

Page 29: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 29

Problem 1

• Estimate the time required to reorganize a file for storage compaction. (Derive a formula in terms of the number of blocks, b, in the original file; btt; number of records n in the new file and blocking factor Bfr)

i. Reorganize the file with one disk drive, ie, read and write to the same disk

ii. Reorganize the file with two disk drives, ie, R from one W to the other, with possible R/W overlapping

Page 30: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 30

Problem 2

• Given two pile files A and B with n=100,000 records and R=400 bytes each,

• We want to create an intersection file. • Assume that 70% of the records are in

common and the available memory for this operation is 10MB.

• Calculate a timing estimate for deriving and writing the intersection file (Use s = 16 ms, r = 8.3ms, btt = 0.84ms, B=2400bytes.)

Page 31: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 31

Sorted Sequential Files• Sorted files are usually read sequentially to produce lists,

such as mailing lists, invoices.etc.• A sorted file cannot stay in order it is created after

additions, otherwise it would be very expensive!• A sorted file will have to have an overflow area for added

records. • Overflow area is not sorted, otherwise it would be very

expensive!• To find a record: Use binary search

– First look at sorted area– Then search overflow area– If there are too many overflows, the access time degenerates,

approaching to that of a sequential file.

Page 32: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 32

Searching for a record• We can do binary search (assuming fixed-length

records) in the sorted part.

Sorted part overflow

x blocks y blocks (x + y = b)

•Worst case to fetch a record :

TF = log2 x * (s + r + btt).

•If the record is not found, search the overflow area too. Thus total worst case time is:

TF = log2 x * (s + r + btt) + s + r + y * btt

Page 33: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 33

Searching for a record-2

•If the record is in the sorted part.:

TF = (log2 x-1) * (s + r + btt).

Page 34: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 34

Searching for a record-3

Total expected search time of an existing record:

TF=(x/b)*(logx –1)*(s+r+btt) +(y/b)*( logx*(s+r+btt)+ s+r+(y/2)*btt)

If y is very large compared to x, the second term can be ignored, assuming y is nearly equal to b:

TF=(x/b)*(logx –1)*(s+r+btt ) or TF= (logx –1)*(s+r+btt)

If y is very large, ignoring the rest, the search time will be:

TF=y2/(2b)*btt

Page 35: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 35

Problem 3 Given the following:

– Block size = 2400– Blocking factor=8– File size = 40M– Block transfer time (btt) = 0.84ms– Average Seek time s = 16ms– Average rotational latency r = 8.3 ms

Q1) Calculate TF for a certain recorda) in a pile fileb) in a sorted file (no overflow area)

Q2) Calculate the time to look up 10000 different data items, say names in both files.

Page 36: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 36

Fetching the next record in the order of sort key

o The next record may or may not be in the same block as the current record.

o What is the probability of the next block in the same block: 1-1/m, where m is the blocking factor, i.e., number of records in a block.

o The probability that the current record is the last record in the block is 1/m.

o Thus, the estimated time for reading the next record, assuming that it is either in the same block or in the same cylinder:

TN = (1-1/m)*0+(1/m)*(r+btt)= (1/m)*(r+btt)

Page 37: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 37

Deleting the next record in the order of sort key

If the record to be deleted or updated is any record, not involving the key field, then

TD =TU =TF+2r

o    If the update involves the key field , then a deletion and an insertion are required.

Thus, TU =TD +TI, where TI is time required the

record to be placed in the last block in the overflow area, in which case it requires reading the last block and full rotation to write it back:

TI=s+r+btt+2r

Page 38: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 38

Matching or Merging Processing

• Merging and matching (Co-sequential processing) involves the coordinated processing of two or more sequential files to produce a single output file.– Matching takes intersection of the records in

the files.– Merging takes union of the records in the files.

Page 39: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 39

Examples of applications1. Matching:

i. Finding the intersection fileii. Batch Update

• Master file – bank account info (account number, person name, balance) – sorted by account number

• Transaction file – used to update master file for certain accounts on accounts (account number, credit/debit info) – sorted by account number

2. Merging:i. Merging two class lists keeping alphabetic order.ii. Sorting large files (break into small pieces, sort each

piece and them merge them)

Page 40: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 40

Problem 4: Intersection of two sorted file

• Compute the time required to take intersection of two sorted files, with the following charactristics.

Given the following:– Block size = 2400– Blocking factor=8– File size = 40M– Block transfer time (btt) = 0.84ms– Average Seek time s = 16ms– Average rotational latency r = 8.3 ms

Page 41: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 41

Algorithm for Co-sequential Batch Update

1. Initialize pointers to the first records in the master file and the transaction file.

2. Do until pointers reach end of file:

i. Compare keys of the current records in both files

ii. Take appropriate action *

iii. Advance one (or both) of the pointes.

Page 42: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 42

Appropriate Actionif master key < transaction key

– copy master file record to the end of the new master file.

– advance master file pointer.

if master key > transaction key– if transaction is an insert

copy transaction file record to the end of new master fileelse log an error

– advance the transaction file pointer.

if master key = transaction key– If transaction is modify

copy modified master file record to the end of the new file

– If transaction is insert log an error

– If transaction is delete, do nothing

– In all three cases, advance both the master and transaction file pointer

Page 43: Fall 2005CENG 351 Data Management and File Structures1 Fundamental File Structure Concepts

Fall 2005 CENG 351 Data Management and File Structures 43

Reorganization of a sorted file• Merging the sorted part of a file with its overflow area

– Read in the overflow area blocks, sort the records in memory, if possible, merge the sorted records with sorted part of the file, into another sorted file.

– Assume that there are y blocks in the sorted area and x blocks in the overflow area, and z records in the overflow area.

• TY=x*btt+ Tsort+y*btt+(n/m)*btt• Where n is total number of records, m is the blocking

factor• Tsort is the time it takes to sort z records in the overflow

area, say (zlogz)*t, where t is the time each iteration of the sort routine takes. This time is usually in microsecond domain.