Storage Systems – Part II
27/10 - 2003
INF5070 – Media Storage and Distribution Systems


Page 1: Storage Systems –  Part II

Storage Systems – Part II

27/10 - 2003

INF5070 – Media Storage and Distribution Systems:

Page 2: Storage Systems –  Part II

2003 Carsten Griwodz & Pål Halvorsen

INF5070 – media storage and distribution systems

Overview

Data placement

Multiple disks

Managing heterogeneous disks

Prefetching

Memory caching

Page 3: Storage Systems –  Part II

Data Placement on Disk

Page 4: Storage Systems –  Part II


Data Placement on Disk

Disk blocks can be assigned to files in many ways, and several schemes are designed for:
- optimized latency
- increased throughput

Which scheme is best is access pattern dependent.

Page 5: Storage Systems –  Part II


Disk Layout

Constant angular velocity (CAV) disks:
- constant rotation speed
- equal amount of data in each track (and thus constant transfer time)

Zoned CAV disks:
- zones are ranges of tracks; typically a few zones per disk
- the different zones have different amounts of data and different bandwidth, i.e., more data and higher bandwidth on the outer tracks

Page 6: Storage Systems –  Part II


Disk Layout

The Cheetah X15.3 is a zoned CAV disk:

Zone | Cylinders per Zone | Sectors per Track | Spare Cylinders | Zone Transfer Rate (Mb/s) | Sectors per Zone | Efficiency | Formatted Capacity (Mbytes)
  0  | 3544 | 672 | 7 | 890,98 | 19014912 | 77,2% | 9735,635
  1  | 3382 | 652 | 7 | 878,43 | 17604000 | 76,0% | 9013,248
  3  | 3079 | 624 | 6 | 835,76 | 15340416 | 76,5% | 7854,293
  4  | 2939 | 595 | 6 | 801,88 | 13961080 | 76,0% | 7148,073
  5  | 2805 | 576 | 6 | 755,29 | 12897792 | 78,1% | 6603,669
  6  | 2676 | 537 | 5 | 728,47 | 11474616 | 75,5% | 5875,003
  7  | 2554 | 512 | 5 | 687,05 | 10440704 | 76,3% | 5345,641
  8  | 2437 | 480 | 5 | 649,41 |  9338880 | 75,7% | 4781,506
  9  | 2325 | 466 | 5 | 632,47 |  8648960 | 75,5% | 4428,268
 10  | 2342 | 438 | 5 | 596,07 |  8188848 | 75,3% | 4192,690

Always place often-used or high-rate data on the outermost tracks (zone 0) …!? NO – arm movement is often more important than transfer time.
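The capacity column of the zone table can be sanity-checked from the sector counts: assuming the standard 512-byte sector size, sectors-per-zone times 512 bytes reproduces the formatted capacity. A minimal sketch, with a few zones copied from the table:

```python
# Formatted capacity follows from the sector count, assuming 512-byte sectors.
SECTOR_BYTES = 512

sectors_per_zone = {0: 19014912, 1: 17604000, 10: 8188848}

for zone, sectors in sectors_per_zone.items():
    capacity_mb = sectors * SECTOR_BYTES / 1_000_000
    print(f"zone {zone:2d}: {capacity_mb:.3f} MB")
# zone  0: 9735.635 MB, zone  1: 9013.248 MB, zone 10: 4192.690 MB
```

The same exercise also makes the bandwidth difference concrete: zone 0 holds more than twice the sectors per track of zone 10, which is why its transfer rate is correspondingly higher.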

Page 7: Storage Systems –  Part II


Data Placement on Disk

Contiguous placement stores disk blocks contiguously on disk:
- minimal disk arm movement when reading the whole file (no intra-file seeks)
- possible advantage: the head need not move between read operations – no seeks or rotational delays – so transfer can approach the theoretical rate; often WRONG, as other files are read as well
- real advantage: the block (read operation) size need not be pre-determined (whatever amount is read, at most track-to-track seeks are performed)
- no inter-operation gain if disk accesses are unpredictable

(figure: files A, B and C each laid out as a contiguous run of blocks)

Page 8: Storage Systems –  Part II


Using Adjacent Sectors, Cylinders and Tracks

To avoid seek time (and possibly rotational delay), we can store data likely to be accessed together on adjacent sectors (similar to using larger blocks):
- if the track is full, use another track on the same cylinder (only another head is used – no arm movement)
- if the cylinder is full, use the next (adjacent) cylinder (a track-to-track seek)

Page 9: Storage Systems –  Part II


Data Placement on Disk

Interleaved placement tries to store blocks from a file with a fixed number of other blocks in between each block:
- minimal disk arm movement when reading files A, B and C (starting at the same time)
- fine for predictable workloads reading multiple files
- no gain if disk accesses are unpredictable

Non-interleaved (or even random) placement can be used for highly unpredictable workloads.

(figure: blocks of files A, B and C interleaved on disk)
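The interleaved layout can be sketched in a few lines (a minimal illustration with three equally long files; the file names are made up):

```python
# Interleaved placement sketch: blocks of files A, B and C are laid out
# round-robin, so three sequential readers started together each find their
# next block a fixed distance ahead of the last one they read.
files = {
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1", "B2"],
    "C": ["C0", "C1", "C2"],
}

layout = [block for group in zip(*files.values()) for block in group]
print(layout)
# ['A0', 'B0', 'C0', 'A1', 'B1', 'C1', 'A2', 'B2', 'C2']
```

Reading A, B and C simultaneously then sweeps the arm forward once over the whole layout, which is exactly the "predictable workload" case the slide describes.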

Page 10: Storage Systems –  Part II


Data Placement on Disk

Organ-pipe placement considers the usual disk head position and places the most popular data where the head most often is:
- the center of the disk is closest to the head on CAV disks
- but a bit outward for zoned CAV disks (modified organ-pipe)

(figures: block access probability vs. cylinder number for organ-pipe and modified organ-pipe placement; the probability peaks at the center cylinders, shifted outward in the modified variant)

Note: the skew depends on the tradeoff between zoned transfer time and storage capacity vs. seek time.
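Computing an organ-pipe layout can be sketched as follows (an illustrative model, ignoring capacity limits: blocks sorted by access probability are assigned to cylinders alternating outward from the center):

```python
# Organ-pipe placement sketch: the most popular block goes on the center
# cylinder; successively less popular blocks alternate left/right of it,
# minimizing the head's expected travel distance.
def organ_pipe(popularity, n_cylinders):
    order = sorted(popularity, key=popularity.get, reverse=True)
    center = n_cylinders // 2
    placement = {}
    for i, block in enumerate(order):
        offset = (i + 1) // 2        # 0, 1, 1, 2, 2, ...
        side = -1 if i % 2 else 1    # alternate sides of the center
        placement[block] = center + side * offset
    return placement

print(organ_pipe({"a": 0.5, "b": 0.3, "c": 0.2}, 9))
# {'a': 4, 'b': 3, 'c': 5}
```

For a zoned CAV disk the `center` would be shifted a bit outward, as the slide's modified organ-pipe suggests.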

Page 11: Storage Systems –  Part II


Fast File System

FFS is a general file system:
- the idea is to keep the inode and its associated blocks close (no long seeks when getting the inode and the data)
- organizes the disk into partitions – cylinder groups – each having several inodes, a free block bitmap, …
- tries to store a file within a cylinder group, allocating in this order:
  1. the next block on the same cylinder
  2. a block within the same cylinder group
  3. a block in another group, found using a hash function
  4. search all cylinder groups for a free block
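The allocation fallback can be sketched as a simple cascade (a simplified model, not the real BSD code; the data structures and names are illustrative):

```python
# FFS-style allocation fallback, sketched. free_blocks is a list of cylinder
# groups; each group maps cylinder -> list of free block numbers.
def allocate(free_blocks, group, cylinder):
    if free_blocks[group].get(cylinder):           # 1) same cylinder
        return free_blocks[group][cylinder].pop()
    for blocks in free_blocks[group].values():     # 2) same cylinder group
        if blocks:
            return blocks.pop()
    hashed = hash((group, cylinder)) % len(free_blocks)  # 3) hashed group
    for g in [hashed] + list(range(len(free_blocks))):   # 4) then all groups
        for blocks in free_blocks[g].values():
            if blocks:
                return blocks.pop()
    raise OSError("no free blocks")

disk = [{0: [], 1: []}, {0: [42, 43]}]
print(allocate(disk, 0, 0))  # group 0 is full, so a block from group 1: 43
```

Each step is more expensive than the previous one, which mirrors the slide's ordering: locality first, exhaustive search last.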

Page 12: Storage Systems –  Part II


Log-Structured File System

Log-structured placement is based on the assumptions (facts?) that:
- RAM memory is getting larger
- write operations are the most expensive
- reads can often be served from the buffer cache (!!??)

Organize disk blocks as a circular log:
- periodically, all pending (so far buffered) writes are performed as a batch
- write to the next free block regardless of content (inode, directory, data, …)
- a cleaner reorganizes holes and deleted blocks in the background
- blocks are stored contiguously when writing a single file
- efficient for small writes; other operations perform as in a traditional UNIX FS

(figure: the disk as a circular log)

Page 13: Storage Systems –  Part II


Minorca File System

Minorca is a multimedia file system (from IFI/UiO):
- enhanced allocation of disk blocks for contiguous storage of media files
- supports both continuous and non-continuous files in the same system, using different placement policies
- Multimedia-Oriented Split Allocation (MOSA) – one file system, two sections:
  - cylinder group sections (CGSs) for non-continuous files:
    - like traditional BSD FFS disk partitions
    - small block sizes (like 4 or 8 KB)
    - traditional FFS operations
  - extent sections for continuous files:
    - extents contain one or more (adjacent) CGSs: summary information, allocation bitmap, data block area
    - each extent is expected to store one media file
    - large block sizes (e.g., 64 KB)
    - new "transparent" file operations; create files using O_CREATEXT

(figure: the disk divided into cylinder groups followed by a series of extents)

Page 14: Storage Systems –  Part II


Minorca File System

Count-augmented address indexing in the extent section:
- observation: indirect block reads introduce extra disk I/O and break access locality (e.g., w.r.t. the inode)
- introduce a new inode structure: add a counter field to the original direct entries – direct points to a disk block, and count indicates how many other blocks follow the first block (contiguously)
- if contiguous allocation is assured, each direct entry can address many more blocks without retrieving an additional indirect block

(figure: an inode with attributes, direct entries 0–11 each paired with a count field, plus single, double and triple indirect pointers)
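The effect of count-augmented entries can be sketched as a lookup over (first block, count) runs (names are illustrative, not Minorca's actual structures): a logical block number resolves without touching an indirect block as long as it falls inside a direct run.

```python
# Resolve a logical block number through count-augmented direct entries:
# each entry is (first_physical_block, count) for one contiguous run.
def resolve(extents, logical_block):
    for first, count in extents:
        if logical_block < count:
            return first + logical_block
        logical_block -= count
    raise ValueError("offset beyond the direct entries")

# Two runs: 10 blocks starting at block 1000, then 5 blocks at block 2000.
print(resolve([(1000, 10), (2000, 5)], 3))   # 1003
print(resolve([(1000, 10), (2000, 5)], 12))  # 2002
```

With plain direct entries the second lookup would already require an indirect block; here two entries cover 15 blocks.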

Page 15: Storage Systems –  Part II


Other File System Examples

Contiguous allocation:
- Presto – similar to Minorca extents for continuous files; doesn't support small, discrete files
- Fellini – a simple flat file system; maintains a free block list, grouping contiguous blocks
- Continuous Media File System
- ...

Several systems use multiple disks and stripe data:
- Symphony
- Tiger Shark
- Tiger
- ...

Page 16: Storage Systems –  Part II

Multiple Disks

Page 17: Storage Systems –  Part II


Multiple Disks

Disk controllers and buses manage several devices.

One can improve total system performance by replacing one large disk with many small ones accessed in parallel: several independent heads can read simultaneously (if the other parts of the system can manage the speed).

Note: the single disk might be faster, but as seek time and rotational delay are the dominant factors of total disk access time, the smaller disks might together operate faster by performing seeks in parallel…

Page 18: Storage Systems –  Part II


Striping

Another reason to use multiple disks is when one disk cannot deliver the requested data rate. In such a scenario, one might use several disks for striping:
- disk bandwidth: Bdisk; required bandwidth: Bdisplay
- Bdisplay > Bdisk
- read from n disks in parallel: n · Bdisk > Bdisplay
- clients are serviced in rounds

Advantages:
- high data rates
- higher transfer rate compared to one disk

Drawbacks:
- can't serve multiple clients in parallel
- positioning time increases (i.e., reduced efficiency)

(figure: a server striping each stream over all disks for clients 1–5)
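The bandwidth condition n · Bdisk ≥ Bdisplay directly gives the number of disks to stripe over; a one-line sketch (the rates in the example are illustrative):

```python
import math

# Smallest n such that n * b_disk covers the display rate.
def disks_needed(b_display, b_disk):
    return math.ceil(b_display / b_disk)

print(disks_needed(400, 150))  # a 400 Mbit/s stream on 150 Mbit/s disks: 3
```

In practice one would also leave headroom for the increased positioning overhead the slide mentions, so the real n may be larger.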

Page 19: Storage Systems –  Part II


Interleaving (Compound Striping)

Full striping is usually not necessary today:
- faster disks
- better compression algorithms

Interleaving lets each client be serviced by only a subset of the available disks:
- make groups
- "stripe" data in such a way that consecutive requests arrive at the next group (here, each disk is a group)

(figure: a server serving clients 1–3, each from a different disk)

Page 20: Storage Systems –  Part II


Interleaving (Compound Striping)

Divide the traditional striping group into sub-groups, e.g., staggered striping.

Advantages:
- multiple clients can still be served in parallel
- more efficient use of the disks
- potentially shorter response time

Drawbacks:
- load balancing (all clients access the same group)

(figure: sub-blocks X0,0/X0,1 … X3,0/X3,1 staggered across the disks)

Page 21: Storage Systems –  Part II


Mirroring

With multiple disks, we might come into the situation where all requests are for one of the disks while the rest lie idle.

In such cases, it might make sense to have replicas of the data on several disks – if we have identical disks, it is called mirroring.

Advantages:
- faster response time
- survives crashes – fault tolerance
- load balancing, by dividing requests for the same data equally among the mirrored disks

Drawbacks:
- increases storage requirements and the cost of write operations

Page 22: Storage Systems –  Part II


Redundant Array of Inexpensive Disks

The various RAID levels define different disk organizations to achieve higher performance and more reliability:
- RAID 0 – striped disk array without fault tolerance (non-redundant)
- RAID 1 – mirroring
- RAID 2 – memory-style error correcting code (Hamming code ECC)
- RAID 3 – bit-interleaved parity
- RAID 4 – block-interleaved parity
- RAID 5 – block-interleaved distributed parity
- RAID 6 – independent data disks with two independent distributed parity schemes (P+Q redundancy)
- RAID 7, RAID 10, RAID 53, RAID 1+0

Page 23: Storage Systems –  Part II


Redundant Array of Inexpensive Disks

RAID is intended ... ... for general systems ... to give higher throughput ... to be fault tolerant

For multimedia systems, some requirements are still missing: low latency guaranteed response time optimizations for linear access to large objects optimizations for cyclic operations …

Page 24: Storage Systems –  Part II


Replication

In traditional RAID systems, replication is often used for fault tolerance (and for higher performance in the newer combined levels).

Replication in multimedia systems is used for:
- reducing hot spots
- increased scalability
- higher performance
- … and fault tolerance is often a side effect

Replication in multimedia scenarios should be:
- based on observed load
- changed dynamically as popularity changes

Page 25: Storage Systems –  Part II


Dynamic Segment Replication (DSR)

DSR tries to balance load by dynamically replicating hot data:
- assumes read-only, VoD-like retrieval
- predefines a load threshold for when to replicate a segment, by examining current and expected load
- uses copyback streams
- replicates when the threshold is reached – but which segment, and where?
- tries to find a lightly loaded device, based on future-load calculations
- not necessarily the segment that receives the additional requests (another segment may have more requests)
- replicates based on a payoff factor p (replicate the segment x with the highest p):

      p_i = (1/r_i − 1/(r_i + 1)) · Σ_{j=0}^{i−1} w^(i−j−1) · n_j

  where r_i is the number of replicas of segment i, (1/r_i − 1/(r_i + 1)) is the factor for the expected benefit of an additional copy, n_j is the number of viewers of segment j, w is a weighting factor, and the sum considers the number of future viewers for this segment.
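The payoff factor can be sketched as follows (a hedged reconstruction from the slide's labels – n_j viewers of segment j, w a weighting factor, r_i replicas of segment i; the exact index bounds are an assumption, not taken from a verified source):

```python
# Payoff for replicating segment i: the marginal benefit of one more replica,
# 1/r_i - 1/(r_i + 1), weighted by the discounted number of viewers in
# earlier segments, who will reach segment i in the future.
def payoff(i, viewers, replicas, w):
    benefit = 1 / replicas[i] - 1 / (replicas[i] + 1)
    future_viewers = sum(w ** (i - j - 1) * viewers[j] for j in range(i))
    return benefit * future_viewers

# Segment 2 with one replica, viewers [5, 3] upstream, weighting factor 0.9:
print(payoff(2, [5, 3, 0], [1, 1, 1], 0.9))
```

The shape of the formula captures the intuition either way: a segment with many upstream viewers and few replicas gets a high payoff, so it is replicated first.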

Page 26: Storage Systems –  Part II


Some Challenges in Managing Multiple Disks

- How large should a stripe group and a stripe unit be?
- Can one avoid hot sets of disks (load imbalance)?
- What and when to replicate?
- How to handle heterogeneous disks?

Page 27: Storage Systems –  Part II

Heterogeneous Disks

Page 28: Storage Systems –  Part II


File Placement

A multimedia file might be stored (striped) on multiple disks, but how should one choose the devices?
- storage devices are limited by both bandwidth and space
- we have hot (frequently viewed) and cold (rarely viewed) files
- we may have several heterogeneous storage devices

The objective of a file placement policy is to achieve maximum utilization of both bandwidth and space, and hence efficient usage of all devices, by avoiding load imbalance:
- must consider the expected load and storage requirement
- should a file be replicated?
- the expected load may change over time

Page 29: Storage Systems –  Part II


Bandwidth-to-Space Ratio (BSR) – I

BSR attempts to mix hot and cold as well as large and small multimedia objects on heterogeneous devices:
- don't optimize placement based on throughput or space only
- BSR considers both the required storage space and the throughput requirement (which depends on playout rate and popularity) to achieve the best combined device utilization

(figure: a media object has a space and a bandwidth requirement; bandwidth may vary with popularity. A disk whose bandwidth-to-space ratio deviates from the object's wastes either space or bandwidth.)

Page 30: Storage Systems –  Part II


Bandwidth-to-Space Ratio (BSR) – II

The BSR policy algorithm:
- input: space and bandwidth requirements
- phase 1: find a device to place the media object according to BSR; if no device, or stripe of devices, can give sufficient space or bandwidth, then add replicas
- phase 2: find devices for the needed replicas
- phase 3: allocate the expected load on the replica devices according to the BSR of the devices
- phase 4: if not enough resources are available, see whether other media objects can delete replicas according to their current workload

All phases may be needed when adding a new media object or increasing the workload – for a decrease, only phase 3 (reallocation) is needed.

Popular, high data rate movies should be on high bandwidth disks.

Page 31: Storage Systems –  Part II


Disk Grouping

Disk grouping is a technique to "stripe" (or fragment) data over heterogeneous disks:
- groups heterogeneous physical disks into homogeneous logical disks
- the amount of data on each disk (the fragments) is determined so that the service time (based on worst-case seeks) is equal for all physical disks in a logical disk
- blocks for an object are placed (and read) on logical disks in a round-robin manner – all disks in a group are activated simultaneously

(figure: physical disks 0–3 form logical disks 0 and 1; block X0 is split into fragments X0,0 and X0,1 read in parallel, so X0, X1, … become ready for display in turn)

Page 32: Storage Systems –  Part II


Staggered Disk Grouping

Staggered disk grouping is a variant of disk grouping minimizing the memory requirement:
- reading and playout proceed differently – not all fragments of a logical block are needed at the same time
- the first (and largest) fragment goes on the most powerful disk, etc.
- fragments are read sequentially (later fragments need not be buffered for a long time)
- display starts when the largest fragment is read

(figure: as for disk grouping, but the fragments X0,0, X1,0, X2,0 become ready for display one after another as they are read)

Page 33: Storage Systems –  Part II


Disk Merging

Disk merging forms logical disks from capacity fragments of physical disks:
- all logical disks are homogeneous
- supports an arbitrary mix of heterogeneous disks (grouping needs equal groups)
- starts by choosing how many logical disks the slowest device shall support (e.g., 1 for disks 1 and 3) and calculates the corresponding number for the more powerful devices (e.g., 1.5 for disks 0 and 2, if these disks are 1.5 times better)
- the most powerful scheme: most flexible (arbitrary mix of devices) and can be adapted to zoned disks (each zone considered as a disk)

(figure: five logical disks 0–4 mapped onto four physical disks; block X2 spans two physical devices as fragments X2,0 and X2,1, and X0–X4 become ready for display in turn)
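The logical-disk calculation can be sketched directly from the slide's example (the bandwidth numbers are illustrative):

```python
# Each physical disk supports logical disks in proportion to its performance
# relative to the slowest device, which is assigned exactly 1.0.
def logical_disks_per_device(performance):
    slowest = min(performance)
    return [p / slowest for p in performance]

# Disks 0 and 2 are 1.5x better than disks 1 and 3:
print(logical_disks_per_device([1.5, 1.0, 1.5, 1.0]))   # [1.5, 1.0, 1.5, 1.0]
print(sum(logical_disks_per_device([1.5, 1.0, 1.5, 1.0])))  # 5.0 logical disks
```

The total of 5 matches the slide's figure with logical disks 0–4; the fractional 0.5 shares are what force some blocks (like X2) to span two physical devices.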

Page 34: Storage Systems –  Part II

Prefetching and Buffering

Page 35: Storage Systems –  Part II


Prefetching

If we can predict the access pattern, one might speed up performance using prefetching:
- a video playout is often linear: easy to predict the access pattern
- eases disk scheduling
- read larger amounts of data per request
- data is in memory when requested – reducing page faults

One simple (and efficient) way of doing prefetching is read-ahead:
- read more than the requested block into memory
- serve the next read requests from the buffer cache

Another way of doing prefetching is double (multiple) buffering:
- read data into the first buffer
- process data in the first buffer and at the same time read data into the second buffer
- process data in the second buffer and at the same time read data into the first buffer
- etc.

Page 36: Storage Systems –  Part II


Multiple Buffering

Example: we have a file with block sequence B1, B2, ... and our program processes the data sequentially, i.e., B1, B2, ...

Single buffer solution:
- read B1 → buffer
- process data in buffer
- read B2 → buffer
- process data in buffer
- ...

If P = time to process a block, R = time to read in one block, and n = number of blocks:

    single buffer time = n (P + R)

(figure: disk reads and in-memory processing strictly alternate)

Page 37: Storage Systems –  Part II


Multiple Buffering

Double buffer solution:
- read B1 → buffer1
- process data in buffer1, read B2 → buffer2
- process data in buffer2, read B3 → buffer1
- process data in buffer1, read B4 → buffer2
- ...

If P = time to process a block, R = time to read in one block, and n = number of blocks:

    if P ≥ R: double buffer time = R + nP

If P < R, we can try to add buffers (n-buffering).

(figure: disk reads overlap with processing in the other buffer)
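The two timing formulas can be checked with a couple of lines (P, R and n as defined above; the example values are made up):

```python
# Playout-time formulas from the slides: P = processing time per block,
# R = read time per block, n = number of blocks.
def single_buffer_time(n, P, R):
    return n * (P + R)                  # read and process strictly alternate

def double_buffer_time(n, P, R):
    # valid when P >= R: every read after the first hides behind processing
    assert P >= R, "with P < R, more buffers (n-buffering) are needed"
    return R + n * P

print(single_buffer_time(100, 2, 1))    # 300 time units
print(double_buffer_time(100, 2, 1))    # 201 time units
```

With P ≥ R the saving approaches the full read time n·R, since only the very first read is exposed.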

Page 38: Storage Systems –  Part II

Memory Caching

Page 39: Storage Systems –  Part II


Is Caching Useful in a Multimedia Scenario?

High rate data may need lots of memory for caching…

Tradeoff: amount of memory, algorithm complexity, gain, …

Cache only frequently used data – how? (e.g., only the first (small) parts of a broadcast partitioning scheme, allow "top-ten" titles only, …)

Buffer vs. rate – playback time held by a given amount of memory:

Memory | 160 Kbps (e.g., MP3) | 1.4 Mbps (e.g., uncompressed CD) | 3.5 Mbps (e.g., average DVD video) | 100 Mbps (e.g., uncompressed HDTV)
100 MB | 85 min 20 s          | 9 min 31 s                       | 3 min 49 s                         | 8 s
1 GB   | 14 hr 33 min 49 s    | 1 hr 37 min 31 s                 | 39 min 01 s                        | 1 min 22 s
16 GB  | 133 hr 01 min 01 s   | 26 hr 00 min 23 s                | 10 hr 24 min 09 s                  | 21 min 51 s
32 GB  | 266 hr 02 min 02 s   | 52 hr 00 min 46 s                | 20 hr 48 min 18 s                  | 43 min 41 s

(32 GB was the maximum amount of memory a Dell server could manage in 2002 – and all of it is NOT used for caching.)
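The table's durations follow from size/rate. The values line up if binary prefixes are used throughout (an inference from the numbers, not stated on the slide: 1 MB = 2^20 bytes, 1 Kbps = 2^10 bit/s, 1 Mbps = 2^20 bit/s):

```python
# How long a cache of a given size lasts at a given playout rate,
# with binary units for both size and rate.
def seconds_held(cache_mbytes, rate_mbps):
    return cache_mbytes * 2**20 * 8 / (rate_mbps * 2**20)

print(seconds_held(100, 160 / 1024))  # 160 Kbps MP3: 5120.0 s = 85 min 20 s
print(seconds_held(100, 100))         # 100 Mbps HDTV: 8.0 s
```

This makes the slide's point quantitative: even the largest 2002-era server memory holds under an hour of uncompressed HDTV.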

Page 40: Storage Systems –  Part II


Need for Special "Multimedia Algorithms"?

Most existing systems use an LRU variant, e.g.:
- keep a sorted list
- replace the first element in the list
- insert new data elements at the end
- if a data element is re-accessed, move it back to the end of the list

Extreme example – video frame playout with an LRU buffer, listed from longest time since access to shortest:

- play video (7 frames):            1 2 3 4 5 6 7
- rewind and restart playout at 1:  7 6 5 4 3 2 | 1
- playout 2:                        1 7 6 5 4 3 | 2
- playout 3:                        2 1 7 6 5 4 | 3
- playout 4:                        3 2 1 7 6 5 | 4

In this case, LRU replaces the next needed frame. So the answer is in many cases YES…
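The pathology can be demonstrated in a few lines (sketched with an 8-frame clip and a 7-frame buffer, so that evictions actually occur):

```python
# With cyclic access and a buffer one frame smaller than the clip,
# LRU always evicts exactly the frame that will be needed next.
from collections import OrderedDict

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()          # oldest access first

    def access(self, frame):
        if frame in self.cache:
            self.cache.move_to_end(frame)
            return True                     # hit
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        self.cache[frame] = True
        return False                        # miss

buf = LRUBuffer(7)
for f in range(1, 9):                       # play an 8-frame clip once
    buf.access(f)
hits = [buf.access(f) for f in range(1, 9)] # rewind and replay
print(hits)  # all False: every replayed frame was evicted just before use
```

A policy aware of the playout direction (like L/MRP below in spirit) would keep exactly the frames LRU throws away.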

Page 41: Storage Systems –  Part II


"Classification" of Mechanisms

Block-level caching considers a (possibly unrelated) set of blocks:
- each data element is viewed as an independent item
- usually used in "traditional" systems, e.g., FIFO, LRU, CLOCK, …
- multimedia approaches: Least/Most Relevant for Presentation (L/MRP), …

Stream-dependent caching considers a stream object as a whole:
- related data elements are treated in the same way
- research prototypes in multimedia systems, e.g.: BASIC, DISTANCE, Interval Caching (IC), Generalized Interval Caching (GIC), Split and Merge (SAM), SHR

Page 42: Storage Systems –  Part II


Least/Most Relevant for Presentation (L/MRP)

L/MRP is a buffer management mechanism for a single interactive, continuous data stream [Moser et al. 95]:
- adaptable to individual multimedia applications
- supports pre-loading, i.e., prefetching data from disk
- replaces the least relevant pages with regard to the current playout of the multimedia stream

(figure: COPUs – continuous object presentation units – numbered 10–26 around the current presentation point; each COPU gets a relevance value between 0 and 1.0 depending on whether it is referenced, history or skipped, decreasing with distance in the playback direction)

Page 43: Storage Systems –  Part II


Least/Most Relevant for Presentation (L/MRP)

L/MRP …
- … gives "few" disk accesses (compared to other schemes)
- … supports interactivity
- … supports prefetching
- … but is targeted at single streams (users)
- … and is expensive (!) to execute (relevance values are calculated for all COPUs each round)

Variations:
- Q-L/MRP – extends L/MRP with multiple streams and changes the prefetching mechanism (reduces overhead) [Halvorsen et al. 98]
- MPEG-L/MRP – gives different relevance values to different MPEG frame types [Boll et al. 00]

Page 44: Storage Systems –  Part II


Interval Caching (IC)

Interval caching (IC) is a caching strategy for streaming servers:
- caches data between requests for the same video stream – based on the playout intervals between requests
- following requests are thus served from the cache (not from disk), which is filled by the preceding stream
- intervals are sorted by length; the buffer requirement is the data size of the interval
- to maximize the cache hit ratio (minimize disk accesses), the shortest intervals are cached first

(figure: streams S11–S13 on video clip 1, S21–S22 on clip 2 and S31–S34 on clip 3 form intervals I11, I12, I21, I31–I33, ordered by increasing length)
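Interval selection can be sketched as follows (an illustrative model with abstract playout positions, not a server implementation):

```python
# Interval caching sketch: each pair of consecutive streams on the same clip
# forms an interval whose size is the distance between their playout
# positions; caching the shortest intervals first serves the most followers
# from a given amount of memory.
def cached_intervals(streams, cache_size):
    """streams: {clip: sorted list of playout positions (leader last)}."""
    intervals = []
    for clip, positions in streams.items():
        for follower, leader in zip(positions, positions[1:]):
            intervals.append((leader - follower, clip, follower, leader))
    intervals.sort()                       # shortest (cheapest) interval first
    chosen, used = [], 0
    for size, clip, follower, leader in intervals:
        if used + size <= cache_size:
            chosen.append((clip, follower, leader))
            used += size
    return chosen

streams = {"clip1": [10, 40, 45], "clip2": [0, 100]}
print(cached_intervals(streams, 50))
# [('clip1', 40, 45), ('clip1', 10, 40)] – the 100-unit interval doesn't fit
```

Each cached interval converts one follower's disk reads into cache hits, so greedily packing short intervals maximizes the number of streams served from memory.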

Page 45: Storage Systems –  Part II


Generalized Interval Caching (GIC)

Interval caching (IC) does not work for short clips: a frequently accessed short clip will not be cached. GIC generalizes the IC strategy:
- manages intervals for long video objects as IC does
- for short intervals, it extends the interval definition:
  - keep track of a finished stream for a while after its termination
  - define the interval for a short stream as the length between the new stream and the position the old stream would have had if it had been a longer video object
  - the cache requirement is, however, only the real requirement
- caches the shortest intervals, as in IC

(figure: on short clip 1, the interval I11 between S12 and the terminated S11 extends past the end of the clip, but only the real part C11 must be cached; on clip 2, I21 is an ordinary interval between S22 and S21)

Page 46: Storage Systems –  Part II


Generalized Interval Caching (GIC)

Open function:

    form, if possible, a new interval with the previous stream;
    if (NO) { exit }                       /* don't cache */
    compute interval size and cache requirement;
    reorder interval list;                 /* smallest first */
    if (not already in a cached interval) {
        if (space available) { cache interval }
        else if (larger cached intervals exist
                 and sufficient memory can be released) {
            release memory from larger intervals;
            cache new interval;
        }
    }

Close function:

    if (not following another stream) { exit }   /* not served from cache */
    delete interval with preceding stream;
    free memory;
    if (next interval can be cached in released memory) {
        cache next interval
    }

Page 47: Storage Systems –  Part II

The End: Summary

Page 48: Storage Systems –  Part II


Summary

Much work has been performed to optimize disk performance.

For multimedia streams, ...
- time-aware scheduling is important
- use large block sizes, or read many contiguous blocks
- prefetch data from disk to memory for hiccup-free playout
- striping might not be necessary on new disks (at least not on all disks)
- replication on multiple disks can offload a hot set of disks
- memory caching can save disk I/Os, but it might not be worth the effort
- ...

BUT, new disks are "smart"; we cannot fully control the device.

Page 49: Storage Systems –  Part II


Some References
1. Advanced Computer & Network Corporation: “RAID.edu”, http://www.raid.com/04_00.html, 2002
2. Boll, S., Heinlein, C., Klas, W., Wandel, J.: “MPEG-L/MRP: Adaptive Streaming of MPEG Videos for Interactive Internet Applications”, Proceedings of the 6th International Workshop on Multimedia Information Systems (MIS’00), Chicago, USA, October 2000, pp. 104-113
3. Halvorsen, P., Goebel, V., Plagemann, T.: “Q-L/MRP: A Buffer Management Mechanism for QoS Support in a Multimedia DBMS”, Proceedings of the 1998 IEEE International Workshop on Multimedia Database Management Systems (IW-MMDBMS’98), Dayton, OH, USA, August 1998, pp. 162-171
4. Moser, F., Kraiss, A., Klas, W.: “L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS”, Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995
5. Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: “Operating System Support for Multimedia Systems”, Computer Communications, Vol. 23, No. 3, February 2000, pp. 267-289
6. Sitaram, D., Dan, A.: “Multimedia Servers – Applications, Environments, and Design”, Morgan Kaufmann Publishers, 2000
7. Zimmermann, R., Ghandeharizadeh, S.: “Continuous Display Using Heterogeneous Disk-Subsystems”, Proceedings of the 5th ACM International Multimedia Conference, Seattle, WA, November 1997