
UNIT-5 DISTRIBUTED & PARALLEL DATABASES

5.1 CONCURRENCY CONTROL

In a multiprogramming environment where multiple transactions can be executed simultaneously, it is highly important to control the concurrency of transactions. Concurrency control manages simultaneous access to a database. It prevents two users from editing the same record at the same time and also serializes transactions for backup and recovery.

5.1.1 Potential Problems of Concurrency: Here are some issues you are likely to face when transactions run concurrently without adequate control:

Lost Update: occurs when multiple transactions select the same row and update it based on the value originally read, so one update overwrites another.

Uncommitted Dependency (Dirty Read): occurs when a second transaction selects a row that has been updated, but not yet committed, by another transaction.

Non-Repeatable Read: occurs when a transaction reads the same row several times and gets different data each time.

Incorrect Summary: occurs when one transaction computes a summary (aggregate) over all instances of a repeated data item while a second transaction updates some of those instances. In that situation, the resulting summary does not reflect a correct result.

5.1.2 Concurrency Control Protocols: These can be broadly divided into the following categories:
1. Lock based protocols / Locking
2. Timestamp based protocols
3. Optimistic methods

1. Lock based protocols/ Locking: A database lock is used to “lock” some data in a database so that only one database user/session may update that particular data. Database locks exist to prevent two or more database users from updating the same piece of data at the same time. When data is locked, no other database session can update that data until the lock is released. Locks are usually released by either a ROLLBACK or a COMMIT SQL statement. Multiple transactions may request a lock on a data item simultaneously; hence, we require a mechanism to manage the locking requests made by transactions. Such a mechanism is called a Lock Manager.

Need for locking: In a multiprogramming environment where multiple transactions can be executed simultaneously, it is highly important to control the concurrency of transactions. Concurrency control protocols ensure the atomicity, isolation, and serializability of concurrent transactions.

Lock Granularity: A database is basically represented as a collection of named data items. The size of the data item chosen as the unit of protection by a concurrency control program is called granularity. Locking can take place at the following levels.

i. Database level Locking: At database level locking, the entire database is locked. Thus, it prevents transaction T2 from using any table in the database while transaction T1 is being executed. Database level locking is suitable for batch processing but, being very slow, it is unsuitable for on-line multi-user DBMSs.

Database Management System Unit-5: Distributed & Parallel Databases

Prepared By: Mr. V. K. Wani

ii. Table level Locking: At table level locking, the entire table is locked. Thus, it prevents access to any row (tuple) by transaction T2 while transaction T1 is using the table. If a transaction requires access to several tables, each table may be locked. However, two transactions can access the same database as long as they access different tables. Table level locking is less restrictive than database level locking, but table level locks are still not suitable for multi-user DBMSs.

iii. Page level Locking: At page level locking, an entire disk page (or disk block) is locked. A page has a fixed size such as 4 KB, 8 KB, 16 KB, or 32 KB. A table can span several pages, and a page can contain several rows (tuples) of one or more tables. Page level locking is well suited to multi-user DBMSs.

iv. Row (Tuple) level Locking: At row level locking, a particular row (or tuple) is locked. A lock exists for each row in each table of the database. The DBMS allows concurrent transactions to access different rows of the same table, even if the rows are located on the same page. A row level lock is much less restrictive than database level, table level, or page level locks, and it improves the availability of data. However, managing row level locks imposes a high overhead cost.

v. Attribute (field) level Locking: At attribute level locking, a particular attribute (or field) is locked. Attribute level locking allows concurrent transactions to access the same row, as long as they use different attributes within the row. The attribute level lock yields the most flexible multi-user data access, but it requires the highest computer overhead.

Types of Locks: The DBMS mainly uses following types of locking techniques.

a. Binary Locking: A binary lock can have two states or values: locked and unlocked (or 1 and 0, for simplicity). A distinct lock is associated with each database item X. If the value of the lock on X is 1, item X cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0, the item can be accessed when requested. We refer to the current value (or state) of the lock associated with item X as LOCK(X). Two operations, lock_item and unlock_item, are used with binary locking.
Lock_item(X): A transaction requests access to an item X by first issuing a lock_item(X) operation. If LOCK(X) = 1, the transaction is forced to wait. If LOCK(X) = 0, it is set to 1 (the transaction locks the item) and the transaction is allowed to access item X.
Unlock_item(X): When the transaction is through using the item, it issues an unlock_item(X) operation, which sets LOCK(X) to 0 (unlocks the item) so that X may be accessed by other transactions.
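The lock_item/unlock_item operations above can be sketched as follows. This is a minimal, illustrative single-session model, not a real DBMS lock manager (which would also queue waiting transactions); the class and method names are assumptions for illustration.

```python
class BinaryLock:
    """Sketch of a binary lock table: LOCK(X) is 1 (locked) or 0 (unlocked)."""

    def __init__(self):
        self.lock = {}  # item name -> 0 or 1

    def lock_item(self, x):
        # If LOCK(X) = 1 the requester must wait; here we simply report failure.
        if self.lock.get(x, 0) == 1:
            return False          # transaction would be forced to wait
        self.lock[x] = 1          # LOCK(X) := 1, access granted
        return True

    def unlock_item(self, x):
        self.lock[x] = 0          # LOCK(X) := 0, other transactions may proceed


lm = BinaryLock()
assert lm.lock_item("X") is True    # T1 locks X
assert lm.lock_item("X") is False   # T2 must wait
lm.unlock_item("X")                 # T1 releases X
assert lm.lock_item("X") is True    # now T2 can lock X
```

Note that a binary lock is very restrictive: even two readers exclude each other, which motivates the shared/exclusive scheme below.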

b. Shared / Exclusive Locking:

Shared lock: Shared locks exist when two transactions are granted read access. One transaction gets the shared lock on the data, and when a second transaction requests the same data it is also given a shared lock. Both transactions are in read-only mode; updating the data is not allowed until the shared lock is released. These locks are referred to as read locks and denoted by 'S'. If a transaction T has obtained a shared lock on data item X, then T can read X but cannot write X. Multiple shared locks can be placed on a data item simultaneously, so the lock can be granted to multiple transactions at once, allowing all of them to read.

Exclusive lock: When a statement modifies data, its transaction holds an exclusive lock on that data, which prevents other transactions from accessing it. This lock remains in place until the transaction holding it issues a commit or rollback. Holding exclusive locks at a coarse granularity (such as table level) lowers concurrency in a multi-user system. These locks are referred to as write locks and denoted by 'X'. If a transaction T has obtained an exclusive lock on data item X, then T can both read and write X. Only one exclusive lock can be placed on a data item at a time, so multiple transactions cannot modify the same data simultaneously; the lock is granted to a single transaction, which may both read and write.
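The shared/exclusive rules above amount to a small compatibility matrix: a request can be granted only if it is compatible with every lock other transactions already hold on the item. A minimal sketch (the function name is an assumption for illustration):

```python
# Lock compatibility: may a new request on an item be granted, given the modes
# other transactions already hold on that item?
COMPATIBLE = {
    ("S", "S"): True,   # many readers may share an item
    ("S", "X"): False,  # a writer must wait for readers
    ("X", "S"): False,  # readers must wait for a writer
    ("X", "X"): False,  # only one writer at a time
}

def can_grant(requested, held_modes):
    """held_modes: lock modes other transactions currently hold on the item."""
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

assert can_grant("S", ["S", "S"]) is True   # shared locks coexist
assert can_grant("X", ["S"]) is False       # exclusive conflicts with shared
assert can_grant("X", []) is True           # free item: exclusive lock granted
```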

c. Two-Phase Locking (2PL): In databases and transaction processing, 2PL is a concurrency control method that guarantees serializability. The protocol uses locks, applied by a transaction to data, which may block other transactions from accessing the same data during the transaction's life. Under the 2PL protocol, locks are applied and removed in two phases:

1. Expanding (growing) phase: locks are acquired and no locks are released.
2. Shrinking phase: locks are released and no new locks are acquired; once a transaction releases a lock, it cannot obtain any new lock.

The following transaction illustrates the Two-Phase Locking technique:

Time   Transaction     Remarks
t0     Lock-X(A)       acquire exclusive lock on A
t1     Read A          read original value of A
t2     A = A - 100     subtract 100 from A
t3     Write A         write new value of A
t4     Lock-X(B)       acquire exclusive lock on B
t5     Read B          read original value of B
t6     B = B + 100     add 100 to B
t7     Write B         write new value of B
t8     Unlock(A)       release lock on A (shrinking phase begins)
t9     Unlock(B)       release lock on B
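The two-phase rule can be sketched as a small guard inside a transaction object: the first unlock flips the transaction into its shrinking phase, after which any new lock request is a protocol violation. This is an illustrative model (the class and method names are assumptions), not a real DBMS API.

```python
class Transaction2PL:
    """Sketch of the two-phase locking discipline for a single transaction."""

    def __init__(self):
        self.held = set()
        self.shrinking = False   # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: no new locks in shrinking phase")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True    # first release starts the shrinking phase
        self.held.discard(item)


t = Transaction2PL()
t.lock("A")
t.lock("B")            # expanding phase
t.unlock("A")          # shrinking phase begins
try:
    t.lock("C")        # illegal under 2PL
    violated = False
except RuntimeError:
    violated = True
assert violated
```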

2. Time-Stamp Methods for Concurrency Control: A timestamp is a unique identifier created by the DBMS to identify the relative starting time of a transaction. Typically, timestamp values are assigned in the order in which the transactions are submitted to the system, so a timestamp can be thought of as the transaction start time. Time stamping is therefore a method of concurrency control in which each transaction is assigned a transaction timestamp, and the older transaction is always given priority. This is one of the most commonly used concurrency protocols. Timestamps must have two properties:
Uniqueness: assures that no two timestamp values can be equal.
Monotonicity: assures that timestamp values always increase.
Example: Suppose there are three transactions T1, T2, and T3. T1 entered the system at time 0010, T2 at 0020, and T3 at 0030. Priority will be given to transaction T1, then transaction T2, and lastly transaction T3.
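The two required properties can be obtained from a monotonically increasing counter, as this minimal sketch shows (the helper names are assumptions for illustration):

```python
import itertools

# A monotonically increasing counter gives unique, ever-growing timestamps,
# satisfying both required properties at once.
_counter = itertools.count(1)

def new_timestamp():
    return next(_counter)

def older(t1_ts, t2_ts):
    # The older transaction (smaller timestamp) is given priority in a conflict.
    return t1_ts if t1_ts < t2_ts else t2_ts


ts1, ts2, ts3 = new_timestamp(), new_timestamp(), new_timestamp()
assert ts1 < ts2 < ts3              # monotonicity
assert len({ts1, ts2, ts3}) == 3    # uniqueness
assert older(ts1, ts2) == ts1       # T1 has priority over T2
```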

Disadvantages:


Each value stored in the database requires two additional timestamp fields: one for the last time the field was read and one for the last update. This increases the memory requirements and the processing overhead of the database. Starvation is possible if the same transaction is repeatedly restarted and aborted.

3. Optimistic Concurrency Control Algorithm: The optimistic method of concurrency control is based on the assumption that conflicts between database operations are rare, and that it is better to let transactions run to completion and only check for conflicts before they commit. Optimistic concurrency control methods are also known as validation or certification methods. No checking is done while the transaction is executing, and neither locking nor time stamping is required; instead, a transaction executes without restriction until it commits. In this approach, a transaction's life cycle is divided into the following three phases:

Execution Phase (also called the Read Phase): The transaction fetches data items into memory and performs operations on them. All update operations of the transaction are recorded in a temporary update file, which is not accessed by other transactions. The system also records the set of transactions that have finished their read phases since the start of the transaction being validated.

Validation Phase (also called the Certification Phase): The transaction performs checks to ensure that committing its changes to the database passes the serializability test. If conflicts are detected in this phase, the transaction is aborted and restarted. The validation algorithm must check that the transaction has seen all modifications of transactions committed after it started, and has not read granules updated by a transaction committed after its start.

Commit Phase (also called the Write Phase): Changes are permanently applied to the database and the updated granules are made public. Otherwise, the updates are discarded and the transaction is restarted. This phase applies only to read-write transactions, not to read-only transactions.

This algorithm uses three rules to enforce serializability in validation phase:

Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing, then Ti's execution phase cannot overlap with Tj's commit phase. Tj can commit only after Ti has finished execution.

Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then Ti's commit phase cannot overlap with Tj's execution phase. Tj can start executing only after Ti has already committed.

Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also writing, then Ti's commit phase cannot overlap with Tj's commit phase. Tj can start to commit only after Ti has already committed.

Advantages of Optimistic Methods for Concurrency Control:

This technique is very efficient when conflicts are rare. The occasional conflicts result in the transaction roll back.


The rollback involves only the local copy of data, the database is not involved and thus there will not be any cascading rollbacks.

Problems of Optimistic Methods for Concurrency Control:

Conflicts are expensive to deal with, since the conflicting transaction must be rolled back. Longer transactions are more likely to have conflicts and may be repeatedly rolled back because of conflicts with short transactions.
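In its simplest (backward-validation) form, the validation test above reduces to an overlap check: a transaction fails validation if any transaction that committed while it was executing wrote an item it read. A minimal sketch of this idea, under simplifying assumptions (read/write sets tracked as plain sets; names are illustrative, not a real DBMS API):

```python
def validate(read_set, committed_during_execution):
    """Backward validation sketch.
    committed_during_execution: write-sets of transactions that committed
    after this transaction started its read phase."""
    for write_set in committed_during_execution:
        if read_set & write_set:       # overlap means a stale read: restart
            return False
    return True


assert validate({"A", "B"}, [{"C"}]) is True        # disjoint items: commit
assert validate({"A", "B"}, [{"B", "D"}]) is False  # B was overwritten: abort
```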

5.2 DEADLOCKS:

A deadlock is a condition in which two or more transactions in a set are waiting simultaneously for locks held by some other transaction in the set. Neither transaction can continue because each transaction in the set is in a waiting queue. Thus, a deadlock is an impasse that may result when two or more transactions are each waiting for locks held by the other to be released. A deadlock is also called a circular waiting condition, where two transactions wait (directly or indirectly) for each other. In a deadlock, the two transactions are mutually excluded from accessing the next record required to complete their work; this is also called a deadly embrace. Example: A deadlock exists between two transactions A and B in the following example: Transaction A = access data items X and Y; Transaction B = access data items Y and X. Here, Transaction-A has acquired a lock on X and is waiting to acquire a lock on Y, while Transaction-B has acquired a lock on Y and is waiting to acquire a lock on X. Neither of them can execute further.

Transaction-A                     Time   Transaction-B
---                               t0     ---
Lock (X) (acquired lock on X)     t1     ---
---                               t2     Lock (Y) (acquired lock on Y)
Lock (Y) (request lock on Y)      t3     ---
Wait                              t4     Lock (X) (request lock on X)
Wait                              t5     Wait
Wait                              t6     Wait
Wait                              t7     Wait

5.2.1 Deadlock Detection and Prevention:

Deadlock Detection: This technique allows deadlocks to occur, then detects and resolves them. The database is periodically checked for deadlocks; if one is detected, one of the transactions involved in the deadlock cycle is aborted while the other transactions continue their execution. The aborted transaction is rolled back and restarted. Deadlock Prevention: To prevent any deadlock situation, the DBMS inspects the operations that transactions are about to execute and analyses whether they can create a deadlock. If it finds that a deadlock situation might occur, the transaction is not allowed to execute. Deadlock prevention schemes use the timestamp ordering of transactions to predetermine a deadlock situation.
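Detection is usually done on a wait-for graph: an edge Ti → Tj means "Ti waits for a lock held by Tj", and a deadlock is exactly a cycle in this graph. A minimal depth-first-search sketch (illustrative, not a real DBMS routine):

```python
def has_deadlock(wait_for):
    """wait_for: dict mapping a transaction to the transactions it waits for."""
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wait_for.get(t, []):
            if u in on_stack or (u not in visited and dfs(u)):
                return True          # a back edge closes a waiting cycle
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in wait_for if t not in visited)


# The Transaction-A / Transaction-B example: A waits for B, B waits for A.
assert has_deadlock({"A": ["B"], "B": ["A"]}) is True
assert has_deadlock({"A": ["B"], "B": []}) is False   # a plain wait, no cycle
```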


Wait-Die Scheme: In this scheme, if a transaction requests to lock a resource (data item), which is already held with a conflicting lock by another transaction, then one of the two possibilities may occur −

If TS(Ti) < TS(Tj) − that is Ti, which is requesting a conflicting lock, is older than Tj − then Ti is allowed to wait until the data-item is available.

If TS(Ti) > TS(Tj) − that is, Ti, which is requesting the conflicting lock, is younger than Tj − then Ti dies. Ti is restarted later with a random delay but with the same timestamp.

This scheme allows the older transaction to wait but kills the younger one.

Wound-Wait Scheme: In this scheme, if a transaction requests to lock a resource (data item) which is already held with a conflicting lock by another transaction, one of two possibilities may occur −

If TS(Ti) < TS(Tj), then Ti forces Tj to be rolled back − that is, Ti wounds Tj. Tj is restarted later with a random delay but with the same timestamp.

If TS(Ti) > TS(Tj), then Ti is forced to wait until the resource is available.

This scheme, allows the younger transaction to wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item. In both the cases, the transaction that enters the system at a later stage is aborted.
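Both schemes reduce to a single timestamp comparison at the moment of a lock conflict, as this sketch shows (function names are illustrative; a smaller timestamp means an older transaction):

```python
def wait_die(ts_req, ts_holder):
    """Requester waits if older than the lock holder; otherwise it dies."""
    return "wait" if ts_req < ts_holder else "die"

def wound_wait(ts_req, ts_holder):
    """Requester wounds (aborts) a younger holder; otherwise it waits."""
    return "wound" if ts_req < ts_holder else "wait"


assert wait_die(10, 20) == "wait"     # older requester is allowed to wait
assert wait_die(20, 10) == "die"      # younger requester is aborted
assert wound_wait(10, 20) == "wound"  # older requester preempts younger holder
assert wound_wait(20, 10) == "wait"   # younger requester waits
```

In both schemes the younger (later) transaction is the one that may be aborted, which is what rules out circular waiting.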

Here, we can use any of the two following approaches −

First, do not allow any request for an item, which is already locked by another transaction. This is not always feasible and may cause starvation, where a transaction indefinitely waits for a data item and can never acquire it.

The second option is to roll back one of the transactions. It is not always feasible to roll back the younger transaction, as it may be more important than the older one. With the help of some relative algorithm, a transaction is chosen to be aborted; this transaction is known as the victim, and the process is known as victim selection.

5.3 FAILURES & RECOVERY IN DATABASE:

5.3.1 Failures: A failure is a situation that may occur when a transaction or any other operation crashes. A failure may occur for any of the following reasons:

1. Transaction failure: A transaction must abort once it fails to execute, or once it reaches a point from which it cannot proceed any further. This is known as transaction failure, where only a few transactions or processes are affected. The reasons for transaction failure are:

a. Logical errors: A transaction cannot complete because of a code error or an internal error condition.

b. System errors: The database system itself terminates an active transaction because the DBMS is unable to execute it, or has to stop it because of some system condition. For example, in case of deadlock or resource unavailability, the system aborts an active transaction.

2. System crash: There are problems, external to the system, that may cause the system to stop abruptly and crash. For instance, an interruption in the power supply may cause the failure of the underlying hardware or software. Examples include operating system errors.

3. Disk failure: In the early days of technology evolution, it was a common problem that hard-disk drives or storage drives failed frequently. Disk failures include the formation of bad sectors, unreachability of the disk, a disk crash, or any other failure that destroys all or part of disk storage.

5.3.2 Recovery in Database: Database systems, like any other computer systems, are subject to failures, but the data stored in them must be available as and when required. When a database fails, it must possess facilities for fast recovery. The techniques used to recover data lost due to system crashes, transaction errors, viruses, catastrophic failures, incorrect command execution, etc., are database recovery techniques. To prevent data loss, recovery techniques based on deferred update, immediate update, or backing up data can be used. The recovery techniques are as follows:

5.3.2.1 Log-Based Recovery: Logs are a sequence of records that record the actions performed by transactions. In log-based recovery, the log of each transaction is maintained in some stable storage; if any failure occurs, the database can be recovered from it. The log contains information about the transaction being executed, the values that have been modified, and the transaction state, all stored in order of execution. Example: Assume a transaction modifies the address of an employee. The following logs are written for this transaction:
Log 1: Transaction is initiated, writes a 'START' log. Log: <Tn, START>
Log 2: Transaction modifies the address from 'Pune' to 'Mumbai'. Log: <Tn, Address, 'Pune', 'Mumbai'>
Log 3: Transaction is completed; the log indicates the end of the transaction. Log: <Tn, COMMIT>
There are two methods of creating the log files and updating the database:
1. Deferred Database Modification
2. Immediate Database Modification

1. In Deferred Database Modification, all the logs for the transaction are created and stored in a stable storage system first; only then is the database updated with those steps. In the above example, the three log records are created and stored before the database is modified.

2. In Immediate Database Modification, after each log record is created, the database is modified for that step immediately. In the above example, the database is modified at each step of the log entry: after the first log entry the transaction accesses the database to fetch the record, then the second log is written followed by updating the employee's address, then the third log followed by committing the database changes.

Checkpoint: Keeping and maintaining logs in real time and in a real environment may fill all the available memory in the system, and as time passes the log file may grow too big to be handled at all. A checkpoint is a mechanism whereby all the previous logs are removed from the system and stored permanently on a storage disk. The checkpoint declares a point before which the DBMS was in a consistent state and all transactions were committed. When a system with concurrent transactions crashes and recovers, it behaves in the following manner −

The recovery system reads the logs backwards from the end to the last checkpoint and maintains two lists: an undo-list and a redo-list. If the recovery system sees a log with <Tn, Start> and <Tn, Commit>, or just <Tn, Commit>, it puts the transaction in the redo-list. If it sees a log with <Tn, Start> but no commit or abort log, it puts the transaction in the undo-list. All the transactions in the undo-list are then undone and their logs are removed; all the transactions in the redo-list are redone from their log records.
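The undo/redo classification above can be sketched as a single pass over the log records written since the last checkpoint. The record format here is a simplified tuple (a real DBMS log record also carries the old and new data values), and the function name is an assumption for illustration:

```python
def classify(log):
    """log: (transaction, action) records since the last checkpoint,
    in execution order. Returns (undo_list, redo_list) as sets."""
    started, committed = set(), set()
    for txn, action in log:
        if action == "START":
            started.add(txn)
        elif action == "COMMIT":
            committed.add(txn)
    redo = committed              # committed work is replayed
    undo = started - committed    # unfinished work is rolled back
    return undo, redo


log = [("T1", "START"), ("T1", "COMMIT"), ("T2", "START")]
undo, redo = classify(log)
assert redo == {"T1"}    # T1 committed: redo it
assert undo == {"T2"}    # T2 never committed: undo it
```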

5.3.2.2 Shadow Paging: Shadow paging is an alternative to log-based recovery techniques. It may require fewer disk accesses, but it is hard to extend to multiple concurrent transactions. A page in this context refers to a unit of physical storage, typically on the order of 1 to 64 KB. The idea is to maintain two page tables during the life of a transaction: the current page table, which is volatile, and the shadow page table, which is considered non-volatile. When the transaction starts, the page table contents are copied so that both tables are identical. The shadow page table is never changed during the life of the transaction, while the current page table is updated with each write operation. Each table entry points to a page on disk. When the transaction commits, the contents of the current page table are copied into the shadow page table and the disk blocks holding the old data are released. For pages updated by the transaction, two versions are kept: the old version referenced by the shadow directory and the new version by the current directory. If the shadow table is stored in non-volatile memory and a system crash occurs, the shadow page table is copied to the current page table. This guarantees that the shadow page table points to the database pages corresponding to the state of the database prior to any transaction that was active at the time of the crash, making aborts automatic. Since recovery involves neither undoing nor redoing data items, this technique is categorized as a NO-UNDO/NO-REDO recovery technique.

Advantages:

1. No overhead for writing log records.
2. No-undo / no-redo algorithm.
3. Recovery is faster.

Disadvantages:
1. Commit overhead. The commit of a single transaction using shadow paging requires multiple blocks to be output: the current page table, the actual data, and the disk address of the current page table. Log-based schemes need to output only the log records.
2. Data fragmentation. Shadow paging causes database pages to change location (so they are no longer contiguous).
3. Garbage collection. Each time a transaction commits, the database pages containing the old version of the data changed by the transaction must become inaccessible. Such pages are considered garbage, since they are not part of the free space and do not contain any usable information. Periodically it is necessary to find all of the garbage pages and add them to the list of free pages. This process is called garbage collection, and it imposes additional overhead and complexity on the system.
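The current/shadow page-table mechanism can be sketched as follows. This is a deliberately simplified in-memory model (class and method names are assumptions): writes go to fresh disk pages via the current table, commit atomically installs the current table as the new shadow, and abort simply discards the current table, which is why neither undo nor redo is needed.

```python
class ShadowPagedDB:
    """Sketch of shadow paging: two page tables map logical pages to disk pages."""

    def __init__(self, pages):
        self.disk = dict(pages)                 # disk_page_id -> data
        self.shadow = {k: k for k in pages}     # logical page -> disk page
        self.current = dict(self.shadow)        # identical copy at start
        self._next = len(pages)

    def write(self, page, data):
        new_id = self._next
        self._next += 1
        self.disk[new_id] = data                # old version stays on disk
        self.current[page] = new_id             # only the current table changes

    def commit(self):
        self.shadow = dict(self.current)        # single atomic table install

    def abort(self):
        self.current = dict(self.shadow)        # NO-UNDO: just drop the table


db = ShadowPagedDB({0: "old"})
db.write(0, "new")
db.abort()
assert db.disk[db.current[0]] == "old"   # uncommitted write discarded
db.write(0, "new")
db.commit()
assert db.disk[db.shadow[0]] == "new"    # committed version visible
```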

5.4 CENTRALIZED AND CLIENT–SERVER ARCHITECTURES

5.4.1 Centralized Architectures: A centralized database (CDB) is a database that is located, stored, and maintained in a single location. This location is most often a central computer or database system, for example a desktop or server CPU, or a mainframe computer. A modern, general-purpose computer system consists of one to a few processors and a number of device controllers that are connected through a common bus that provides access to shared memory. The processors have local cache memories that store local copies of parts of the memory, to speed up access to data. Each processor may have several independent cores, each of which can execute a separate instruction stream. Each device controller is in charge of a specific type of device, such as a disk drive, an audio device, or a video display. The processors and the device controllers can execute concurrently, competing for memory access.

We distinguish two ways in which computers are used: as single-user systems and as multiuser systems. Personal computers and workstations fall into the first category. A typical single-user system is a desktop unit used by a single person. A typical multiuser system, on the other hand, has more disks and more memory, may have multiple processors, and serves a large number of users who are connected to the system remotely. Database systems designed for use by single users usually do not provide many of the facilities that a multiuser database provides. Provisions for crash recovery in such systems are either absent or primitive; for example, they may consist of simply making a backup of the database before any update.

Although most general-purpose computer systems in use today have multiple processors, they have coarse-granularity parallelism, with only a few processors all sharing the main memory. Databases running on such machines usually do not attempt to partition a single query among the processors; instead, they run each query on a single processor, allowing multiple queries to run concurrently. Thus, such systems support a higher throughput; that is, they allow a greater number of transactions to run per second, although individual transactions do not run any faster. Databases designed for single-processor machines already provide multitasking, allowing multiple processes to run on the same processor in a time-shared manner, giving the user a view of multiple processes running in parallel. Thus, coarse-granularity parallel machines logically appear identical to single-processor machines, and database systems designed for time-shared machines can easily be adapted to run on them. In contrast, machines with fine-granularity parallelism have a large number of processors, and database systems running on such machines attempt to parallelize the single tasks submitted by users.

5.4.2 Client–Server Architecture: Client–server architecture is a computer network architecture in which many clients (remote processors) request and receive service from a centralized server (host computer). Client computers provide an interface that allows a user to request services of the server and to display the results the server returns. The functionality provided by database systems can be broadly divided into two parts: the front end and the back end. The back end manages access structures, query evaluation and optimization, concurrency control, and recovery. The front end of a database system consists of tools such as the SQL user interface, forms interfaces, report-generation tools, and data mining and analysis tools. The interface between the front end and the back end is through SQL or through an application program.

Certain application programs, such as spreadsheets and statistical-analysis packages, use the client–server interface directly to access data from a back-end server. In effect, they provide front ends specialized for particular tasks. Systems that deal with large numbers of users adopt a three-tier architecture, in which the application server, in effect, acts as a client to the database server. Some transaction-processing systems provide a transactional remote procedure call interface to connect clients with a server. These calls appear like ordinary procedure calls to the programmer, but all the remote procedure calls from a client are enclosed in a single transaction at the server end. Thus, if the transaction aborts, the server can undo the effects of the individual remote procedure calls.

5.4.2.1 Two Tier & Three Tier Client–Server Architecture:

1. Two-tier architecture: is similar to a basic client-server model. The application at the client end communicates directly with the database at the server side. APIs such as ODBC and JDBC are used for this interaction. The server side is responsible for providing query processing and transaction management functionality. On the client side, the user interfaces and application programs are run. The application on the client side establishes a connection with the server side in order to communicate with the DBMS. Advantages of this type are that maintenance and understanding are easier and that it is compatible with existing systems. However, this model gives poor performance when there are a large number of users.
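The two-tier flow can be sketched with Python's built-in sqlite3 module standing in for an ODBC/JDBC driver (the table and data here are invented for illustration): the client-side application opens a connection directly to the database and issues SQL itself.

```python
import sqlite3

# Two-tier sketch: the client application connects directly to the DBMS
# through a database API (sqlite3 here, standing in for ODBC/JDBC) and
# sends SQL itself; query processing happens on the server side.
conn = sqlite3.connect(":memory:")        # establish the connection
conn.execute("CREATE TABLE account (id INTEGER, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 500)")
conn.commit()

row = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()
print(row[0])  # 500
conn.close()
```

Releasing the connection with close() (or ending the transaction with COMMIT/ROLLBACK) is what frees any locks the session held.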

2. Three-tier architecture: In this type, there is another layer between the client and the server. The client does not communicate directly with the server. Instead, it interacts with an application server, which in turn communicates with the database system, where query processing and transaction management take place. This intermediate layer acts as a medium for the exchange of partially processed data between server and client. This type of architecture is used for large web applications.
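A minimal sketch of the three-tier layering (the function names and the dict-based storage are hypothetical, chosen only to illustrate the separation): the client calls the application server, and only the application server talks to the database tier.

```python
# Three-tier sketch: the client never touches the database directly; it
# calls the application server, which queries the database tier and
# returns partially processed results. Names and storage are illustrative.

DATABASE = {"alice": 1200, "bob": 800}          # database tier (stand-in)

def db_query(account):                           # database-tier access
    return DATABASE.get(account)

def app_server_get_balance(request):             # application-server tier
    account = request["account"]
    raw = db_query(account)
    return {"account": account, "balance": raw}  # partially processed data

# Client tier: only ever talks to the application server.
response = app_server_get_balance({"account": "alice"})
print(response["balance"])  # 1200
```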

Figure: 2-Tier Architecture    Figure: 3-Tier Architecture
(Clients connect through a network to servers such as a print server, a file server, and a DBMS server.)


Database Management System Unit-5: Distributed & Parallel Databases

Prepared By: Mr. V. K. Wani 12

5.5 INTRODUCTION TO PARALLEL DATABASES Parallel systems improve processing and I/O speeds and performance by using multiple processors and disks in parallel. The driving force behind parallel database systems is the demands of applications that have to query extremely large databases (on the order of terabytes) or that have to process an extremely large number of transactions per second (on the order of thousands of transactions per second). Centralized and client–server database systems are not powerful enough to handle such applications. In parallel processing, many operations are performed simultaneously, as opposed to serial processing, in which the computational steps are performed sequentially. Parallel computers with hundreds of processors and disks are available commercially. There are two main measures of performance of a database system:

(1) Throughput, the number of tasks that can be completed in a given time interval, and (2) Response time, the amount of time it takes to complete a single task from the time it is submitted.

Speed-up and Scale-up: Two important issues in studying parallelism are speed-up and scale-up. Running a given task in less time by increasing the degree of parallelism is called speed-up. Handling larger tasks by increasing the degree of parallelism is called scale-up.

Example Speed-up: Adding more resources results in proportionally less running time for a fixed amount of data.
– 10 seconds to scan a DB of 10,000 records using 1 CPU
– 1 second to scan a DB of 10,000 records using 10 CPUs

Example Scale-up: If resources are increased in proportion to an increase in data/problem size, the overall time should remain constant.
– 1 second to scan a DB of 1,000 records using 1 CPU
– 1 second to scan a DB of 10,000 records using 10 CPUs
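The two examples above can be expressed as simple ratios; in this sketch, a speed-up of N with N times the resources, and a scale-up ratio of 1, both indicate linear behaviour.

```python
def speedup(time_original, time_parallel):
    # Speed-up: same task, more resources; the ideal value equals the
    # factor by which resources were increased.
    return time_original / time_parallel

def scaleup(time_small, time_large):
    # Scale-up: task size and resources grown in the same proportion;
    # a value close to 1 means linear scale-up.
    return time_small / time_large

# Figures from the examples above:
print(speedup(10, 1))  # 10.0 (linear speed-up with 10 CPUs)
print(scaleup(1, 1))   # 1.0  (linear scale-up)
```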

5.5.1 The various architectures of parallel databases are as follows:

5.5.1.1 Shared-memory architecture: Multiple processors share the main memory (RAM), but each processor has its own disk (HDD). In shared-memory architecture, the processors and disks have access to a common memory, typically via a bus or an interconnection network. The benefit of shared memory is extremely efficient communication between processors: data in shared memory can be accessed by any processor without being moved by software. A processor can send messages to other processors much faster by using memory writes than by sending a message through a communication mechanism. However, if many processes run concurrently, speed is reduced, just as a computer slows down when many parallel tasks run on it. This architecture is not scalable beyond 32 or 64 processors, because the bus or the interconnection network becomes a bottleneck.

Figure: Shared-memory architecture: processors (P) access a global shared memory and the disks (D) through an interconnection network.

Advantages: 1. It is closer to a conventional machine and is easy to program. 2. Overhead is low. 3. OS services are leveraged to utilize the additional CPUs. Disadvantages: 1. It leads to a bottleneck problem. 2. It is expensive to build. 3. It is less sensitive to partitioning and is not scalable.
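The fast communication described above can be illustrated with two Python threads standing in for processors (a sketch only; real shared-memory machines also coordinate hardware caches): the writer's plain memory write is visible to the reader with no message passing involved.

```python
import threading

# Two "processors" (threads here) communicating through shared memory:
# the writer publishes a value with a plain memory write; the reader
# sees it without any message being sent over a network.
shared = {"value": None}
ready = threading.Event()

def writer():
    shared["value"] = 42   # a memory write is the whole "message"
    ready.set()            # signal that the value is available

def reader(results):
    ready.wait()           # wait until the writer has published
    results.append(shared["value"])

results = []
t1 = threading.Thread(target=writer)
t2 = threading.Thread(target=reader, args=(results,))
t2.start(); t1.start()
t1.join(); t2.join()
print(results[0])  # 42
```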


5.5.1.2 Shared-disk architecture: Each node has its own main memory, but all nodes share mass storage, usually a storage area network. In practice, each node usually also has multiple processors. In the shared-disk model, all processors can access all disks directly via an interconnection network, but the processors have private memories. It offers a cheap way to provide a degree of fault tolerance: if a processor fails, the other processors can take over its tasks, since the database is resident on disks that are accessible from all processors.

5.5.1.3 Shared-nothing architecture: In a shared-nothing system, each node of the machine consists of a processor, memory, and one or more disks. A processor at one node may communicate with a processor at another node through a high-speed interconnection network. A node functions as the server for the data on the disk or disks that it owns. Since local disk references are serviced by local disks at each node, the shared-nothing model overcomes the disadvantage of requiring all I/O to go through a single interconnection network; only queries, accesses to nonlocal disks, and result relations pass through the network. Moreover, the interconnection networks for shared-nothing systems are usually designed to be scalable, so that their transmission capacity increases as more nodes are added.
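A sketch of how a shared-nothing system places data (dicts stand in for each node's private disk, and the partitioning rule is illustrative): each row is owned by exactly one node, chosen from the key, so local references never cross the network.

```python
NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]   # each node's private storage

def owner(key):
    # Hash partitioning: the key alone determines the owning node.
    return key % NUM_NODES

def insert(key, row):
    nodes[owner(key)][key] = row             # only the owning node does I/O

def lookup(key):
    # Only the request and the result relation cross the "network";
    # the disk access itself is local to the owning node.
    return nodes[owner(key)].get(key)

insert(7, "row-7")
insert(8, "row-8")
print(lookup(7))  # row-7
```

Because the partitions are disjoint, adding a node means redistributing keys, which is why reorganization is listed below as a cost of this architecture.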

5.5.1.4 Hierarchical architecture: It combines the characteristics of shared-memory, shared-disk, and shared-nothing architectures. At the top level, the system consists of nodes that are connected by an interconnection network and do not share disks or memory with one another. Thus, the top level is a shared-nothing architecture. Each node of the system could actually be a shared-memory system with a few processors. Alternatively, each node could be

Figure: Shared-disk architecture: processors (P) with private memories (M) access all disks (D) through an interconnection network.

Figure: Shared-nothing architecture: each node has its own processor (P), memory (M), and disks (D), and the nodes communicate through an interconnection network.

Advantages (shared disk): 1. Since each processor has its own memory, the memory bus is not a bottleneck. 2. It offers a cheap way to provide a degree of fault tolerance. Disadvantages: 1. There is more interference. 2. It increases network bandwidth requirements. 3. Shared disk is less sensitive to partitioning.

Advantages (shared nothing): 1. It provides linear scale-up and linear speed-up. 2. Shared nothing benefits from "good" partitioning. 3. It is cheap to build. Disadvantages: 1. It is hard to program. 2. Addition of new nodes requires reorganizing data. 3. The costs of communication and of nonlocal disk access are higher than in shared-memory or shared-disk architectures.


a shared-disk system, and each of the systems sharing a set of disks could be a shared-memory system.

5.5.2 Key Elements of Parallel Database Processing: The following are the key elements of parallel database processing:

1. Speed-up: Speed-up is a property in which the time taken to perform a task decreases in proportion to the increase in the number of CPUs and disks working in parallel. In other words, speed-up is the property of running a given task in less time by increasing the degree of parallelism (more hardware). With additional hardware, speed-up holds the task constant and measures the time saved. Thus, speed-up enables users to improve system response time for their queries, assuming the size of their databases remains roughly the same.
2. Scale-up: Scale-up is the factor that expresses how much more work can be done in the same time period by a system n times larger. With added hardware, a formula for scale-up holds the time constant and measures the increased size of the job that can be done.
3. Synchronization: Coordination of concurrent tasks is called synchronization. Synchronization is necessary for correctness. The key to successful parallel processing is to divide up tasks so that very little synchronization is necessary; the less synchronization necessary, the better the speed-up and scale-up. The amount of synchronization depends on the amount of resources and the number of users and tasks working on those resources. Little synchronization may be needed to coordinate a small number of concurrent tasks, but much more may be necessary to coordinate many concurrent tasks.
4. Locking: Locks are fundamentally a way of synchronizing tasks. Many different locking mechanisms are necessary to enable the synchronization of tasks required by parallel processing. The Integrated Distributed Lock Manager (DLM or IDLM) is the internal locking facility used with Parallel Server. It coordinates resource sharing between nodes running a parallel server. The instances of a parallel server use the Integrated Distributed Lock Manager to communicate with each other and to coordinate modification of database resources. Each node operates independently of the other nodes, except when contending for the same resource.
5. Messaging: Parallel processing requires fast and efficient communication between nodes: a system with high bandwidth and low latency that communicates efficiently with the IDLM. Bandwidth is the total size of messages that can be sent per second. Latency is the time (in seconds) it takes to place a message on the interconnect; low latency thus means that many messages can be placed on the interconnect per second. An interconnect with high bandwidth is like a wide highway with many lanes to accommodate heavy traffic.
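Lock-based synchronization, as described in items 3 and 4, can be sketched with Python threads (an in-process stand-in for a distributed lock manager): without the lock, two concurrent read-modify-write sequences could interleave and lose an update.

```python
import threading

# A minimal sketch of lock-based synchronization: holding the lock
# serializes the read-modify-write, so no update is lost even when
# several tasks contend for the same resource.
balance = 0
lock = threading.Lock()

def deposit(amount, times):
    global balance
    for _ in range(times):
        with lock:             # acquire before touching shared data
            balance += amount  # read-modify-write is now atomic
                               # lock is released on leaving the block

threads = [threading.Thread(target=deposit, args=(1, 10000)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)  # 40000: no lost updates
```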


5.6 DISTRIBUTED DATABASE SYSTEMS:

In a distributed database system, the database is stored on several computers. The computers in a distributed system communicate with one another through various communication media, such as high-speed private networks or the Internet. They do not share main memory or disks. The computers in a distributed system may vary in size and function, ranging from workstations up to mainframe systems. The computers in a distributed system are referred to by a number of different names, such as sites or nodes, depending on the context in which they are mentioned. We mainly use the term site to emphasize the physical distribution of these systems. The main differences between shared-nothing parallel databases and distributed databases are that distributed databases are typically geographically separated, are separately administered, and have a slower interconnection. Another major difference is that, in a distributed database system, we differentiate between local and global transactions. A local transaction is one that accesses data only from sites where the transaction was initiated. A global transaction, on the other hand, is one that either accesses data in a site different from the one at which the transaction was initiated, or accesses data in several different sites.

5.6.2 Types of Distributed Databases:

5.6.2.1 Homogeneous Distributed Databases: In a homogeneous distributed database, all sites use identical DBMS software and operating systems. Its properties are:
− The sites use very similar software.
− The sites use identical DBMS or DBMS from the same vendor.
− Each site is aware of all other sites and cooperates with other sites to process user requests.
− The database is accessed through a single interface, as if it were a single database.


5.6.2.2 Heterogeneous Distributed Databases: In a heterogeneous distributed database, different sites have different operating systems, DBMS products, and data models. Its properties are:
− Different sites use dissimilar schemas and software.
− The system may be composed of a variety of DBMSs, such as relational, network, hierarchical, or object-oriented.
− Query processing is complex due to dissimilar schemas.
− Transaction processing is complex due to dissimilar software.
− A site may not be aware of other sites, so there is limited cooperation in processing user requests.

5.6.3 Types of Distributed Database Architectures:

5.6.3.1 Client-Server Architecture for Distributed DBMS: This is a two-level architecture in which the functionality is divided between servers and clients. The server functions primarily encompass data management, query processing, optimization, and transaction management. Client functions mainly include the user interface.

5.6.3.2 Peer-to-Peer Architecture for Distributed DBMS: Peer-to-peer (P2P) architecture is a commonly used computer networking architecture in which each workstation, or node, has the same capabilities and responsibilities. It is often contrasted with the classic client/server architecture, in which some computers are dedicated to serving others. In these systems, each peer acts as both a client and a server in providing database services. The peers share their resources with other peers and coordinate their activities. The figure below shows the general P2P architecture and its use in a DBMS.


Figure: Peer to Peer Architecture

5.6.4 Distributed Database Design: Distributed database design determines, given the database and its workload, how the database should be split and allocated to sites so as to optimize certain objective functions. The different distributed database design approaches are:

5.6.4.1 Top-down approach: The top-down design process is mostly used when designing a system from scratch. The process starts with a requirement analysis phase, including analysing the company situation, defining problems and constraints, defining objectives, and designing scope and boundaries. The next two activities are conceptual design and view design. Focusing on the data requirements, the conceptual design deals with entity-relationship modelling and normalization; it creates the abstract data structure to represent the real-world items. The view design defines the user interfaces. The conceptual schema is a virtual view of all databases taken together in a distributed database environment. It should cover the entity and relationship requirements for all user views. Furthermore, the conceptual model should support existing applications as well as future applications. The definition of the global conceptual schema (GCS) comes from the conceptual design.

The next step is distribution design. The global conceptual schema and the access information collected from the view design activity are inputs to this step. By fragmenting and distributing entities over the system, this step designs the local conceptual schemas; it can therefore be further divided into two steps: fragmentation and allocation. Distribution design also includes the selection of DBMS software at each site. The mapping of the local conceptual schemas to the physical storage devices is accomplished through the physical design activity. Throughout the design and development of the distributed database system, constant monitoring and periodic adjustment and tuning are also critical activities for achieving a successful database implementation and suitable user interfaces. After the design phase we get the local internal schemas (LIS).


5.6.4.2 Bottom-up approach: The bottom-up approach is suitable when the objective of the design is to integrate existing database systems. Bottom-up design starts from the individual local conceptual schemas, and the objective of the process is to integrate the local schemas into the global conceptual schema. One of the most important aspects of the design strategy is determining how to integrate multiple database systems together. Implementation alternatives are classified according to the autonomy, distribution, and heterogeneity of the local systems. Autonomy indicates the independence of the individual DBMSs. In an autonomous system, the individual DBMSs are able to perform local operations independently and have no reliance on centralized service or control. The consistency of the whole system should not be affected by the behaviour of an individual DBMS. Three possible degrees of autonomy are tight integration, semiautonomous systems, and total isolation. In a tightly integrated system, although information is stored in multiple databases, users see only a single image of the entire system; one of the DBMSs controls the processing of user requests. The DBMSs in a semiautonomous system can operate separately, but they are also willing to share their local data. In totally isolated systems, the individual DBMSs do not know of the existence of the other DBMSs.


5.7 INTERNET DATABASE An Internet database is a broad term for managing data online. Also called online databases or web databases, they give you the ability to build your own databases/data storage without being a database guru or even a technical person. Online databases are hosted on websites and made available as software-as-a-service products accessible via a web browser. They may be free or require payment, such as a monthly subscription. Some have enhanced features such as collaborative editing and email notification. Internet database applications are usually developed using very few graphics and are built using XHTML forms and style sheets. Most companies are starting to migrate from old-fashioned desktop database applications to web-based Internet database applications in XHTML format. Below are some of the benefits of Internet database applications:

Powerful and scalable: Internet database applications are more robust and agile and are able to expand and scale up more easily. Database servers built to serve Internet applications are designed to handle millions of concurrent connections and complex SQL queries. A good example is Facebook, which uses database servers that are able to handle millions of inquiries and complex SQL queries.
Web based: Internet database applications are web-based applications, so the data can be accessed using a browser at any location.
Security: Database servers have been fortified with preventive features, and security protocols have been implemented to combat today's cyber-security threats and vulnerabilities.
Open source: There are many powerful database servers that are open source. Many large enterprise sites use open-source database servers, for example Facebook, Yahoo, YouTube, Flickr, and Wikipedia.
Abundant features: There are many open-source programming languages (such as PHP, Python, and Ruby) and hundreds of powerful open-source libraries, tools, and plug-ins specifically built to interact with today's database servers.