CS346: Advanced Databases
Graham Cormode
Transaction Processing
Outline
Chapter: “Introduction to Transaction Processing Concepts and Theory” in Elmasri and Navathe
¨ Introduce the concepts of transactions and concurrency control
¨ ACID properties: atomicity, consistency, isolation, durability
¨ System logging, commit points, and failure recovery
¨ Schedules, serializability, and conflicts
¨ Transactions in SQL
CS346 Advanced Databases2
Why?
¨ Transaction processing is an important part of databases
  – Jim Gray won the Turing Award for his work on transactions
¨ A meeting of theory and practice
  – Theory explains how to produce effective sequences of transactions
  – Simple protocols to schedule transactions in practice
¨ An introduction to the topic of concurrency control and locking
  – Of importance in distributed systems and managing distributed data
Transaction Processing
¨ Transaction Processing Systems:
  – Airline reservations
  – Banking/credit card processing
  – Ecommerce / online purchasing / auctions
  – Stock markets
¨ Common requirements
  – Many concurrent users making concurrent requests
  – High availability, fast response time
  – Don’t sell the same item (seat, share) to two different people!
¨ Based on the idea of atomic transactions
  – A transaction either succeeds or is declined
Transaction Processing Concepts
¨ Many examples of databases so far look like single-user systems
¨ In practice, most databases are multiuser
  – Hundreds or thousands (or more) of users submitting transactions
¨ Make use of wide availability of parallelism in modern systems
  – Multithreaded execution per core
  – Multiple cores per CPU
  – Multiple CPUs per system (cluster)
¨ Will study management of concurrent access to shared resources
  – Here, data items are shared resources
Transactions
¨ Transaction: the logical unit of database processing
  – Including at least one insertion, deletion or retrieval operation
  – May form part of a program, or be specified via SQL
¨ A complex program may be broken into many basic transactions
  – Programmer may explicitly specify start and end of a transaction
¨ Distinguish between read-only and read-write transactions
  – Read-only transactions seem easier, but still need a consistent view of data
Databases and items
¨ Transaction processing adopts a very simple model of a database
  – Here: a database is a collection of named data items
¨ The granularity of the database is the size of a single data item
  – Can work at the level of a single database record
  – A higher level item: a single disk block or whole file
  – Lower level items: individual field (attribute) of a record
  – Data item may correspond to a basic concept, e.g. a seat on a flight
¨ Each data item has a unique name (identifier) used internally
  – E.g. the disk block address – not used by the programmer
Basic Data Operations
¨ With this simplified database model, the basic operations are
  – Read(X): read the item named X into local memory
  – Write(X): write the item named X from local memory
¨ These cover the various substeps of data access
  – Map from the name (X) to the relevant disk block containing X
  – Moving data to/from disk via OS calls and buffers etc.
  – Managing cache memory to speed up operations
¨ Transactions are formed by a sequence of read/write operations
Example Transactions
¨ All operations of a transaction must complete successfully for the transaction to be successful
¨ Read (write) set: the set of items read (written) by a transaction
  – Read set (a) = {X, Y}. Read set (b) = {X}
  – Write set (a) = {X, Y}. Write set (b) = {X}
¨ Need concurrency control and recovery
  – What happens if we try to run (a) and (b) at the same time?
Need for concurrency control
¨ E.g. airline booking: don’t sell the same seat twice!
  – Previous transaction (a): move N reservations from X to Y
  – Transaction (b): reserve M seats on the flight corresponding to X
¨ Bad things can happen with concurrency due to interleaving
  – Lost updates
  – Temporary updates
  – Incorrect aggregation
  – Unrepeatable reads
Lost Updates
¨ Lost updates: when two transactions are interleaved
  – If T1 and T2 run as shown, the update to X from T1 is lost
¨ E.g. if X = 80 to begin, N = 5 and M = 4, this order results in X = 84 rather than X = 79
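The arithmetic in this example can be replayed with a minimal Python sketch. The slide's figure is not reproduced in this transcript, so the interleaving below is an assumption: both transactions read X before either writes, and T2 writes last.

```python
# Hypothetical replay of a lost update; the interleaving is assumed,
# not copied from the slide's figure.
db = {"X": 80}
N, M = 5, 4

t1_local = db["X"]      # T1: read_item(X) sees 80
t2_local = db["X"]      # T2: read_item(X) also sees 80

t1_local -= N           # T1: X := X - N (local copy becomes 75)
t2_local += M           # T2: X := X + M (local copy becomes 84)

db["X"] = t1_local      # T1: write_item(X) stores 75
db["X"] = t2_local      # T2: write_item(X) overwrites T1's update with 84

lost_update_result = db["X"]
serial_result = 80 - N + M   # either serial order applies both updates
print(lost_update_result, serial_result)
```

T2 computed its new value from the X it read before T1 wrote, so T1's subtraction never reaches the final state.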
Temporary Update (Dirty Read)
¨ Happens if a value is read in the middle of a transaction that fails
  – If a transaction fails (T1), it is rolled back to the previous state
  – Meanwhile, another transaction may update the intermediate value
¨ Value of X read by T2 is dirty data as it has not been committed
  – Hence this is sometimes called the dirty read problem
Incorrect Summary
¨ One transaction computes an aggregate while another updates
  – Can include some values before update, others after update
¨ Generates a result that doesn’t correspond to before or after
  – In the example, the correct result is the same before and after T1
Unrepeatable read
¨ Concurrency can cause problems with read-only transactions
  – Suppose a transaction reads item X twice at different times
  – The value of X is changed by another transaction in between
  – The first transaction gets different values for the same item!
  – Can arise in booking transactions: check availability, then update
Transaction recovery
¨ Transactions should be “all or nothing”: called atomic
  – A transaction either completes successfully (and correctly): commit
  – Or has no effect on the database or other transactions: abort
¨ If a transaction fails after some operations, it must be undone
  – Roll back the earlier operations
  – Many possible reasons for transaction failure
Reasons for transaction failure
1. Computer failure: disk error, memory read error, crash
2. Transaction/system error: divide by zero, integer overflow
  – May also have out-of-bounds parameters, program bug
3. Local errors or exceptions during the transaction
  – E.g., can’t find the referenced item
  – E.g., insufficient funds for balance transfer
4. Concurrency control enforcement
  – System may decide to abort a transaction to ensure correctness
  – May need to abort to resolve “deadlock” between transactions
5. Disk failure: data on disk has been corrupted
6. Physical problems: fire, theft, flood, operator error
  – PEBCAK: Problem exists between chair and keyboard
Transaction states
¨ To ensure transaction atomicity, the system needs to track the state
  – The recovery manager needs to keep track of each operation
¨ Transactions can be in one of a number of states
  – Active state (after starting execution, can read and write)
  – Partially committed state: after it has finished its operations
    • Need to reach a point where system failure would still leave the data in a consistent state
  – Committed state: transaction is completed, a commit point is made
  – Failed state: if a check fails or the transaction is aborted
    • May have to roll back some writes
  – Terminated state: the transaction leaves the system
¨ Failed or aborted transactions may be restarted (afresh) later
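The states above can be read as a small state machine. The sketch below is illustrative only: the slide's state diagram is not in this transcript, so the allowed transitions are an assumption based on the usual textbook diagram, and `TRANSITIONS` and `advance` are hypothetical names.

```python
# Assumed transition table (the slide's diagram is not reproduced here).
TRANSITIONS = {
    "active": {"partially_committed", "failed"},
    "partially_committed": {"committed", "failed"},
    "committed": {"terminated"},
    "failed": {"terminated"},
    "terminated": set(),
}

def advance(state, new_state):
    """Return new_state if the move is allowed, otherwise raise ValueError."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

# A transaction that runs to completion.
state = "active"
for nxt in ("partially_committed", "committed", "terminated"):
    state = advance(state, nxt)
print(state)
```

Under this table a committed transaction can never move to the failed state, which matches the idea that a commit point is final.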
System Log
¨ To recover from transaction failures, the system keeps a log
  – Track all transaction operations that affect the database
¨ The system log is a sequential, append-only file kept on disk
  – So more likely to survive system failure/crash
¨ Use memory to buffer the most recent updates
  – Write out buffers to disk when they are full
  – Ensure buffers are flushed to disk at a commit point
  – Periodically back up the log to archival storage
¨ The log consists of a sequence of log records
System log records
¨ [start_transaction, T]: T is a unique transaction id
¨ [write_item, T, X, old_value, new_value]
  – Transaction T affects item X
  – Technically, only old_value is needed for rollback
¨ [read_item, T, X]: (read entry not strictly needed for rollback)
  – May be included for other purposes e.g. auditing
¨ [commit, T]: T has successfully completed, and can be committed
¨ [abort, T]: T has been aborted
Failure recovery
¨ Recovering from failure means either undoing or redoing steps
¨ Undo: undo each WRITE operation
  – Trace backwards through the log and write the old_value
¨ Redo: repeat each WRITE operation using new_value
  – Needed if a failure means the writes may not have all completed
  – Ensures that all operations have been applied successfully
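Undo and redo can be sketched over write_item log records in Python. This is a minimal illustration, assuming log records are stored as tuples in the [write_item, T, X, old_value, new_value] layout; the function names `undo` and `redo` and the sample values are hypothetical.

```python
# Sample log for one transaction; values are made up for illustration.
log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 80, 75),   # (type, T, item, old_value, new_value)
    ("write_item", "T1", "Y", 20, 25),
]

def undo(db, log, tid):
    """Trace backwards through the log, restoring old_value for each write by tid."""
    for rec in reversed(log):
        if rec[0] == "write_item" and rec[1] == tid:
            db[rec[2]] = rec[3]

def redo(db, log, tid):
    """Trace forwards through the log, reapplying new_value for each write by tid."""
    for rec in log:
        if rec[0] == "write_item" and rec[1] == tid:
            db[rec[2]] = rec[4]

# Suppose the crash happened after X reached disk but before Y did.
on_disk = {"X": 75, "Y": 20}

undone = dict(on_disk)
undo(undone, log, "T1")

redone = dict(on_disk)
redo(redone, log, "T1")
print(undone, redone)
```

Both operations are idempotent here: running `undo` or `redo` a second time leaves the same values, which matters if recovery itself is interrupted.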
Commit Points
¨ Commit points mark successful completion of transactions
  – All operations of transaction T have been executed successfully
  – AND the effect of all operations is recorded in the log
¨ The transaction is then committed and is permanently recorded
  – Write a [commit, T] record in the log
¨ If a system failure occurs:
  – Find all transactions T that have started but not committed
  – Roll back their associated operations to undo their effect
  – May have to redo some transactions to ensure correctness
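The recovery steps after a system failure can be sketched as a single pass over the log. This is a simplified illustration, not a full recovery algorithm: it assumes tuple-shaped log records, a hypothetical `recover` function, and a strict schedule so that restoring old values is sufficient.

```python
def recover(db, log):
    """Undo writes of uncommitted transactions, then redo writes of committed ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Undo pass: scan backwards, restoring old_value for uncommitted writes.
    for rec in reversed(log):
        if rec[0] == "write_item" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    # Redo pass: scan forwards, reapplying new_value for committed writes.
    for rec in log:
        if rec[0] == "write_item" and rec[1] in committed:
            db[rec[2]] = rec[4]
    return db

log = [
    ("start_transaction", "T1"),
    ("write_item", "T1", "X", 80, 75),
    ("commit", "T1"),
    ("start_transaction", "T2"),
    ("write_item", "T2", "Y", 20, 30),   # T2 never reached its commit point
]

# At the crash, T1's write had not reached disk but T2's had.
disk = {"X": 80, "Y": 30}
recovered = recover(dict(disk), log)
print(recovered)
```

After recovery the database reflects T1 completely and T2 not at all, which is the "all or nothing" behaviour the commit point is meant to guarantee.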
ACID properties of transaction processing
¨ Atomicity: a transaction is an atomic unit of processing
  – It is either performed completely, or not at all
  – Controlled by the transaction recovery subsystem of the DBMS
¨ Consistency: transactions should preserve database consistency
  – If a transaction is done fully, it should keep DB in a consistent state
  – Responsibility of the programmers, integrity constraints
¨ Isolation: effect should be independent of other transactions
  – It should be as if it is the only transaction executing
  – Enforced by the concurrency control system
¨ Durability: changes made must persist in the database
  – Changes made by a transaction should not be lost by any failure
  – Enforced by the transaction recovery subsystem
Schedules of operations
¨ The order of execution of operations is called the schedule
  – Schedule S orders the operations of n transactions T1, T2, ... Tn
  – Operations from different transactions can be interleaved
  – Operations from the same transaction must be in order
  – S is a total order: for any two operations, one is before the other
¨ The main concern is the interleaving of read and write operations
  – Notation: b, r, w, e, c, a for begin, read, write, end, commit, abort
  – Can omit begin and end without loss of clarity
  – Use transaction id (number) as a subscript for each operation
  – E.g. S = r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)
Conflicts
¨ Two operations in a schedule conflict if:
  – They belong to different transactions
  – They access the same item X
  – At least one operation is a write_item(X)
¨ Example: S = r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)
¨ r1(X) and w2(X) are in conflict; r2(X) and w1(X) are in conflict
¨ r1(X) and r2(X) do not conflict with each other (why?)
¨ w2(X) and w1(Y) do not conflict (why?)
¨ r1(X) and w1(X) do not conflict (why?)
¨ Two operations conflict if swapping them results in a different outcome
  – E.g. swapping r1(X) and w2(X) can change value of X read by T1
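The three conditions for a conflict translate directly into a check over pairs of operations. The sketch below is an illustration, assuming schedules are written in the r1(X); w2(X) notation; `parse` and `conflicts` are hypothetical helper names.

```python
import re
from itertools import combinations

def parse(schedule):
    """Turn 'r1(X); w2(X)' into [('r', 1, 'X'), ('w', 2, 'X')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def conflicts(schedule):
    """Return every ordered pair of operations that conflict in the schedule."""
    ops = parse(schedule)
    return [(a, b) for a, b in combinations(ops, 2)
            if a[1] != b[1]                 # different transactions
            and a[2] == b[2]                # same item
            and "w" in (a[0], b[0])]        # at least one write

S = "r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)"
pairs = conflicts(S)
for a, b in pairs:
    print(a, "conflicts with", b)
```

For this S the check also reports w1(X) and w2(X), a write-write conflict that the slide's list does not spell out.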
Complete Schedule
¨ A schedule S of n transactions is a complete schedule if:
  – The operations in S are exactly those of T1, T2, ... Tn, including a commit or abort operation as the last in each transaction
    • So no active transactions at the end of the schedule
  – Every pair of operations in Ti is in the same order in S as it is in Ti
  – The order of every pair of conflicting operations is specified in S
    • Don’t have to specify the order of nonconflicting operations
    • Hence can be a partial order on the operations
¨ In live systems, schedules are rarely complete
  – New transactions are always starting
¨ Define the committed projection of a schedule S, C(S)
  – Only operations in S belonging to committed transactions
Recoverability of Schedules
¨ Some schedules are easier to recover from than others
  – Attempt to characterize schedules that are easily recoverable
¨ Recoverable schedule: once T is committed, never have to undo T
  – Helps ensure the durability property
  – Nonrecoverable schedules should not be allowed by DBMS
¨ Formal definition for schedule S to be recoverable:
  – T reads from transaction T’ if item X is first written by T’, later read by T
  – S is recoverable if no transaction T in S commits until all transactions T’ that T has read from have committed
  – AND T’ must not have been aborted before T reads X
Recoverable schedules
¨ There is always a way to recover a recoverable schedule
  – But it may still be quite complex to do so
¨ Example: S = r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1;
  – Schedule S is recoverable by the previous definition
  – Note: S suffers from lost updates: this does not affect recoverability
¨ The following schedule is not recoverable – why?
  – r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1
¨ Possible fixes to the schedule:
  – Postpone the commit c2: r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); c1; c2
  – Abort both: r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2
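The recoverability definition can be checked mechanically on these example schedules. The sketch below is a simplified illustration: `parse` and `is_recoverable` are hypothetical names, and it assumes the most recent writer of an item stays its source even after that writer aborts (a real system would restore the earlier value).

```python
import re

def parse(schedule):
    """Turn 'r1(X); c2' into [('r', 1, 'X'), ('c', 2, '')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rwca])(\d+)(?:\((\w+)\))?", schedule)]

def is_recoverable(schedule):
    last_writer = {}   # item -> transaction that most recently wrote it
    reads_from = {}    # T -> set of transactions T has read from
    finished = {}      # T -> 'c' (committed) or 'a' (aborted)
    for op, t, item in parse(schedule):
        if op == "w":
            last_writer[item] = t
        elif op == "r":
            src = last_writer.get(item)
            if src is not None and src != t and finished.get(src) != "a":
                reads_from.setdefault(t, set()).add(src)
        elif op == "c":
            # T may commit only if everything it read from has already committed.
            if any(finished.get(src) != "c" for src in reads_from.get(t, ())):
                return False
            finished[t] = "c"
        else:  # op == "a"
            finished[t] = "a"
    return True

original = is_recoverable("r1(X); r2(X); w1(X); r1(Y); w2(X); c2; w1(Y); c1")
broken = is_recoverable("r1(X); w1(X); r2(X); r1(Y); w2(X); c2; a1")
postponed = is_recoverable("r1(X); w1(X); r2(X); r1(Y); w2(X); w1(Y); c1; c2")
both_aborted = is_recoverable("r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2")
print(original, broken, postponed, both_aborted)
```

In the broken schedule T2 reads the X written by T1 and commits while T1 is still active, so the check fails at c2.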
Cascading rollback and strict schedule
¨ Cascading rollback is when an uncommitted transaction has to be rolled back because it read from a transaction that failed
  – E.g. T2 in previous example r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2
  – Try to avoid cascading rollback – can be quite time consuming
¨ Can define a cascadeless schedule:
  – Every transaction only reads from committed transactions
  – E.g. move back the r2(X): r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2
¨ A strict schedule is the most restrictive type
  – Don’t read or write X until the last transaction to write X commits
  – Simple to undo writes: just restore the old value of X
    • If not strict, undoing the write of an aborted transaction is not enough
Relation between concepts
¨ Ordering: Strict ⊂ Cascadeless ⊂ Recoverable
  – Strict: Don’t read or write X after T has written X, until T commits
  – Cascadeless: Transactions only read from committed transactions
  – Recoverable: Transactions commit only after transactions they have read from commit
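The cascadeless and strict conditions can be tested with one pass over a schedule. This is a rough sketch under stated assumptions: `parse` and `classify` are hypothetical names, and a transaction counts as finished once it has committed or aborted.

```python
import re

def parse(schedule):
    """Turn 'w1(X); c1' into [('w', 1, 'X'), ('c', 1, '')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rwca])(\d+)(?:\((\w+)\))?", schedule)]

def classify(schedule):
    """Return (cascadeless, strict) for the schedule."""
    last_writer, done = {}, set()
    cascadeless = strict = True
    for op, t, item in parse(schedule):
        if op in "ca":
            done.add(t)
            continue
        src = last_writer.get(item)
        # The item is "dirty" if another, still-active transaction wrote it last.
        dirty = src is not None and src != t and src not in done
        if dirty:
            strict = False              # neither reads nor writes are allowed
            if op == "r":
                cascadeless = False     # reads of dirty data break cascadelessness
        if op == "w":
            last_writer[item] = t
    return cascadeless, strict

strict_example = classify("r1(X); w1(X); r1(Y); w1(Y); c1; r2(X); w2(X); c2")
cascading_example = classify("r1(X); w1(X); r2(X); r1(Y); w2(X); a1; a2")
blind_write_example = classify("w1(X); w2(X); c1; c2")
print(strict_example, cascading_example, blind_write_example)
```

The last schedule is a made-up case that is cascadeless but not strict: T2 overwrites X before T1 commits, yet nobody reads uncommitted data.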
Serializability
¨ Recoverability did not consider correctness (isolation)
  – Serializability is concerned with this property
¨ There are some simple approaches to serializability
  – Consider two transactions T1 and T2 submitted at same time
  – Either do T1 entirely, before T2, or vice-versa
  – Not great: limits throughput, can cause blocking
Serial and non-serial schedules
¨ A schedule S is serial if for every transaction T in S, all operations in T are executed consecutively, with no operations from other transactions in between; else, it is nonserial
Serial Schedules
¨ Only one transaction is active at any time in a serial schedule
  – If transactions are independent, every serial schedule is correct
¨ Serial schedules limit concurrency by prohibiting interleaving
  – Must wait for all I/O to finish... very slow
  – Serial schedules are considered unacceptable in practice
¨ Accept schedules that are equivalent to serial ones in effect
  – Which schedules on the previous slide are equivalent to serial?
  – Which suffer from the lost update problem?
¨ Serializable schedule: one that is equivalent to a serial one
  – Consider this to be our definition of “correctness”
¨ Need to define equivalence of schedules!
  – There are several alternate definitions with different properties
Conflict serializable
¨ Result equivalent: two schedules are result equivalent if they produce the same final state
  – May happen by chance given a particular initial state
¨ Conflict equivalent is the most commonly used definition
  – The order of any two conflicting operations is the same in both
  – S is conflict serializable if it is conflict equivalent to some serial schedule S’
  – That is, the nonconflicting operations can be reordered to make S’
A & D are conflict equivalent:
• r2(X) follows w1(X) in both
• r1(Y); w1(Y) in D doesn’t conflict with T2 so can be moved earlier
Two ops in a schedule conflict if:
• They are in different transactions
• They access the same item X
• At least one op is a write_item(X)
Testing for conflict serializability
¨ Create a serialization graph of the read and write operations
  – A directed graph with nodes T1, ... Tn
  – Directed edge (Tj → Tk) if an op in Tj precedes a conflicting op in Tk
  – S is conflict serializable if and only if its serialization graph has no cycles
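The graph test can be sketched in a few lines of Python. This is an illustrative version, assuming schedules in the r1(X); w2(X) notation; the helper names `parse`, `precedence_edges` and `has_cycle` are hypothetical, and the schedules tested are the earlier example S and a serial rearrangement of it rather than the slide's A to D, which are not in this transcript.

```python
import re

def parse(schedule):
    """Turn 'r1(X); w2(X)' into [('r', 1, 'X'), ('w', 2, 'X')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def precedence_edges(schedule):
    """Edge (Tj, Tk) whenever an op in Tj precedes a conflicting op in Tk."""
    ops = parse(schedule)
    edges = set()
    for i, (op_a, ta, xa) in enumerate(ops):
        for op_b, tb, xb in ops[i + 1:]:
            if ta != tb and xa == xb and "w" in (op_a, op_b):
                edges.add((ta, tb))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    state = {}  # node -> 'visiting' or 'done'

    def visit(u):
        state[u] = "visiting"
        for v in graph[u]:
            if state.get(v) == "visiting" or (v not in state and visit(v)):
                return True
        state[u] = "done"
        return False

    return any(u not in state and visit(u) for u in graph)

interleaved = precedence_edges("r1(X); r2(X); w1(X); r1(Y); w2(X); w1(Y)")
serial = precedence_edges("r1(X); w1(X); r1(Y); w1(Y); r2(X); w2(X)")
print(interleaved, has_cycle(interleaved))
print(serial, has_cycle(serial))
```

The interleaved schedule has edges in both directions between T1 and T2, so it is not conflict serializable; the serial one has a single edge and no cycle.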
Serialization Graph
¨ Graphs for schedules A, B, C, D:
  – Showing the name of the item causing the edges as its label
¨ Can create an equivalent serial schedule S’ from the graph of S
  – When there is an edge (Ti → Tj), Ti must appear before Tj in S’
  – Else, resolve ordering arbitrarily [make total order from partial order]
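Turning the acyclic graph into a serial order is a topological sort. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) follows; the function name `serial_order` and the sample edge sets are assumptions for illustration.

```python
from graphlib import TopologicalSorter, CycleError

def serial_order(nodes, edges):
    """Return one serial order consistent with the edges, or None if there is a cycle."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    preds = {n: set() for n in nodes}
    for before, after in edges:
        preds[after].add(before)
    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None

chain = serial_order([1, 2, 3], {(1, 2), (2, 3)})
partial = serial_order([1, 2, 3], {(1, 3), (2, 3)})
cyclic = serial_order([1, 2], {(1, 2), (2, 1)})
print(chain, partial, cyclic)
```

When the edges leave two transactions unordered, as with T1 and T2 in the second call, any order between them is acceptable, which is the arbitrary resolution the slide mentions.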
Serializability Example
Serializability
¨ In practice, it is hard to check for serializability
  – Interleaving of concurrent operations controlled by the OS
  – DBMS can’t specify the exact order of execution of parallel tasks
  – Don’t want to check for serializability after the fact
¨ Instead, design schedules based on protocols
  – These ensure that all realized schedules are serializable
¨ Not feasible to mark start and end of schedules in live systems
  – Consider only the committed projection of a schedule
    • I.e. the operations from committed transactions
¨ Most common technique is two-phase locking (2PL) [later]
  – Prevent transactions that could interfere with each other
  – Other protocols: timestamp ordering, optimistic concurrency control
View Equivalence
¨ View equivalence is a weaker notion than conflict equivalence
  – Based on the view of the data witnessed by each schedule
¨ Schedules S and S’ are said to be view equivalent if:
  – Both S and S’ include all operations of the same set of transactions
  – For any operation ri(X) in S, if the read value of X was written by operation wj(X), the same condition must hold for S’
  – If wk(Y) is the last operation to write to Y in S, then wk(Y) must also be the last to write to Y in S’
¨ Read operations see the same view in both schedules
  – The final write is the same in both, so the same state is reached
¨ S is view serializable if it is view equivalent to a serial schedule
View and Conflict Serializability
¨ The constrained write assumption (no blind writes):
  – CWA: every wi(X) in Ti is preceded by ri(X)
  – Implies computation of new value of X depends on the old value
  – A blind write is when X is written without reading it first
¨ View and conflict serializability coincide if CWA holds
¨ Unconstrained write assumption: blind writes are allowed
¨ E.g. r1(X); w2(X); w1(X); w3(X); c1; c2; c3
  – is view equivalent to the serial schedule r1(X); w1(X); c1; w2(X); c2; w3(X); c3
  – is not conflict equivalent to any serial schedule
¨ Testing for view serializability is NP-hard
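For a handful of transactions, view serializability can be checked by brute force over all serial orders. The sketch below is only an illustration of the definition (its running time grows factorially, in line with the hardness result); `parse`, `view` and `view_serial_orders` are hypothetical names, and commit operations are ignored.

```python
import re
from itertools import permutations

def parse(schedule):
    """Keep only reads and writes: 'r1(X); c1' becomes [('r', 1, 'X')]."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def view(ops):
    """Return (what each transaction's reads saw, final writer of each item)."""
    last_writer, reads = {}, {}
    for op, t, item in ops:
        if op == "r":
            # None means the read saw the initial value of the item.
            reads.setdefault(t, []).append((item, last_writer.get(item)))
        else:
            last_writer[item] = t
    return reads, last_writer

def view_serial_orders(schedule):
    """List every serial order of the transactions that is view equivalent."""
    ops = parse(schedule)
    tids = sorted({t for _, t, _ in ops})
    target = view(ops)
    matches = []
    for order in permutations(tids):
        serial = [o for t in order for o in ops if o[1] == t]
        if view(serial) == target:
            matches.append(order)
    return matches

orders = view_serial_orders("r1(X); w2(X); w1(X); w3(X); c1; c2; c3")
print(orders)
```

Only the order T1, T2, T3 survives: T1 must read the initial X, and T3 must perform the final write, while the blind writes of T2 and T1 in between are never observed.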
Venn diagram of schedules
http://en.wikipedia.org/wiki/Schedule_%28computer_science%29
Other types of schedule equivalence
¨ In some situations, can relax the definition of equivalence
  – Such as debit-credit transactions (e.g. bank account updates)
  – All transactions add or subtract to the value of a data item
  – Can have correct schedules that are not serializable
    • Because addition and subtraction commute
¨ Consider two transactions that want to move money
  – T1: r1(X); X ← X – 10; w1(X); r1(Y); Y ← Y + 10; w1(Y);
  – T2: r2(Y); Y ← Y – 20; w2(Y); r2(X); X ← X + 20; w2(X);
¨ Schedule S: r1(X); w1(X); r2(Y); w2(Y); r1(Y); w1(Y); r2(X); w2(X)
  – S is not serializable but is correct because of transaction semantics
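The claim that S is correct can be checked by executing it and comparing with both serial orders. This is a minimal simulation, assuming starting balances of 100 for X and Y (not given on the slide) and a hypothetical `run` helper in which each write stores the value the transaction last read plus its delta.

```python
def run(steps, db):
    """Execute (txn, op, item, delta) steps; reads fill a private local copy."""
    db = dict(db)     # work on a copy so the starting state is preserved
    local = {}        # (txn, item) -> value read into local memory
    for txn, op, item, delta in steps:
        if op == "r":
            local[(txn, item)] = db[item]
        else:
            db[item] = local[(txn, item)] + delta
    return db

# T1 moves 10 from X to Y; T2 moves 20 from Y to X.
T1 = [(1, "r", "X", 0), (1, "w", "X", -10), (1, "r", "Y", 0), (1, "w", "Y", +10)]
T2 = [(2, "r", "Y", 0), (2, "w", "Y", -20), (2, "r", "X", 0), (2, "w", "X", +20)]

# Schedule S: r1(X); w1(X); r2(Y); w2(Y); r1(Y); w1(Y); r2(X); w2(X)
S = [T1[0], T1[1], T2[0], T2[1], T1[2], T1[3], T2[2], T2[3]]

start = {"X": 100, "Y": 100}   # assumed starting balances
interleaved = run(S, start)
serial_12 = run(T1 + T2, start)
serial_21 = run(T2 + T1, start)
print(interleaved, serial_12, serial_21)
```

All three runs end in the same state because every read-modify-write pair runs without another transaction touching that item in between, and the additions and subtractions commute.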
Transaction Support in SQL
¨ SQL allows the definition of atomic transactions
¨ No explicit begin_transaction, but must COMMIT or ROLLBACK
¨ The access mode of the transaction can be specified
  – READ ONLY or READ WRITE (default)
¨ The diagnostic area keeps error data on n previous statements
¨ The isolation level defines how strict to be with transactions
  – SERIALIZABLE (default)
  – Lower levels: REPEATABLE READ, READ COMMITTED, READ UNCOMMITTED
    • May allow a transaction to read a value that has not been committed, or read a value twice in a transaction and get two different values
Sample SQL transaction
¨ So you will know what one looks like:
    EXEC SQL WHENEVER SQLERROR GO TO UNDO;
    EXEC SQL SET TRANSACTION
        READ WRITE
        DIAGNOSTICS SIZE 5
        ISOLATION LEVEL SERIALIZABLE;
    EXEC SQL INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY)
        VALUES ('Robert','Smith','991004321',2,35000);
    EXEC SQL UPDATE EMPLOYEE
        SET SALARY = SALARY * 1.1 WHERE DNO = 2;
    EXEC SQL COMMIT;
    GOTO THE_END;
    UNDO: EXEC SQL ROLLBACK;
    THE_END: ...
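The same commit-or-rollback pattern can be tried out with Python's built-in `sqlite3` module. This is a rough analogue, not a translation of the embedded SQL: SQLite does not accept SET TRANSACTION, so the access mode and isolation level are left at its defaults, and the table definition, the CHECK constraint on SALARY, and the function `hire_and_raise` are assumptions made to force a failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the salary cap exists only to make the second call fail.
conn.execute(
    "CREATE TABLE EMPLOYEE (FNAME TEXT, LNAME TEXT, SSN TEXT PRIMARY KEY, "
    "DNO INTEGER, SALARY REAL CHECK (SALARY <= 40000))"
)
conn.commit()

def hire_and_raise(conn, ssn):
    """Insert an employee and give department 2 a 10% raise, all or nothing."""
    try:
        conn.execute(
            "INSERT INTO EMPLOYEE (FNAME, LNAME, SSN, DNO, SALARY) "
            "VALUES ('Robert', 'Smith', ?, 2, 35000)",
            (ssn,),
        )
        conn.execute("UPDATE EMPLOYEE SET SALARY = SALARY * 1.1 WHERE DNO = 2")
        conn.commit()        # plays the role of EXEC SQL COMMIT
        return True
    except sqlite3.Error:
        conn.rollback()      # plays the role of the UNDO label
        return False

first = hire_and_raise(conn, "991004321")    # raise to 38500 stays under the cap
second = hire_and_raise(conn, "991004322")   # second raise breaks the cap, so roll back
rows = conn.execute("SELECT SSN, SALARY FROM EMPLOYEE ORDER BY SSN").fetchall()
print(first, second, rows)
```

The second call's INSERT succeeds but its UPDATE violates the constraint, so the rollback removes the inserted row as well: the transaction has no partial effect.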
Summary
¨ Saw the concepts of transactions and concurrency control
¨ ACID properties: atomicity, consistency, isolation, durability
¨ System logging, commit points, and failure recovery
¨ Schedules, serializability, and conflicts
¨ Multiple definitions of serializability, and checking
¨ Transactions in SQL
¨ Chapter: “Introduction to Transaction Processing Concepts and Theory” in Elmasri and Navathe