CS4231 Parallel and Distributed Algorithms, AY 2006/2007 Semester 2, Lecture 11. Instructor: Haifeng YU

Upload: kordell

Post on 09-Jan-2016

Page 1: CS4231 Parallel and Distributed Algorithms

AY 2006/2007 Semester 2

Lecture 11

Instructor: Haifeng YU

Page 2: Today’s Roadmap

Back to parallel systems

Some simplified exploration of concurrency control in database systems

Every database is a parallel system

http://research.microsoft.com/~philbe/ccontrol/

Define “sequential consistency” in databases: Serializability

Two-phase locking protocol to ensure serializability

Define “linearizability” in databases: External consistency

Two-phase locking ensures external consistency as well

Page 3: Database is Just an Abstract Data Type

Abstract data type: A piece of data with allowed operations on the data

Integer X, read(), write()

Stack, push(), pop()

By definition, a database is a shared abstract data type

Accessed by multiple users

Processes may perform various operations (called transactions) on the database

Database consistency specifies what behavior is allowed when the database is accessed by multiple processes

Page 4: Transactions

Operations are called transactions in the database context

There can be infinitely many different kinds of transactions (a database is more flexible than, for example, a stack!)

Each transaction may contain:

    StartTransaction();
    CommitTransaction();
    AbortTransaction();
    Read(x);
    Write(y, value);

The term operation in the textbook refers to Read() and Write(). To avoid confusion, we will call them primitive operations.

An Example Transaction

    StartTransaction();
    seatBooked = false;
    read(number of available seats on a flight);
    if (number > 0) {
        number--;
        write back number;
        seatBooked = true;
    }
    CommitTransaction();

Page 5: The Scheduler and Concurrency Control

[Figure: Transaction1, Transaction2 and Transaction3 send Start/Commit/Abort/Read/Write to the Scheduler; the Scheduler sends Read/Write to the Database]

The job of the scheduler is concurrency control (i.e., ensuring the consistency of the database when it is accessed by multiple processes)

Page 6: The Scheduler and Concurrency Control

[Figure: Transaction1, Transaction2 and Transaction3 send Start/Commit/Abort/Read/Write to the Scheduler; the Scheduler sends Read/Write to the Database]

The scheduler itself is multi-threaded.

It may submit reads/writes to the database in parallel.

We assume that the database ensures sequential consistency for these reads/writes.

Page 7: Carry Over Definitions from Lecture 3

A history H is a sequence of invocations and responses of transactions ordered by wall clock time

Sequential history

Legal sequential history

Equivalency between two histories

Process order

A history H is sequentially consistent if it is equivalent to some legal sequential history S that preserves process order

Page 8: Serializability and Sequential Consistency

Most databases use serializability as the definition of consistency: a customized version of sequential consistency specially designed for databases

Same as sequential consistency except for the following caveats

Caveat 1: When defining serializability, we assume that all transactions are from different processes (no process issues two transactions)

What does it mean: Process order is empty

Why reasonable: In DB applications, this is usually the case

Why helpful: Simplifies the design of the scheduler and gives it more flexibility to improve performance

Corner case: If a user issues two transactions sequentially to the database, the second transaction may not see the effects of the first.

This does not violate serializability, but most implementations of the scheduler will not have such behavior

Page 9: Serializability and Sequential Consistency

Caveat 2:

In sequential consistency, each operation is executed by a single process (each operation is sequential)

Transactions are complex enough that we should allow parallel reads/writes in a transaction (as in the book).

Each transaction is itself a parallel system!

But we will assume here that each transaction is sequential for this lecture (makes no significant difference in terms of the results)

Read the book if you are interested in the extension to parallel transactions

Page 10: Serializability and Sequential Consistency

Caveat 3: Definition of equivalency

Two histories are equivalent if they have the same set of events

Same events imply all responses are the same

For transactions, responses include all the values written into the database in the transaction and all the values output to the user

Transactions may be so complex that we cannot easily make the judgment. Consider the following transaction:

    UpdateX() {
        StartTransaction();
        tmp = Read(X);
        tmp = (4*tmp^2 + 5*tmp + 1);
        Write(X, tmp);
        CommitTransaction();
    }

Page 11: Example histories for UpdateX()

First history, initially x = 1:

    process 0: tmp = Read(X);       (reads 1)
    process 0: tmp = 4tmp^2+5tmp+1
    process 1: tmp = Read(X);       (reads 1)
    process 1: tmp = 4tmp^2+5tmp+1
    process 0: Write(X, tmp);       (writes 10)
    process 1: Write(X, tmp);       (writes 10)

A legal sequential history will have a final x value of 451.

This history is not sequentially consistent.

Second history, initially x = -0.5:

    process 0: tmp = Read(X);       (reads -0.5)
    process 0: tmp = 4tmp^2+5tmp+1
    process 1: tmp = Read(X);       (reads -0.5)
    process 1: tmp = 4tmp^2+5tmp+1
    process 0: Write(X, tmp);       (writes -0.5)
    process 1: Write(X, tmp);       (writes -0.5)

A legal sequential history will have a final x value of -0.5. (-0.5 is the root of the equation tmp = 4tmp^2+5tmp+1)

This history is sequentially consistent.

Whether the history is sequentially consistent depends on the value of x (and on insight into the code!)
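The numbers in the two histories can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the interleaved case simply models "both transactions read the initial x before either writes".

```python
def update(x):
    """The computation inside UpdateX(): 4x^2 + 5x + 1."""
    return 4 * x**2 + 5 * x + 1


def serial_final(x):
    """Final x when two UpdateX() transactions run one after the other."""
    return update(update(x))


def interleaved_final(x):
    """Final x when both transactions read the initial x, then both write."""
    first_write = update(x)   # value written by one process
    second_write = update(x)  # the other process computed from the same read
    assert first_write == second_write
    return second_write


for x0 in (1, -0.5):
    print(x0, serial_final(x0), interleaved_final(x0))
```

For x = 1 the serial and interleaved results differ, so the interleaved history cannot be equivalent to any legal sequential history. For x = -0.5 they coincide, which is exactly why a scheduler cannot decide equivalence without understanding the computation.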

Page 12: Serializability and Sequential Consistency

Caveat 3: Definition of equivalency

Schedulers are not as smart as we are to figure that out

So we are going to be more pessimistic and define conflict equivalency

Two primitive operations are conflicting if:

They are both writes and they write the same data item, or

One is a read and the other is a write, and they read/write the same data item

Two histories H and H’ are conflict equivalent iff:

They contain the same set of transactions

For any two conflicting primitive operations p1 and p2, p1 is before p2 in H if and only if p1 is before p2 in H’

Conflict equivalency implies equivalency (assuming transactions are deterministic) (why?)

The reverse is not true (by the earlier example)
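The two definitions above can be sketched in Python. The encoding is an assumption made for illustration: a primitive operation is a tuple (transaction, kind, item) with kind "R" or "W", each such tuple appears at most once in a history, and two operations of the same transaction are not treated as conflicting.

```python
from itertools import combinations


def conflicts(p, q):
    """True if p and q touch the same item and at least one is a write."""
    (t1, k1, x1), (t2, k2, x2) = p, q
    return t1 != t2 and x1 == x2 and "W" in (k1, k2)


def conflict_equivalent(h1, h2):
    """Check that h1 and h2 order every conflicting pair the same way."""
    if sorted(h1) != sorted(h2):  # must contain the same operations
        return False
    pos2 = {op: i for i, op in enumerate(h2)}
    for p, q in combinations(h1, 2):  # p is before q in h1
        if conflicts(p, q) and pos2[p] > pos2[q]:
            return False
    return True


# The UpdateX() history: both processes read X, then both write X.
h = [("P0", "R", "X"), ("P1", "R", "X"), ("P0", "W", "X"), ("P1", "W", "X")]
s01 = [("P0", "R", "X"), ("P0", "W", "X"), ("P1", "R", "X"), ("P1", "W", "X")]
s10 = [("P1", "R", "X"), ("P1", "W", "X"), ("P0", "R", "X"), ("P0", "W", "X")]
print(conflict_equivalent(h, s01), conflict_equivalent(h, s10))
```

Under this encoding the interleaved UpdateX() history is conflict equivalent to neither serial order, even though with x = -0.5 it is equivalent to both. This is the sense in which conflict equivalency is the more pessimistic notion.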

Page 13: Serializability

A history H is serializable if it is conflict equivalent to some legal sequential history S

(For comparison: A history H is sequentially consistent if it is equivalent to some legal sequential history S that preserves process order.)

Different from linearizability: Serializability does not need to preserve the operation partial order. A later transaction may not see the effects of an earlier transaction.

This is possible in most commercial databases.

But the chance is small due to the way the scheduler is actually implemented. (You actually need to spend some effort to increase that chance.)

Page 14: Serialization Graph

A serialization graph SG(H) of a history H is a directed graph where:

Each transaction is a vertex in the graph

A directed edge from W to V exists iff W has a primitive operation p1 and V has a primitive operation p2 where p1 is before p2 in H and p1 and p2 conflict

Example history: R(x) (by T1), R(x) (by T2), W(x) (by T1), W(y) (by T2), W(y) (by T1), R(x) (by T3), W(x) (by T3)

[Figure: SG(H) for the example history, with vertices T1, T2 and T3]

SG(H) may or may not be transitive
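The edge rule can be applied mechanically to the example history. The sketch below assumes the same illustrative encoding of operations as (transaction, kind, item) tuples and only adds edges between distinct transactions; the function name is hypothetical.

```python
def serialization_graph(history):
    """Return the set of edges (W, V) of SG(H) for a list of operations."""
    edges = set()
    for i, (t1, k1, x1) in enumerate(history):
        for t2, k2, x2 in history[i + 1:]:  # operations after position i
            same_item = x1 == x2
            one_is_write = "W" in (k1, k2)
            if t1 != t2 and same_item and one_is_write:
                edges.add((t1, t2))
    return edges


H = [
    ("T1", "R", "x"), ("T2", "R", "x"), ("T1", "W", "x"),
    ("T2", "W", "y"), ("T1", "W", "y"),
    ("T3", "R", "x"), ("T3", "W", "x"),
]
print(sorted(serialization_graph(H)))
```

For this history the computed edges are T2 -> T1, T1 -> T3 and T2 -> T3, so this particular graph happens to be transitive and acyclic.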

Page 15: Serializability Theorem

Theorem: A history H is serializable iff SG(H) is acyclic.

If SG(H) is acyclic, then H is serializable:

Without loss of generality, let T1 T2 … be a topological sorting of the vertices in SG(H). Let S be the sequential history obtained by executing T1 T2 … sequentially. By definition, S is a legal sequential history. We need to show that H is conflict equivalent to S.

Prove by contradiction. Assume H is not conflict equivalent to S. Then there exist transactions W (containing primitive operation p1) and V (containing primitive operation p2) such that p1 and p2 conflict and are ordered differently in H and S. Without loss of generality, suppose p1 is before p2 in H. Then there must be an edge from W to V in SG(H), so the topological sorting places W before V, and p1 is before p2 in S as well. Contradiction.

Page 16: Serializability Theorem

Theorem: A history H is serializable iff SG(H) is acyclic.

If H is serializable, then SG(H) is acyclic:

Prove by contradiction and assume that SG(H) has a cycle T1 -> T2 -> … -> Tk -> T1. History H is conflict equivalent to some sequential history S. Because T1 has an edge to T2 in SG(H), T1 has an operation p1 and T2 has an operation p2 where p1 and p2 conflict and p1 is before p2 in H. Since S is conflict equivalent to H, p1 must be before p2 in S as well. Since S is a serial history, T1 must be before T2 in S.

By the same argument, T2 is before T3 in S, T3 is before T4 in S, …, and Tk is before T1 in S. This is impossible, however, because S is a serial history.
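The theorem turns the serializability test into a cycle check, and the first half of the proof says that any topological order is a valid serial order. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows; the function name `serial_order` and the edge-set input format are assumptions for this example.

```python
from graphlib import CycleError, TopologicalSorter


def serial_order(transactions, edges):
    """Return a serial order consistent with SG(H), or None if SG(H) has a cycle."""
    sorter = TopologicalSorter({t: set() for t in transactions})
    for w, v in edges:
        sorter.add(v, w)  # edge W -> V: W must come before V
    try:
        return list(sorter.static_order())
    except CycleError:
        return None


# Edges of the example serialization graph: T2 -> T1, T1 -> T3, T2 -> T3.
print(serial_order(["T1", "T2", "T3"], {("T2", "T1"), ("T1", "T3"), ("T2", "T3")}))
# A two-transaction cycle has no serial order.
print(serial_order(["A", "B"], {("A", "B"), ("B", "A")}))
```

Both building the graph and sorting it take a polynomial number of steps in the length of the history.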

Page 17: Serialization Graph and Theorem

The serialization graph gives us a systematic way to determine whether a history is serializable

The determination can always be done in a polynomial number of steps

But for sequential consistency:

We did not have a systematic way

In some cases, we have to enumerate all serial histories to compare: an exponential number of steps

Why did we not discuss this before for sequential consistency?

Can you derive a similar theorem for sequential consistency, so that we can always make the determination in a polynomial number of steps?

Page 18: Ensuring Serializability: All About Performance

The scheduler can protect the entire database using a single critical section

Essentially produces a sequential history

Not efficient: readers (i.e., query transactions) should be able to access the database concurrently

The scheduler can protect the entire database using a Reader/Writer lock (cf. the Reader/Writer problem in Lecture 2)

Query transactions obtain the reader lock

Update transactions obtain the writer lock

But databases are large and each transaction only touches a small portion

Page 19: Ensuring Serializability: All About Performance

Partition the database and use separate reader/writer locks for each partition

In the extreme, each partition is a data item

Locking individual data items for a transaction:

    AcquireReaderLock(x);
    AcquireWriterLock(y);
    Read(x);
    do some computation;
    Write(y, value);
    ReleaseReaderLock(x);
    ReleaseWriterLock(y);

Page 20: Ensuring Serializability: All About Performance

(Same code as on the previous page: locking individual data items for a transaction)

But the performance is still not very good

We may overestimate the set of data items that a transaction needs to access

We hold the locks for too long (imagine that the computation is solving some time-consuming problem)

Page 21: Ensuring Serializability: All About Performance

Lock the data items only when we use them:

    AcquireReaderLock(x);
    Read(x);
    ReleaseReaderLock(x);
    do some computation;
    AcquireWriterLock(y);
    Write(y, value);
    ReleaseWriterLock(y);

This won’t work (even intuitively). Consider the following interleaving of two processes (time flows downward):

    process 0: AcquireReaderLock(x); Read(x); ReleaseReaderLock(x);
    process 1: AcquireWriterLock(x); Write(x); ReleaseWriterLock(x);
    process 0: do some computation;
    process 1: do some computation;
    process 1: AcquireWriterLock(y); Write(y); ReleaseWriterLock(y);
    process 0: AcquireWriterLock(y); Write(y); ReleaseWriterLock(y);

Page 22: Ensuring Serializability: All About Performance

Prove that the history is not serializable using the serialization theorem

It is impossible here to prove that it is not sequentially consistent

(The history is the same two-process interleaving as on the previous page.)
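The serialization theorem can be applied to this history directly. The sketch below assumes the interleaving is: process 0 reads x, process 1 writes x, process 1 writes y, process 0 writes y (the order of the two writes to y is an assumption, chosen because it is the one that makes the history non-serializable). Operations are encoded as (transaction, kind, item) tuples, and `is_serializable` is a hypothetical helper built on the standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter


def is_serializable(history):
    """Build SG(H) and report whether it is acyclic."""
    preds = {txn: set() for txn, _, _ in history}
    for i, (t1, k1, x1) in enumerate(history):
        for t2, k2, x2 in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "W" in (k1, k2):
                preds[t2].add(t1)  # edge t1 -> t2 in SG(H)
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False


# Locks released early: P0 -> P1 (on x) and P1 -> P0 (on y) form a cycle.
early_release = [("P0", "R", "x"), ("P1", "W", "x"), ("P1", "W", "y"), ("P0", "W", "y")]
# If P0 kept its locks until the end, P1 could only run after P0.
held_to_end = [("P0", "R", "x"), ("P0", "W", "y"), ("P1", "W", "x"), ("P1", "W", "y")]
print(is_serializable(early_release), is_serializable(held_to_end))
```

The cycle arises because process 0 releases its lock on x before acquiring its lock on y, which lets process 1 slip in between the two.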

Page 23: A Widely Used Protocol: Two-phase Locking

Two-phase locking:

A transaction must acquire the lock for data item v before reading or writing v

A transaction cannot obtain any further locks once it releases any lock

Growing phase followed by shrinking phase

May result in deadlock

Side note: A transaction may “upgrade” a reader lock to a writer lock. This counts as a new lock acquisition as well.

In the previous example, process 0 will not release the lock on x until the end.
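The two rules can be enforced by a small wrapper around per-item locks. This is a simplified sketch, not the protocol as a real scheduler implements it: it assumes exclusive locks only (no separate reader and writer locks, no upgrades), and the class names `LockTable` and `TwoPhaseTransaction` are invented for illustration. Because `acquire` blocks, two transactions locking items in opposite orders can deadlock, as noted above.

```python
import threading
from collections import defaultdict


class LockTable:
    """One exclusive lock per data item, created on first use."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the dictionary itself

    def lock_for(self, item):
        with self._guard:
            return self._locks[item]


class TwoPhaseTransaction:
    """Refuses to acquire any lock after the first release."""

    def __init__(self, table):
        self.table = table
        self.held = []
        self.shrinking = False

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("two-phase locking violated: acquire after release")
        self.table.lock_for(item).acquire()  # growing phase; may block
        self.held.append(item)

    def release(self, item):
        self.shrinking = True  # the shrinking phase starts here
        self.held.remove(item)
        self.table.lock_for(item).release()

    def release_all(self):
        for item in list(self.held):
            self.release(item)


table = LockTable()
t = TwoPhaseTransaction(table)
t.acquire("x")
t.acquire("y")
t.release("x")
try:
    t.acquire("z")
except RuntimeError as err:
    print(err)
t.release_all()
```

A transaction written this way would call `acquire` just before its first access to each item and `release_all` at commit, which corresponds to process 0 holding the lock on x until the end.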

Page 24: Correctness of Two-phase Locking

Lemma 1: Let H be a history produced by two-phase locking. Suppose that SG(H) contains an edge from W to V. Then there exists some data item x such that W unlocks x before V locks x in H.

Proof: By definition of SG(H), if there is an edge from W to V, there exist two primitive operations p1 (in W) and p2 (in V) such that they conflict and p1 is before p2 in H. Let x be the data item that p1 and p2 read or write.

By the two-phase locking rule, W needs to lock x before p1 occurs and V needs to lock x before p2 occurs. Because p1 and p2 conflict, at least one of them is a write, so W and V cannot hold their locks on x at the same time. Since p1 is before p2, the only possibility is that V locks x after W unlocks x.

Page 25: Correctness of Two-phase Locking

Lemma 2: Let H be a history produced by two-phase locking. Suppose that SG(H) contains the path T_1 -> T_2 -> … -> T_n. Then there exist data items x and y (x and y do not need to be distinct) such that T_1 unlocks x before T_n locks y in H.

Proof: Use induction on n. Lemma 1 proves the case for n = 2.

Assume the lemma holds for n-1; we will prove that it still holds for n.

By the inductive assumption, we know that there exist x and z such that T_1 unlocks x before T_{n-1} locks z in H. Because there is an edge from T_{n-1} to T_n in SG(H), Lemma 1 tells us that we can find a data item y such that T_{n-1} unlocks y before T_n locks y.

By the two-phase locking rule, T_{n-1} can only unlock y after it locks z. (key step!) Thus T_1 unlocks x before T_n locks y.

Page 26: Correctness of Two-phase Locking

Theorem: Every history H generated by two-phase locking is serializable.

Prove by contradiction and assume H is not. Then SG(H) contains a cycle T_1 -> T_2 -> … -> T_n -> T_1. By Lemma 2, we can find data items x and y such that T_1 unlocks x before T_1 locks y. But this violates the two-phase locking rule.

Page 27: Linearizability in Databases

(From Lecture 3) A history H is linearizable if

1. It is equivalent to some legal sequential history S, and

2. The operation partial order induced by H is a subset of the operation partial order induced by S

As with sequential consistency, we will customize the definition for the database context

Caveat 1: Assume that all transactions are from different processes

Caveat 2: Transactions may be parallel (we do not consider these)

Caveat 3: Conflict equivalent instead of equivalent

Page 28: Linearizability in Databases

For databases, linearizability is sometimes called external consistency. A history H is externally consistent if:

1. It is conflict equivalent to some legal sequential history S, and

2. The operation partial order induced by H is a subset of the operation partial order induced by S

Two-phase locking actually ensures external consistency

Cf. slide 8, “most implementations of the scheduler will not have such behavior” that violates external order

Page 29: Two-Phase Locking Preserves External Consistency

Theorem: Any history H generated by two-phase locking is externally consistent.

Proof: For each transaction T in H, we define its linearization point to be the time immediately after it acquires its last lock. Obviously, by the two-phase locking rule, T has not released any locks at its linearization point. (This is where we leverage the two-phase locking property.) We construct a legal sequential history S to be all transactions ordered by their linearization points.

Page 30: Two-Phase Locking Preserves External Consistency

Claim 1: The operation (transaction) partial order induced by H is a subset of the operation (transaction) partial order induced by S

Proof: Suppose W -> V belongs to the transaction partial order induced by H. This means that W finishes before V starts. Obviously W finishes acquiring all its locks before V finishes acquiring all its locks. Thus W’s linearization point is before V’s, and W is before V in S.

Page 31: Two-Phase Locking Preserves External Consistency

Claim 2: H is conflict equivalent to S.

H and S contain the same set of transactions (obvious)

For any two conflicting primitive operations p1 and p2, p1 is before p2 in H if and only if p1 is before p2 in S

Proof: It is sufficient to prove that p1 is before p2 in H implies p1 is before p2 in S (why?)

Let x be the data item accessed by both p1 and p2. Let W be the transaction containing p1 and V be the transaction containing p2. Because p1 is before p2 in H, W must unlock x before V locks x. We have:

[Timeline: W’s linearization point, then W unlocks x, then V locks x, then V’s linearization point]

W will be before V in S, and thus p1 will be before p2 in S
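The construction of S in this proof can be sketched as follows. The trace format is an assumption for illustration: each event is a tuple (time, transaction, action, item) with action "lock" or "unlock", and a transaction's linearization point is approximated by the time of its last lock acquisition.

```python
def linearization_order(events):
    """Order transactions by the time of their last lock acquisition."""
    last_lock = {}
    for time, txn, action, _item in events:
        if action == "lock":
            last_lock[txn] = max(time, last_lock.get(txn, time))
    return sorted(last_lock, key=last_lock.get)


# W and V conflict on x; both follow two-phase locking.
trace = [
    (1, "W", "lock", "x"),
    (2, "W", "lock", "y"),    # W's last lock: its linearization point
    (3, "W", "unlock", "x"),
    (4, "V", "lock", "x"),    # V can lock x only after W unlocks it
    (5, "V", "lock", "z"),    # V's last lock: its linearization point
    (6, "W", "unlock", "y"),
    (7, "V", "unlock", "x"),
    (8, "V", "unlock", "z"),
]
print(linearization_order(trace))
```

In the trace, W's last lock (time 2) precedes its unlock of x (time 3), which precedes V's lock of x (time 4), which is no later than V's last lock (time 5). This is the chain of inequalities in the timeline, so W is placed before V in S.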

Page 32: Summary

Define “sequential consistency” in databases: Serializability

Two-phase locking protocol to ensure serializability

Define “linearizability” in databases: External consistency

Two-phase locking ensures external consistency as well