MVCC on Flash Memory

Fan Yulei, Lab of WAMDM, School of Information, Renmin University of China, Beijing, China, 2009-06-13

Page 1: MVCC on Flash Memory


MVCC on Flash Memory

Fan Yulei,Lab of WAMDM,School of Information,Renmin University of China,Beijing, China,2009-06-13

Page 2: MVCC on Flash Memory

Outline

Motivation

MVCC

Berkeley DB

PostgreSQL

Future work

Page 3: MVCC on Flash Memory

Motivation

Characteristics: Not In-Place Update

[Figure: HDD vs. Flash]

Page 4: MVCC on Flash Memory

Motivation

Transaction

CC
• 2PL: 1st: lock; 2nd: release lock (kinds of lock)
• MVCC: multiple versions (snapshot isolation)
• Conflict graph: directed acyclic graph
• Timestamp: timestamp ordering
• Index CC: B+-tree

Recovery
• Log: log file & data file; checkpoint: D & S
• Transaction: read log file, undo & redo
• Media: read log file, undo & redo; backup database; hot standby: mirrored media

Page 5: MVCC on Flash Memory

MVCC

Monoversion Schedule
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2 (conflict cycle: t1, t2)
• s’ = r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2

Multiversion Schedule & Monoversion Schedule

Multiversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2
• Version function h: h(ri(x)) = wj(x) and h(wi(x)) = wi(x)

Monoversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y2) w1(z1) c1 c2
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
• A monoversion schedule is a special case of a multiversion schedule
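The conflict cycle between t1 and t2 in s can be checked mechanically. The following is a minimal sketch, not part of the original slides: it builds the classical conflict graph of a monoversion schedule and tests it for cycles. The helper names `parse`, `conflict_edges` and `has_cycle` are assumptions made for this illustration.

```python
import re
from itertools import combinations

def parse(schedule):
    """Turn 'r1(x) w1(x) c1' into (kind, txn, item) triples; commits are skipped."""
    return [(k, int(t), x) for k, t, x in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def conflict_edges(schedule):
    """Edge (ti, tj) for every pair of steps on the same item, from different
    transactions, with at least one write, where ti's step comes first."""
    edges = set()
    for (k1, t1, x1), (k2, t2, x2) in combinations(parse(schedule), 2):
        if x1 == x2 and t1 != t2 and "w" in (k1, k2):
            edges.add((t1, t2))
    return edges

def has_cycle(edges):
    """Repeatedly strip nodes without incoming edges; a cycle remains if none exist."""
    nodes = {n for e in edges for n in e}
    edges = set(edges)
    while nodes:
        sources = {n for n in nodes if not any(v == n for _, v in edges)}
        if not sources:
            return True
        nodes -= sources
        edges = {(u, v) for u, v in edges if u not in sources}
    return False

s = "r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2"
s_prime = "r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2"

print(conflict_edges(s), has_cycle(conflict_edges(s)))
print(conflict_edges(s_prime), has_cycle(conflict_edges(s_prime)))
```

In s the pair w1(x), r2(x) orders t1 before t2 while w2(y), r1(y) orders t2 before t1, which closes the cycle; in s’ the read r1(y) comes before w2(y), so only the edge from t1 to t2 remains.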

Page 6: MVCC on Flash Memory

MVCC

Traditional Conflict
• s = w0(x) c0 w1(x) c1 r2(x) w2(y) c2
• m = w0(x0) c0 w1(x1) c1 r2(x0) w2(y2) c2

View Equivalent

Reads-From Relationship
• RF(m) := {(ti, x, tj) | rj(xi) ∈ OP(m) & ti, tj ∈ trans(m)}

View Equivalent
• trans(m) = trans(m’) and RF(m) = RF(m’)

Example
• m = w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• m’ = w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2
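The view-equivalence test above reduces to comparing two sets. As a hedged illustration (the function names `steps`, `trans`, `reads_from` and `view_equivalent` are assumptions, not from the slides), the reads-from relation of the example schedules can be computed like this:

```python
import re

def steps(m):
    """Parse 'r2(x1)' into ('r', 2, 'x', 1): kind, transaction, item, version."""
    return [(k, int(t), x, int(v))
            for k, t, x, v in re.findall(r"([rw])\s*(\d+)\s*\(([a-z])(\d+)\)", m)]

def trans(m):
    """Set of transaction numbers that appear in m."""
    return {t for _, t, _, _ in steps(m)}

def reads_from(m):
    """RF(m) = {(ti, x, tj) | rj(xi) is a step of m}."""
    return {(v, x, t) for k, t, x, v in steps(m) if k == "r"}

def view_equivalent(m1, m2):
    return trans(m1) == trans(m2) and reads_from(m1) == reads_from(m2)

m = "w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2"
m_prime = "w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2"

print(reads_from(m))
print(view_equivalent(m, m_prime))
```

Both schedules contain exactly the reads r3(x0) and r2(x1), so their reads-from relations coincide even though the steps are interleaved differently.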

Page 7: MVCC on Flash Memory

MVCC

Multiversion View Serializability

Serializable but not View Equivalent
• m = w0(x0) w0(y0) c0 r1(x0) r1(y0) w1(x1) w1(y1) c1 r2(x0) r2(y1) c2
• s = w0(x) w0(y) c0 r1(x) r1(y) w1(x) w1(y) c1 r2(x) r2(y) c2

MVSR: m is in MVSR if there is a schedule m’ such that
• m’ is a serial monoversion schedule
• trans(m) = trans(m’)
• m and m’ are view equivalent

Example
• m = w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2) c2
• m’ = w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• s = w0(x) w0(y) c0 r3(x) w3(x) c3 w1(x) c1 r2(x) w2(y) c2

Page 8: MVCC on Flash Memory

MVCC

Conflict Graph
• G(m) = (V, E)
• V = trans(m); E = {(ti, tj) | rj(xi) ∈ OP(m) & ti, tj ∈ trans(m)}
• m and m’ are view equivalent => G(m) = G(m’)

Version Order
• m = w0(x0) w0(y0) w0(z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2) w3(y3) w3(z3) c1 c2 c3 r4(x2) r4(y3) r4(z3) c4
• Version Order = {x0 « x2, y0 « y1 « y3, z0 « z3}

MVSG
• MVSG = G(m) + Version Order
• For rk(xj) and wi(xi) with i, j, k pairwise distinct: if xi « xj then (ti, tj) ∈ E; else (tk, ti) ∈ E
• m ∈ MVSR iff there is a version order « such that MVSG(m, «) has no cycle

[Figure: MVSG with nodes T0, T1, T2, T3, T4; labels r2(x0), r2(y1), r2(x1), r2(y0)]
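The MVSG construction for the example schedule and version order on this slide can be traced in code. This is a sketch under the assumption that every write wi(xi) creates the version numbered after its transaction; the names `steps`, `mvsg_edges` and `has_cycle` are introduced here for illustration only.

```python
import re

def steps(m):
    """Parse 'r2(x1)' into ('r', 2, 'x', 1): kind, transaction, item, version."""
    return [(k, int(t), x, int(v))
            for k, t, x, v in re.findall(r"([rw])\s*(\d+)\s*\(([a-z])(\d+)\)", m)]

def mvsg_edges(m, version_order):
    """version_order maps each item to its version numbers, oldest first."""
    ops = steps(m)
    reads = [(t, x, v) for k, t, x, v in ops if k == "r"]   # rk(xj) as (k, x, j)
    writes = [(t, x) for k, t, x, _ in ops if k == "w"]     # wi(xi) as (i, x)
    edges = {(j, k) for k, _, j in reads if j != k}         # reads-from edges G(m)
    for k, x, j in reads:
        rank = version_order[x].index
        for i, y in writes:
            if y != x or i in (j, k):
                continue
            # xi before xj: ti precedes tj; otherwise the reader tk precedes ti
            edges.add((i, j) if rank(i) < rank(j) else (k, i))
    return edges

def has_cycle(edges):
    """Repeatedly strip nodes without incoming edges; a cycle remains if none exist."""
    nodes = {n for e in edges for n in e}
    edges = set(edges)
    while nodes:
        sources = {n for n in nodes if not any(v == n for _, v in edges)}
        if not sources:
            return True
        nodes -= sources
        edges = {(u, v) for u, v in edges if u not in sources}
    return False

m = ("w0(x0) w0(y0) w0(z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2) "
     "w3(y3) w3(z3) c1 c2 c3 r4(x2) r4(y3) r4(z3) c4")
order = {"x": [0, 2], "y": [0, 1, 3], "z": [0, 3]}

edges = mvsg_edges(m, order)
print(sorted(edges), has_cycle(edges))
```

For this version order the graph is acyclic and admits the topological order t0, t1, t2, t3, t4, which is consistent with m being in MVSR.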

Page 9: MVCC on Flash Memory

MVCC

Multiversion Conflict
• ri(xj) and wk(xk) with ri(xj) < wk(xk)

Multiversion Conflict Serializability (MCSR)
• m’ is a serial monoversion schedule
• trans(m) = trans(m’)
• Pairs of operations in multiversion conflict: same ordering in m and m’

Multiversion Conflict Graph
• E = {(ti, tk) | ri(xj) < wk(xk)}
• m ∈ MCSR iff the multiversion conflict graph of m has no cycle

[Figure: class hierarchy. All schedules ⊃ MVSR ⊃ MCSR ⊃ CSR, and MVSR ⊃ VSR ⊃ CSR]
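Only a read followed by a later write on the same item counts as a multiversion conflict, so the graph is usually sparser than the classical conflict graph. The sketch below is an illustration with assumed helper names (`steps`, `mv_conflict_edges`, `has_cycle`); the schedule `m_bad` is a hypothetical example that does not appear in the slides.

```python
import re

def steps(m):
    """Parse 'r2(x1)' into ('r', 2, 'x', 1): kind, transaction, item, version."""
    return [(k, int(t), x, int(v))
            for k, t, x, v in re.findall(r"([rw])\s*(\d+)\s*\(([a-z])(\d+)\)", m)]

def mv_conflict_edges(m):
    """E = {(ti, tk) | ri(xj) < wk(xk)}, restricted to i != k."""
    ops = steps(m)
    edges = set()
    for pos, (kind, i, x, _) in enumerate(ops):
        if kind != "r":
            continue
        for kind2, k, y, _ in ops[pos + 1:]:
            if kind2 == "w" and y == x and k != i:
                edges.add((i, k))
    return edges

def has_cycle(edges):
    """Repeatedly strip nodes without incoming edges; a cycle remains if none exist."""
    nodes = {n for e in edges for n in e}
    edges = set(edges)
    while nodes:
        sources = {n for n in nodes if not any(v == n for _, v in edges)}
        if not sources:
            return True
        nodes -= sources
        edges = {(u, v) for u, v in edges if u not in sources}
    return False

# Multiversion schedule from the earlier MVCC slide
m = "r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2"
# Hypothetical schedule with two crossing read-then-write pairs
m_bad = "r1(x0) r2(y0) w1(y1) w2(x2) c1 c2"

print(mv_conflict_edges(m), has_cycle(mv_conflict_edges(m)))
print(mv_conflict_edges(m_bad), has_cycle(mv_conflict_edges(m_bad)))
```

The first schedule has no multiversion conflicts between different transactions because r1(y0) follows w2(y2), so its graph is trivially acyclic, whereas its monoversion counterpart s has a conflict cycle.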

Page 10: MVCC on Flash Memory

MVCC

Limit the number of versions: k = 2

• w0(x0) c0 r1(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(x2) c2 → x1, x2
• w0(x0) c0 r1(x0) w1(x1) c1 r2(x1) w2(x2) c2 w3(x3) c3 → x2, x3
• w0(x0) c0 r1(x0) w1(x1) c1 w3(x3) c3 r2(x3) w2(x2) c2 → x2, x3
• w0(x0) c0 r2(x0) w2(x2) c2 r1(x2) w1(x1) c1 w3(x3) c3 → x1, x3
• w0(x0) c0 r2(x0) w2(x2) c2 w3(x3) c3 r1(x3) w1(x1) c1 → x1, x3
• w0(x0) c0 w3(x3) c3 r1(x3) w1(x1) c1 r2(x1) w2(x2) c2 → x1, x2
• w0(x0) c0 w3(x3) c3 r2(x3) w2(x2) c2 r1(x2) w1(x1) c1 → x1, x2

K-version view serializability (kVSR)
• Serializable
• View equivalent
• k newest/nearest versions

Hierarchy Relationship
• VSR = 1VSR ⊂ 2VSR ⊂ 3VSR ⊂ ... ⊂ MVSR

Page 11: MVCC on Flash Memory

MVCC

MVCC Protocol

MVTO (multiversion timestamp ordering)

MV2PL: 2V2PL
• three kinds of locks: rl, wl, cl

MVSGT

ROMV
• Read-only transaction
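The slide only names the protocols. As a rough illustration of the first one, here is a toy sketch of the textbook MVTO rules: a read returns the version with the largest write timestamp not exceeding the reader's timestamp, and a write is rejected if a younger transaction has already read the version the writer would supersede. The class `MVTOStore`, the exception `Abort`, and the simplifications (no commit handling, at most one write per item per transaction) are assumptions of this sketch, not the slides' design.

```python
class Abort(Exception):
    """Raised when a write arrives too late under MVTO."""

class MVTOStore:
    def __init__(self, initial):
        # item -> list of versions [write_ts, value, max_read_ts]; ts 0 is the initial load
        self.versions = {x: [[0, v, 0]] for x, v in initial.items()}

    def _visible(self, item, ts):
        """Version with the largest write timestamp <= ts."""
        return max((v for v in self.versions[item] if v[0] <= ts), key=lambda v: v[0])

    def read(self, ts, item):
        version = self._visible(item, ts)
        version[2] = max(version[2], ts)   # remember the youngest reader
        return version[1]

    def write(self, ts, item, value):
        version = self._visible(item, ts)
        if version[2] > ts:
            # a younger transaction already read the version this write would supersede
            raise Abort(f"w{ts}({item}) rejected")
        self.versions[item].append([ts, value, ts])

store = MVTOStore({"x": 10})
print(store.read(2, "x"))      # transaction with timestamp 2 reads the initial version
try:
    store.write(1, "x", 11)    # older transaction writes too late
except Abort as exc:
    print("abort:", exc)
store.write(3, "x", 30)        # younger writer simply adds a new version
print(store.read(2, "x"), store.read(4, "x"))
```

Readers are never blocked or aborted in this scheme; only late writers are, which is the property that makes multiversion protocols attractive for read-heavy workloads.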

Page 12: MVCC on Flash Memory

Berkeley DB

Five components (each available as a standalone utility and as one or more library interfaces)

Deadlock detection
• db_deadlock
• DB_ENV->lock_detect, DB_ENV->set_lk_detect

Checkpoints
• db_checkpoint
• DB_ENV->txn_checkpoint

Database and log file archival
• db_archive
• DB_ENV->log_archive

Log file removal
• db_archive
• DB_ENV->log_archive

Recovery procedures
• db_recover
• DB_ENV->open

Page 13: MVCC on Flash Memory

Berkeley DB

Transaction API

Transaction Subsystem and Related Methods
• DB_ENV->txn_checkpoint, DB_ENV->txn_recover, DB_ENV->txn_stat
• DB_ENV->open, DB_ENV->close, DB_ENV->remove

Transaction Subsystem Configuration
• DB_ENV->set_timeout, DB_ENV->set_tx_max, DB_ENV->set_tx_timestamp

Transaction Operations
• DB_ENV->txn_begin, DB_TXN->abort, DB_TXN->commit, DB_TXN->discard, DB_TXN->id, DB_TXN->prepare, DB_TXN->set_name, DB_TXN->set_timeout

Page 14: MVCC on Flash Memory

Berkeley DB

2PL in Berkeley DB

Locks are released
• during DB_TXN->abort or DB_TXN->commit

Guidelines:
• If possible, use nested transactions to protect the parts of your transaction most likely to deadlock

Transaction limits
• Transaction IDs: 31-bit unsigned integer (0x80000000)
• Cursors: cannot span transactions; must be opened and closed within a single transaction
• Multiple Threads of Control:

Page 15: MVCC on Flash Memory

Berkeley DB

Several filesystem operations on Berkeley DB
• Disk seek to database file, database file read
• Disk seek to log file, log file write
• Disk seek to update log file metadata, log metadata write
• Flush log file information to disk
• Flush log file metadata to disk

Ways to increase transactional throughput
• Berkeley DB software supports group commit

Additional tuning parameters
• Tune the size of the database cache
• Put the database and the log files on different disks
• Set the filesystem configuration
• Upgrade your hardware
• Turn on DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC flags
  – ACI, but not D

Page 16: MVCC on Flash Memory

PostgreSQL

PG: a snapshot of data
• Reading never blocks writing
• Writing never blocks reading

Three undesirable phenomena
• dirty reads, non-repeatable reads, phantom reads

SQL Transaction Isolation Levels

Isolation Level     Dirty Read     Non-Repeatable Read   Phantom Read
Read uncommitted    Possible       Possible              Possible
Read committed      Not possible   Possible              Possible
Repeatable read     Not possible   Not possible          Possible
Serializable        Not possible   Not possible          Not possible

Page 17: MVCC on Flash Memory

PostgreSQL

Read Committed Isolation Level
• The default isolation level
• A SELECT query sees only data committed before the query began
• The SELECT does see the effects of previous updates executed within this same transaction
• Two successive SELECTs can see different data
  • Other transactions commit changes during execution of the first SELECT
• NOT adequate for many applications that do complex queries and updates

Serializable Isolation Level
• This level emulates serial transaction execution.
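The difference between the two levels comes down to when the snapshot is taken. The following toy model is a sketch, not PostgreSQL's implementation: `ToyDB` and `Txn` are hypothetical classes, commits are numbered by a simple counter, and a transaction's own writes are not modelled. A "read committed" transaction refreshes its snapshot for every statement, while the other keeps the snapshot it took when it started.

```python
class ToyDB:
    """Keeps every committed value together with the commit number that produced it."""
    def __init__(self):
        self.commit_seq = 0
        self.history = {}  # key -> list of (commit_seq, value), oldest first

    def commit(self, writes):
        self.commit_seq += 1
        for key, value in writes.items():
            self.history.setdefault(key, []).append((self.commit_seq, value))

    def read_as_of(self, key, seq):
        visible = [value for s, value in self.history.get(key, []) if s <= seq]
        return visible[-1] if visible else None

class Txn:
    def __init__(self, db, per_statement_snapshot):
        self.db = db
        self.per_statement_snapshot = per_statement_snapshot
        self.snapshot = db.commit_seq  # snapshot taken at transaction start

    def select(self, key):
        if self.per_statement_snapshot:
            # read committed: each statement sees everything committed before it began
            self.snapshot = self.db.commit_seq
        return self.db.read_as_of(key, self.snapshot)

db = ToyDB()
db.commit({"balance": 100})

read_committed = Txn(db, per_statement_snapshot=True)
snapshot_txn = Txn(db, per_statement_snapshot=False)
first = (read_committed.select("balance"), snapshot_txn.select("balance"))

db.commit({"balance": 50})  # another transaction commits in between

second = (read_committed.select("balance"), snapshot_txn.select("balance"))
print(first, second)
```

Neither reader blocks the concurrent writer; the only difference is that the second SELECT of the read committed transaction observes the newly committed value.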

Page 18: MVCC on Flash Memory

PostgreSQL

Data consistency checks at the application level
• Readers in PostgreSQL don't lock data
• To ensure the current existence of a row and protect it against concurrent updates one must use SELECT FOR UPDATE or an appropriate LOCK TABLE statement. (SELECT FOR UPDATE locks just the returned rows against concurrent updates, while LOCK TABLE protects the whole table.)

Lock and Tables
• Table-level lock
• Row-level lock: when rows are being updated

Lock and Index
• GiST and R-tree: released after statement is done
• Hash index: released after page is processed
• B-tree: released immediately after each index tuple is fetched/inserted

Page 19: MVCC on Flash Memory

Lock mode compatibility (√ = compatible, × = conflict)

                           ASL  RSL  REL  SUEL  SL   SREL  EL   AEL
AccessShareLock            √    √    √    √     √    √     √    ×
RowShareLock               √    √    √    √     √    √     ×    ×
RowExclusiveLock           √    √    √    √     ×    ×     ×    ×
ShareUpdateExclusiveLock   √    √    √    ×     ×    ×     ×    ×
ShareLock                  √    √    ×    ×     √    ×     ×    ×
ShareRowExclusiveLock      √    √    ×    ×     ×    ×     ×    ×
ExclusiveLock              √    ×    ×    ×     ×    ×     ×    ×
AccessExclusiveLock        ×    ×    ×    ×     ×    ×     ×    ×
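The compatibility matrix above can be encoded directly as data. This is a small sketch for experimenting with the table, with hypothetical names (`MODES`, `ROWS`, `compatible`); it is not how PostgreSQL's lock manager is implemented.

```python
MODES = [
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
]

# One row per held lock mode, one character per requested mode, in MODES order.
# "1" means the two modes are compatible, "0" means they conflict.
ROWS = [
    "11111110",  # AccessShareLock
    "11111100",  # RowShareLock
    "11110000",  # RowExclusiveLock
    "11100000",  # ShareUpdateExclusiveLock
    "11001000",  # ShareLock
    "11000000",  # ShareRowExclusiveLock
    "10000000",  # ExclusiveLock
    "00000000",  # AccessExclusiveLock
]

def compatible(held, requested):
    """True if a lock in mode `requested` can be granted while `held` is held."""
    return ROWS[MODES.index(held)][MODES.index(requested)] == "1"

print(compatible("ShareLock", "ShareLock"))
print(compatible("RowExclusiveLock", "ShareLock"))
print([m for m in MODES if compatible("ExclusiveLock", m)])
```

The matrix is symmetric, and ShareLock is the only mode above RowShareLock that is compatible with itself, which is why several sessions can hold it concurrently while writers are kept out.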

SR DR IR UR AT DT CI LT

AccessShareLock √ √ √ √ √ √ √ √

RowShareLock √ √

RowExclusiveLock √ √ √ √

ShareUpdateExclusiveLock √

ShareLock √ √

ShareRowExclusiveLock √

ExclusiveLock √

AccessExclusiveLock √ √ √

Page 20: MVCC on Flash Memory

Future work

• Experiment
• BDB & PG Code
• Transaction on Flash Memory

Concurrency Control
• MVCC

Recovery
• Log
