MVCC on Flash Memory
Fan Yulei, Lab of WAMDM, School of Information, Renmin University of China, Beijing, China, 2009-06-13
Outline
Motivation
MVCC
Berkeley DB
PostgreSQL
Future work
Motivation
Characteristics: Not In-Place Update
HDD Flash
Motivation
Transaction
CC (concurrency control)
• 2PL: 1st: Lock; 2nd: Release Lock
• MVCC: Multiple Version
• Conflict graph: Directed Acyclic Graph
• Timestamp: Timestamp Ordering
• Index CC: B+-Tree
Recovery
• Log: Log File & Data File; Checkpoint: D & S
• Transaction: Read Log file, Undo & Redo
• Media: Read Log file, Undo & Redo; Backup Database; Hot-standby: mirrored media
Kinds of Lock
Snapshot Isolation
MVCC
Monoversion Schedule
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
• s’ = r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2
Multiversion Schedule & Monoversion Schedule
Multiversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y0) w1(z1) c1 c2
• h(ri(x)) = wj(x) & h(wi(x)) = wi(x): version function
Monoversion Schedule
• m = r1(x0) w1(x1) r2(x1) w2(y2) r1(y2) w1(z1) c1 c2
• s = r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2
• A monoversion schedule is a special case of a multiversion schedule
Conflict cycle in s: t1, t2
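The conflict cycle between t1 and t2 can be checked mechanically. The following is a minimal Python sketch and is not taken from the slides: the helper names `parse` and `conflict_edges` are illustrative assumptions. Two steps are treated as conflicting when they belong to different transactions, touch the same item, and at least one of them is a write.

```python
import re

def parse(schedule):
    """Parse 'r1(x) w1(x) c1' into (op, txn, item) triples; commits are skipped."""
    return [(op, int(t), item)
            for op, t, item in re.findall(r"([rw])(\d+)\((\w+)\)", schedule)]

def conflict_edges(schedule):
    """Edge (ti, tj): a step of ti precedes a conflicting step of tj."""
    ops = parse(schedule)
    edges = set()
    for pos, (op1, t1, x1) in enumerate(ops):
        for op2, t2, x2 in ops[pos + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges

s = "r1(x) w1(x) r2(x) w2(y) r1(y) w1(z) c1 c2"
s_prime = "r1(x) w1(x) r2(x) r1(y) w2(y) w1(z) c1 c2"

print(sorted(conflict_edges(s)))        # [(1, 2), (2, 1)]  -> cycle t1, t2
print(sorted(conflict_edges(s_prime)))  # [(1, 2)]          -> no cycle
```

Swapping w2(y) and r1(y), as in s’, removes the edge from t2 to t1, so only s contains the cycle.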
MVCC
Traditional Conflict
• s = w0(x) c0 w1(x) c1 r2(x) w2(y) c2
• m = w0(x0) c0 w1(x1) c1 r2(x0) w2(y2) c2
View Equivalent
Reads-From Relationship
• RF(m) := {(ti, x, tj) | rj(xi) ∈ OP(m) & ti, tj ∈ trans(m)}
View Equivalent
• trans(m) = trans(m’) and RF(m) = RF(m’)
Example
• m = w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• m’ = w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2
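The view-equivalence test for the example can be sketched in a few lines of Python. This is an illustration under assumptions, not code from the slides: the function names `reads_from`, `trans`, and `view_equivalent` are hypothetical, items are single lowercase letters, and a read step is written `rj(xi)` with the version index equal to the writer's transaction number.

```python
import re

def reads_from(m):
    """RF(m): the set of (ti, x, tj) for every read step rj(xi) in m."""
    return {(int(i), x, int(j))
            for j, x, i in re.findall(r"r(\d+)\(([a-z])(\d+)\)", m)}

def trans(m):
    """trans(m): the transaction numbers that occur in m."""
    return {int(t) for t in re.findall(r"[rwc](\d+)", m)}

def view_equivalent(m1, m2):
    return trans(m1) == trans(m2) and reads_from(m1) == reads_from(m2)

m = "w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2"
m_prime = "w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w2(y2) w3(x3) c3 c2"

print(sorted(reads_from(m)))        # [(0, 'x', 3), (1, 'x', 2)]
print(view_equivalent(m, m_prime))  # True
```

Both schedules have t3 reading x from t0 and t2 reading x from t1, which is all that view equivalence compares.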
MVCC
Multiversion View Serializability
Serializable but not View Equivalent
• m = w0(x0) w0(y0) c0 r1(x0) r1(y0) w1(x1) w1(y1) c1 r2(x0) r2(y1) c2
• s = w0(x) w0(y) c0 r1(x) r1(y) w1(x) w1(y) c1 r2(x) r2(y) c2
MVSR
• m’ is a serial monoversion schedule
• trans(m) = trans(m’)
• m and m’ are view equivalent
Example
• m = w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2) c2
• m’ = w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2
• s = w0(x) w0(y) c0 r3(x) w3(x) c3 w1(x) c1 r2(x) w2(y) c2
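For small schedules, MVSR membership can be checked by brute force: try every serial order, run it on a single-version store, and compare reads-from relations. The sketch below is a hedged illustration, not the method used in the slides; `steps`, `serial_monoversion_rf`, and `mvsr_witness` are hypothetical names, and the search is exponential in the number of transactions.

```python
import re
from itertools import permutations

STEP = re.compile(r"([rw])(\d+)\(([a-z])(\d+)\)")

def steps(m):
    return [(op, int(t), x, int(v)) for op, t, x, v in STEP.findall(m)]

def reads_from(m):
    """(writer, item, reader) for every read step of the multiversion schedule."""
    return {(v, x, t) for op, t, x, v in steps(m) if op == "r"}

def serial_monoversion_rf(m, order):
    """Reads-from relation when the transactions of m run serially in `order`
    and every read sees the latest preceding write (None if there is none)."""
    by_txn = {}
    for op, t, x, _ in steps(m):
        by_txn.setdefault(t, []).append((op, x))
    last_writer, rf = {}, set()
    for t in order:
        for op, x in by_txn[t]:
            if op == "r":
                rf.add((last_writer.get(x), x, t))
            else:
                last_writer[x] = t
    return rf

def mvsr_witness(m):
    """Return a serial order whose monoversion run is view equivalent to m, or None."""
    txns = sorted({t for _, t, _, _ in steps(m)})
    target = reads_from(m)
    for order in permutations(txns):
        if serial_monoversion_rf(m, order) == target:
            return order
    return None

example = "w0(x0) w0(y0) c0 w1(x1) c1 r2(x1) r3(x0) w3(x3) c3 w2(y2) c2"
counter = "w0(x0) w0(y0) c0 r1(x0) r1(y0) w1(x1) w1(y1) c1 r2(x0) r2(y1) c2"

print(mvsr_witness(example))  # (0, 3, 1, 2)
print(mvsr_witness(counter))  # None
```

The witness order t0 t3 t1 t2 matches the serial schedule s of the example. For the schedule in which t2 reads x0 and y1, no serial monoversion order reproduces the same reads.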
MVCC
Conflict Graph
• G(m) = (V, E)
• V = trans(m); E = {(ti, tj) | rj(xi) ∈ OP(m) & ti, tj ∈ trans(m)}
• m and m’ are View Equivalent => G(m) = G(m’)
Version Order
• m = w0(x0) w0(y0) w0(z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2) w3(y3) w3(z3) c1 c2 c3 r4(x2) r4(y3) r4(z3) c4
• Version Order = {x0 « x2, y0 « y1 « y3, z0 « z3}
MVSG
• MVSG = G(m) + Version Order
• For rk(xj) and wi(xi) with i, j, k pairwise distinct: if xi « xj then (ti, tj) ∈ E; else (tk, ti) ∈ E
• m ∈ MVSR iff there is a version order « for which MVSG(m, «) has no cycle
[Figure: MVSG over transactions T0, T1, T2, T3, T4; read labels r2(x0), r2(y1), r2(x1), r2(y0)]
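The MVSG for the example schedule can be built directly from the two rules on the slide. The Python below is a minimal sketch under assumptions: `mvsg_edges` and `is_acyclic` are hypothetical helper names, and the version order is passed in as a per-item list of writer indices in « order.

```python
import re

def steps(m):
    return [(op, int(t), x, int(v))
            for op, t, x, v in re.findall(r"([rw])(\d+)\(([a-z])(\d+)\)", m)]

def mvsg_edges(m, version_order):
    """Edges of MVSG(m, «): G(m) plus the version-order edges."""
    ops = steps(m)
    edges = set()
    for op, k, x, j in ops:
        if op != "r":
            continue
        if j != k:
            edges.add((j, k))              # G(m): tk reads xj, so tj -> tk
        pos = version_order[x].index
        for op2, i, x2, _ in ops:
            if op2 == "w" and x2 == x and i not in (j, k):
                if pos(i) < pos(j):
                    edges.add((i, j))      # xi « xj
                else:
                    edges.add((k, i))      # xj « xi
    return edges

def is_acyclic(edges):
    """Repeatedly strip nodes without incoming edges; a leftover means a cycle."""
    nodes = {n for e in edges for n in e}
    remaining = set(edges)
    while nodes:
        sources = {n for n in nodes if all(b != n for _, b in remaining)}
        if not sources:
            return False
        nodes -= sources
        remaining = {(a, b) for a, b in remaining if a not in sources}
    return True

m = ("w0(x0) w0(y0) w0(z0) c0 r1(x0) r2(x0) r2(z0) r3(z0) w1(y1) w2(x2) "
     "w3(y3) w3(z3) c1 c2 c3 r4(x2) r4(y3) r4(z3) c4")
order = {"x": [0, 2], "y": [0, 1, 3], "z": [0, 3]}

edges = mvsg_edges(m, order)
print(sorted(edges))
print(is_acyclic(edges))  # True
```

With this version order the graph has no cycle, and t0, t1, t2, t3, t4 is a topological order of it.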
MVCC
Multiversion Conflict
• ri(xj) and wk(xk) such that ri(xj) < wk(xk)
Multiversion Conflict Serializability
• m’ is a serial monoversion schedule
• trans(m) = trans(m’)
• Pairs of operations in multiversion conflict: same ordering in m and m’
Multiversion Conflict Graph
• E = {(ti, tk) | ri(xj) < wk(xk)}
• m ∈ MCSR iff the multiversion conflict graph of m has no cycle
[Figure: containment of the schedule classes all, MVSR, MCSR, VSR, CSR]
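The edge rule of the multiversion conflict graph only looks at a read followed by a later write of the same item by another transaction. A short Python sketch follows; `mv_conflict_edges` is a hypothetical helper name, and the second schedule is a made-up example used only to show a cycle.

```python
import re

def mv_conflict_edges(m):
    """Edge (ti, tk) whenever a read ri(xj) precedes a write wk(xk), i != k."""
    ops = [(op, int(t), x)
           for op, t, x in re.findall(r"([rw])(\d+)\(([a-z])\d+\)", m)]
    edges = set()
    for pos, (op, i, x) in enumerate(ops):
        if op != "r":
            continue
        for op2, k, x2 in ops[pos + 1:]:
            if op2 == "w" and x2 == x and k != i:
                edges.add((i, k))
    return edges

# Schedule from the view-equivalence example earlier in the deck.
m = "w0(x0) w0(y0) c0 r3(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(y2) c2"
print(mv_conflict_edges(m))  # {(3, 1)}

# Hypothetical schedule: both transactions read x0 before the other writes x.
cyclic = "r1(x0) r2(x0) w1(x1) w2(x2) c1 c2"
print(sorted(mv_conflict_edges(cyclic)))  # [(1, 2), (2, 1)]
```

Write-write and write-read pairs add no edges here, because a multiversion scheduler can always keep the older version available for the reader.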
MVCC
Limit the number of versions: k=2
• w0(x0) c0 r1(x0) w3(x3) c3 w1(x1) c1 r2(x1) w2(x2) c2
• w0(x0) c0 r1(x0) w1(x1) c1 r2(x1) w2(x2) c2 w3(x3) c3
• w0(x0) c0 r1(x0) w1(x1) c1 w3(x3) c3 r2(x3) w2(x2) c2
• w0(x0) c0 r2(x0) w2(x2) c2 r1(x2) w1(x1) c1 w3(x3) c3
• w0(x0) c0 r2(x0) w2(x2) c2 w3(x3) c3 r1(x3) w1(x1) c1
• w0(x0) c0 w3(x3) c3 r1(x3) w1(x1) c1 r2(x1) w2(x2) c2
• w0(x0) c0 w3(x3) c3 r2(x3) w2(x2) c2 r1(x2) w1(x1) c1
K-version view serializability (kVSR)
• Serializable
• View equivalent
• k newest/nearest versions
Hierarchy Relationship
• VSR = 1VSR ⊂ 2VSR ⊂ 3VSR ⊂ … ⊂ MVSR
[Figure labels, version pairs: x1,x2 · x2,x3 · x2,x3 · x1,x3 · x1,x3 · x1,x2 · x1,x2]
MVCC
MVCC Protocol
MVTO (multiversion timestamp ordering)
MV2PL: 2V2PL
• three kinds of locks: rl, wl, cl
MVSGT
ROMV
• Read-only transaction
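The slide only names the protocols. As an illustration of the first one, here is a toy Python sketch of the usual textbook MVTO rules: a read returns the version with the largest write timestamp not exceeding the reader's timestamp, and a write is rejected if a younger transaction has already read an older version that the new version would have to precede. The class and method names are assumptions made for this sketch, and commit ordering and garbage collection of versions are left out.

```python
class Aborted(Exception):
    """Raised when a write arrives too late and the writer must abort."""

class MVTO:
    def __init__(self, initial):
        # versions[x]: list of (writer_ts, value); timestamp 0 is the initial load.
        self.versions = {x: [(0, v)] for x, v in initial.items()}
        # reads[x]: list of (reader_ts, ts_of_version_read).
        self.reads = {x: [] for x in initial}

    def read(self, ts, x):
        candidates = [v for v in self.versions[x] if v[0] <= ts]
        wts, value = max(candidates, key=lambda v: v[0])
        self.reads[x].append((ts, wts))
        return value

    def write(self, ts, x, value):
        for reader_ts, version_ts in self.reads[x]:
            if version_ts < ts < reader_ts:
                raise Aborted(f"t{ts} wrote {x} after t{reader_ts} read an older version")
        self.versions[x].append((ts, value))

store = MVTO({"x": 10})
print(store.read(3, "x"))       # 10
try:
    store.write(2, "x", 20)     # t3 already read the version t2 would supersede
except Aborted as exc:
    print("abort:", exc)
store.write(4, "x", 40)
print(store.read(3, "x"))       # 10, t3 still sees its own snapshot
print(store.read(5, "x"))       # 40
```

Reads never block or abort in this scheme; only late writers are rejected.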
Berkeley DB
Five components
Each is available as a standalone utility and as one or more library interfaces.
Deadlock detection
• db_deadlock
• DB_ENV->lock_detect, DB_ENV->set_lk_detect
Checkpoints
• db_checkpoint
• DB_ENV->txn_checkpoint
Database and log file archival
• db_archive
• DB_ENV->log_archive
Log file removal
• db_archive
• DB_ENV->log_archive
Recovery procedures
• db_recover
• DB_ENV->open
Berkeley DB
Transaction API
Transaction Subsystem and Related Methods
• DB_ENV->txn_checkpoint, DB_ENV->txn_recover, DB_ENV->txn_stat
• DB_ENV->open, DB_ENV->close, DB_ENV->remove
Transaction Subsystem Configuration
• DB_ENV->set_timeout, DB_ENV->set_tx_max, DB_ENV->set_tx_timestamp
Transaction Operations
• DB_ENV->txn_begin, DB_TXN->abort, DB_TXN->commit, DB_TXN->discard, DB_TXN->id, DB_TXN->prepare, DB_TXN->set_name, DB_TXN->set_timeout
Berkeley DB
2PL in Berkeley DB
Locks are released
• during DB_TXN->abort or DB_TXN->commit.
Guidelines:
• If possible, use nested transactions to protect the parts of your transaction most likely to deadlock
Transaction limits
• Transaction IDs: 31-bit unsigned integer (0x80000000)
• Cursors: cannot span transactions; must be opened and closed within a single transaction
• Multiple Threads of Control:
Berkeley DB
Several filesystem operations on Berkeley DB
• Disk seek to database file
• Database file read
• Disk seek to log file
• Log file write
• Disk seek to update log file metadata
• Log metadata write
• Flush log file information to disk
• Flush log file metadata to disk
Ways to increase transactional throughput
Berkeley DB software supports group commit
Additional tuning parameters
• Tune the size of the database cache
• Put the database and the log files on different disks
• Set the filesystem configuration
• Upgrade your hardware
• Turn on DB_TXN_WRITE_NOSYNC or DB_TXN_NOSYNC flags
– ACI, but not D
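Why group commit raises throughput can be seen in a toy model that counts log flushes. The Python below is only a sketch under assumptions: it does not use or imitate the Berkeley DB API, the `Log` class is an in-memory stand-in, and the fixed `group_size` is an arbitrary choice for illustration.

```python
class Log:
    """In-memory stand-in for a log file; flush() models one synchronous disk flush."""
    def __init__(self):
        self.buffer = []
        self.durable = []
        self.flushes = 0

    def append(self, record):
        self.buffer.append(record)

    def flush(self):
        if self.buffer:
            self.durable.extend(self.buffer)
            self.buffer.clear()
            self.flushes += 1

def commit_individually(txn_ids):
    """Every commit pays for its own flush."""
    log = Log()
    for t in txn_ids:
        log.append(("commit", t))
        log.flush()
    return log

def commit_in_groups(txn_ids, group_size):
    """Commit records share a flush with their neighbours."""
    log = Log()
    for n, t in enumerate(txn_ids, start=1):
        log.append(("commit", t))
        if n % group_size == 0:
            log.flush()
    log.flush()  # flush the final partial group
    return log

txns = list(range(10))
print(commit_individually(txns).flushes)  # 10
print(commit_in_groups(txns, 4).flushes)  # 3
```

Both runs make the same ten commit records durable; the grouped run does so with three flushes instead of ten.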
PostgreSQL
PG: a snapshot of data
• Reading never blocks writing
• Writing never blocks reading
Three undesirable phenomena
• dirty reads, non-repeatable reads, phantom reads
SQL Transaction Isolation Levels
Isolation Level    Dirty Read     Non-Repeatable Read   Phantom Read
Read uncommitted   Possible       Possible              Possible
Read committed     Not possible   Possible              Possible
Repeatable read    Not possible   Not possible          Possible
Serializable       Not possible   Not possible          Not possible
PostgreSQL
Read Committed Isolation Level
• the default isolation level
• A SELECT query sees only data committed before the query began
• The SELECT does see the effects of previous updates executed within this same transaction
• Two successive SELECTs can see different data
– Other transactions commit changes during execution
• NOT adequate for many applications that do complex queries and updates
Serializable Isolation Level
• This level emulates serial transaction execution.
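The difference between the two levels comes down to when the snapshot is taken. The Python below is a toy model, not PostgreSQL code: `Store` and `Txn` are hypothetical classes, a snapshot is just a commit counter, and a transaction's own uncommitted writes are not modeled. It assumes that Read Committed takes a fresh snapshot for every statement while the stricter level keeps one snapshot for the whole transaction.

```python
class Store:
    """Toy multiversion store: every commit appends new versions."""
    def __init__(self):
        self.versions = {}  # key -> list of (commit_seq, value)
        self.seq = 0

    def commit(self, writes):
        self.seq += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.seq, value))

    def read(self, key, snapshot):
        visible = [v for s, v in self.versions.get(key, []) if s <= snapshot]
        return visible[-1] if visible else None

class Txn:
    def __init__(self, store, level):
        self.store = store
        self.level = level
        self.snapshot = store.seq  # taken at transaction start in this sketch

    def select(self, key):
        if self.level == "read committed":
            self.snapshot = self.store.seq  # fresh snapshot per statement
        return self.store.read(key, self.snapshot)

store = Store()
store.commit({"balance": 100})

rc = Txn(store, "read committed")
sr = Txn(store, "serializable")
print(rc.select("balance"), sr.select("balance"))  # 100 100

store.commit({"balance": 50})  # another transaction commits in between

print(rc.select("balance"))  # 50, the second SELECT sees the new commit
print(sr.select("balance"))  # 100, still the transaction's own snapshot
```

Neither reader takes a lock, so the concurrent commit is never blocked by them.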
PostgreSQL
Data consistency checks at the application level
• Readers in PostgreSQL don't lock data
• To ensure the current existence of a row and protect it against concurrent updates, one must use SELECT FOR UPDATE or an appropriate LOCK TABLE statement. (SELECT FOR UPDATE locks just the returned rows against concurrent updates, while LOCK TABLE protects the whole table.)
Lock and Tables
• Table-level Lock
• Row-level: when rows are being updated
Lock and Index
• GiST and R-tree: released after statement is done
• Hash Index: released after page is processed
• B-Tree: released immediately after each index tuple is fetched/inserted
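Lock mode compatibility (√ = compatible, × = conflict; column abbreviations follow the row order, ASL = AccessShareLock through AEL = AccessExclusiveLock)
Lock mode                 ASL  RSL  REL  SUEL SL   SREL EL   AEL
AccessShareLock           √    √    √    √    √    √    √    ×
RowShareLock              √    √    √    √    √    √    ×    ×
RowExclusiveLock          √    √    √    √    ×    ×    ×    ×
ShareUpdateExclusiveLock  √    √    √    ×    ×    ×    ×    ×
ShareLock                 √    √    ×    ×    √    ×    ×    ×
ShareRowExclusiveLock     √    √    ×    ×    ×    ×    ×    ×
ExclusiveLock             √    ×    ×    ×    ×    ×    ×    ×
AccessExclusiveLock       ×    ×    ×    ×    ×    ×    ×    ×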
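The compatibility matrix above can be encoded as data and queried. The Python sketch below transcribes the table row by row ("1" for √, "0" for ×); the names `MODES`, `COMPAT`, and `compatible` are assumptions for this illustration and are not PostgreSQL identifiers.

```python
MODES = [
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
]

# One string per row of the table, columns in the same order as MODES.
COMPAT = [
    "11111110",  # AccessShareLock
    "11111100",  # RowShareLock
    "11110000",  # RowExclusiveLock
    "11100000",  # ShareUpdateExclusiveLock
    "11001000",  # ShareLock
    "11000000",  # ShareRowExclusiveLock
    "10000000",  # ExclusiveLock
    "00000000",  # AccessExclusiveLock
]

def compatible(held, requested):
    """True if `requested` can be granted while another transaction holds `held`."""
    return COMPAT[MODES.index(held)][MODES.index(requested)] == "1"

print(compatible("ShareLock", "ShareLock"))                  # True
print(compatible("RowExclusiveLock", "ShareLock"))           # False
print(compatible("AccessShareLock", "AccessExclusiveLock"))  # False
```

The matrix is symmetric, so the order of the two arguments does not change the answer.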
SR DR IR UR AT DT CI LT
AccessShareLock √ √ √ √ √ √ √ √
RowShareLock √ √
RowExclusiveLock √ √ √ √
ShareUpdateExclusiveLock √
ShareLock √ √
ShareRowExclusiveLock √
ExclusiveLock √
AccessExclusiveLock √ √ √
Future work
Experiment
BDB & PG Code
Transaction on Flash Memory
Concurrency Control
• MVCC
Recovery
• Log