the design and implementation of a transactional data - oracle

50
java.sun.com/javaone/sf | 2004 JavaOne SM Conference | Session TS-1442 1 The Design and Implementation of a Transactional Data Manager Charles Lamb Architect Sleepycat Software http://www.sleepycat.com Berkeley DB Java™ Edition

Upload: others

Post on 11-Feb-2022

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Design and Implementation of a Transactional Data - Oracle

java.sun.com/javaone/sf

| 2004 JavaOneSM Conference | Session TS-14421

The Design and Implementationof a Transactional Data Manager

Charles LambArchitectSleepycat Softwarehttp://www.sleepycat.com

Berkeley DB Java� Edition

Page 2: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14422

Presentation Goal

Provide insight into the technical trade-offs involved in implementing a robust, high-performance, transactional database in the Java� programming language

Page 3: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14423

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 4: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14424

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 5: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14425

Motivation and Overview

� A small-footprint, high-speed, transactional database for:─ Backend storage or caching for web/app server─ Embedded transactional database─ Lightweight persistence for the Java

programming language─ Traditional transaction processing

Berkeley DB Java� Edition

Page 6: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14426

Motivation and Overview

� ACID properties� B+Tree access method� Record locking─ Read-modify-write locks─ Dirty reads

� Recovery� Large database support─ Hundreds of gigabytes of data─ Tens of millions of records

Standard DBMS features

Page 7: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14427

Motivation and Overview

� Log-based, no-overwrite storage system� Highly concurrent tree updates� Graceful interaction with JVM�

memory system

Unique features (discussed here)

Page 8: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14428

Motivation and Overview

� Schema independent� No ad-hoc queries� In-memory speeds, no IPC

from client to server� Lights-out operation� N:M transaction:thread model� Open source� No native/Java Native Interface

(JNI) code

Unique features (not discussed here)

Page 9: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-14429

Motivation and Overview

� Log-based system improves performance� Single-process, multi-threaded access� Both steady and �bursty� workloads� Read-mostly workloads─ Support high-concurrency writes too

� Typically �in-memory� applications─ But degrade gracefully if can�t fit in-memory

Design assumptions

Page 10: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144210

Motivation and Overview

� Environment (think Relational DB �Database�)─ JE Database(s) contained within environment─ May be transactional─ Sequentially numbered log files─ Several daemon threads

� Database (think Relational DB �Table�)─ Map: key/data pairs─ May be transactional

Terminology

Page 11: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144211

Motivation and Overview JE in the application JVM process

JE Library (approx. 450KB code)

Hea

p

Application code

JVM Process

File systemJE daemon threads

Page 12: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144212

Motivation and Overview

� Persistent Collections API (built on base API)─ Full implementation of Java Collections

Framework technology─No new API to learn

─ Transactions, Iterators, exact and range queries, joins (set intersection)

─ Transparent use of secondary indices, foreign keys─ Bindings allow different marshalling options

─Compact form of Java API serialization ─DataInput/DataOutput style marshalling ─Strings and primitive wrapper classes

APIs

Page 13: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144213

Motivation and Overview

� Base API─ More specific operations than Persistent Collections ─ get/put of byte[] key/data pairs─ Transactions, Cursors, exact and range queries,

joins (set intersection)─ Secondary indices, foreign keys─ Multi-threaded transactions─ Explicit configuration for performance

related parameters

APIs

Page 14: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144214

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 15: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144215

Collections Example

import com.sleepycat.collections.TransactionRunner;import com.sleepycat.collections.TransactionWorker;

public class MyWorker implements TransactionWorker {...

private Environment env = new Environment(…);...

static public void main(String argv[]) {MyWorker worker = new MyWorker();TransactionRunner runner =

new TransactionRunner(env);runner.run(worker);

}

Create a transaction

Page 16: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144216

Collections Example

import com.sleepycat.collections.StoredIterator;

public void doWork() { // in a transaction// Add an entrymap.put(“myKey”, “myData”);...// Read entries backIterator iter = map.entrySet().iterator();while (iter.hasNext()) {

Map.Entry entry = (Map.Entry) iter.next();String keyStr = entry.getKey().toString();System.out.println(keyStr + “ “ +

entry.getValue());}StoredIterator.close(iter);

}

Add an entry to a map, read entries back

Page 17: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144217

Base API Example

import com.sleepycat.je.DatabaseEntry;...

// Add an entryTransaction txn =

env.beginTransaction(null, null);DatabaseEntry key =

new DatabaseEntry(“myKey”.getBytes());DatabaseEntry data =

new DatabaseEntry(“myData”.getBytes());db.put(txn, key, data);txn.commit();

Add an entry to a database

Page 18: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144218

Base API Example

import com.sleepycat.je.Cursor;...

Transaction txn = env.beginTransaction(null, null);

Cursor cursor = db.openCursor(txn, null);DatabaseEntry key = new DatabaseEntry();DatabaseEntry data = new DatabaseEntry();while (cursor.getNext

(key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {

System.out.println(new String(key.getData()) + “ “ +new String(data.getData()));

}cursor.close();txn.commit();

Read entries back

Page 19: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144219

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 20: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144220

Log-Based, No-OverwriteStorage System

� Based on work by─ Ousterhout, Rosenblum (1991)1

─ Seltzer, Bostic, McKusick, Staelin (1993)2

� Data is written only once (even updates)─ Disk head generally stays on same track─ Inserts and updates are fast

� Data can�t be arbitrarily clustered─ Out-of-cache reads are generally slower─ Assumption: working set fits in memory

� Log and data (the �material DB�) are the same� Backup and restore are simplified

Overview

Page 21: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144221

Log-Based, No-OverwriteStorage System

� Storage system uses NIO ByteBuffer─ Makes it easy to set up a buffer pool─ Caution: not thread safe─ Caution: watch out for Direct Memory Buffers

in 1.4.2

Java technology NIO usage

Page 22: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144222

Log-Based Storage System

� Every N MB�s, log switches to a new file� Periodic consolidation is required─ Background daemon thread─ User invoked through API call

� Cleaner scans old log files─ For each entry in the file, determine if it is in the tree

─Migrate live data to current log file─Discard (by ignoring) obsolete data

─ If all records processed successfully, delete the file

Cleaner (Persistent Garbage Collector)

Page 23: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144223

Log-Based Storage System

� Two cleaner functions─ Data migration (easy)─ Log file selection (harder)

─Maintain utilization info about �live� data per log file─Stored in a database in the environment

� No additional files or file formats� Recoverable

Cleaner log file selection

Page 24: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144224

Log-Based Storage System

� Problem: How much IO/CPU is being used?� Solution: Application invokes cleaner during

quiet times─ Future: Application can set cleaner thread priority─ Future: Allow cleaner to run in a thread pool

Cleaner challenges in Java technology-based applications

Page 25: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144225

Log-Based Storage System

� Full─ Copy all log files

� Incremental─ Copy all log files that are new or modified

since last backup

� No �holes� in the log means it is always consistent

� Hot and cold backup are the same─ No locks required during hot backup

Backup

Page 26: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144226

Log-Based Storage System

� Recovery from system failure─ Open the database, JE performs recovery

� Recovery from media failure─ Restore log files from backup─ Open the database, JE performs recovery

� Interval between checkpoints bounds recovery time

Recovery

Page 27: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144227

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 28: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144228

LNLNLN

IN

LN

IN

LN LNLNLN

IN

LNLN

Tree structure

Highly Concurrent Tree Updates

IN = Internal Node (keys only)

ININ IN IN

LN = Leaf Node (keys + data)

Page 29: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144229

Highly Concurrent Tree Updates

� Lightweight mutexes on internal data structures� Exclusive and shared/exclusive varieties� Never held across user API calls (short-lived)� Deadlock avoidance, not deadlock detection─ Multiple latch acquisitions require strict ordering

� synchronized keyword isn�t sufficient─ Fairness (first come, first served) required─ Shared/exclusive necessary

Latches

Page 30: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144230

Highly Concurrent Tree Updates

� Heavyweight mutexes for locking records─ Available with or without transactions

� Held across API calls� Read (shared), write (exclusive) varieties� Standard two-phase locking semantics─ Held until transaction end or cursor advance

� Deadlocks are detectable� Implemented using latches� First come, first served behavior

Logical locks

Page 31: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144231

Highly Concurrent Tree Updates

� Record-locking� LNs (records) are locked, not latched � INs are latched for short periods, not locked� Latch couple INs during tree descent� Never latch up the tree─ Latching up can cause latch deadlocks─ No parent pointers in the tree removes temptation

Tree latching and locking

Page 32: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144232

Highly Concurrent Tree Updates

� INs do not require rollback� INs may contain keys that are aborted

or not yet committed─ Gets/puts block on LN lock during

concurrent access

� LNs are transactional─ Aborts cause reversion to last committed LN ─ LNs are the final authority on key/data pairs

Impact on transactions and recovery

Page 33: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144233

Highly Concurrent Tree Updates

� Tree splits─ Opportunistic ─ Not undone in an abort

� Compressor─ Tree rebalancing operations─ Tree cleanup from delete operations

─Remove deleted LNs and empty subtrees─ Daemon thread or API call

Compressor

Page 34: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144234

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 35: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144235

Integration With Memory Management

� JE shares JVM process with application─ Lives within user-specified memory budget

� JE maintains it�s own memory cache─ Approximates LRU behavior─ JE cache chooses eviction target─ Can�t use weak/soft references

The JE memory cache

Page 36: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144236

Integration With Memory Management

� Determining actual object space usage─ J2SE� 1.5 platform will help to calculate better

object overhead sizes (in the future)

� Maintaining total JE cache space usage

But�

� There�s no internal fragmentation─ Everything is allocated as objects─ No fixed size buffers/pages

Challenges

Page 37: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144237

Integration With Memory Management

� Flushes memory cache� Daemon or API call� To evict an object─ Select a victim (IN or LN)─ Write to log if dirty─ Null the reference to it─ Let the GC do the rest

� One specialized concurrent data structure

Evictor

Page 38: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144238

Integration With Memory Management

� Positives:─ Objects have better granularity than pages

─Locking─ I/O─Memory management

─ Java technology is optimized for lots of small objects

� Negative:─ Variable object sizes harder to manage

Objects vs. pages

Page 39: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144239

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 40: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144240

Performance

� Commodity hardware ($1600)─ Dual CPU Pentium Xeon, 2.4GHz, hyper-threading─ 1024 MB memory─ Windows XP─ 7200 RPM IDE disk

� J2SE 1.4.2 platform� 300 byte records─ 6 bytes key + 294 bytes data

Benchmark configuration

Page 41: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144241

Performance

� 20,000 x 300 byte records� Baseline throughput (writes per second)─ Straight java.nio writes with and without fsyncs*

Compared to�

� JE insert, update, delete throughput─ Vary number of records per transaction─ With and without fsyncs* at transaction commit

* �without fsyncs� implies non-durable transactions

Update benchmark methodology

Page 42: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144242

PerformanceModifications per second

Source: Sleepycat internal performance measurement tests

1

10

100

1000

10000

100000

1 rec/txn 4 rec/txn 20 rec/txn non-durable txn

BaselineInsertUpdateDelete

CPU-boundDisk-bound

Not

e: L

ogar

ithm

ic s

cale

Page 43: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144243

Performance

� 200,000 x 300 byte records� Sequential data scans of entire database─ With and without locks/transactions─ When transactions are enabled, benchmark

performs one complete scan per transaction

Compared to�

� Random reads─ With and without locks/transactions─ When transactions are enabled, benchmark

performs one read per transaction

Read benchmark methodology

Page 44: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144244

PerformanceReads per second�Warm cache

Source: Sleepycat internal performance measurement tests

0

20,000

40,000

60,000

80,000

100,000

Scan Random

no txns, no locksno txns, lockstxns, locks

Not

e: L

inea

r sca

le

Page 45: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144245

Agenda

Motivation and OverviewCode ExamplesLog-Based, No-Overwrite Storage SystemHighly Concurrent Tree UpdatesIntegration with Memory ManagementPerformanceSummary and Conclusions

Page 46: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144246

Summary

� Use of a log-based file system can improve write performance

� Maintain synergy with the garbage collector when managing memory in Java technology-based applications

� Create special highly-concurrent structures only as needed

� Optionally allow the application to manually schedule daemon functions

Page 47: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144247

Conclusion

� High performance transaction processing is possible in the Java programming language if appropriate techniques are used

Page 48: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-144248

For More Information

� http://www.sleepycat.com to download .jar, source, Getting Started Guide, and docs

� References─ 1 �The Design and Implementation of a

Log-Structured File System�, Proceedings of the Thirteenth ACM Symposium on Operating System Principles, October 13�16, 1991.

─ 2 �An Implementation of a Log-Structured File System for UNIX®�, Proceedings 1993 Winter USENIX, San Diego.

Page 49: The Design and Implementation of a Transactional Data - Oracle

| 2004 JavaOneSM Conference | Session TS-1442 49

Q&A

Page 50: The Design and Implementation of a Transactional Data - Oracle

java.sun.com/javaone/sf

| 2004 JavaOneSM Conference | Session TS-144250

The Design and Implementationof a Transactional Data Manager

Charles LambArchitectSleepycat Softwarehttp://www.sleepycat.com

Berkeley DB Java� Edition