A Technical Introduction to WiredTiger
TRANSCRIPT
Presenter: Keith Bostic, co-architect of WiredTiger and Senior Staff Engineer at MongoDB
Ask questions as we go, or email [email protected]
This presentation is not…
• How to write stand-alone WiredTiger apps – contact [email protected]
• How to configure MongoDB with WiredTiger for your workload
WiredTiger
• Embedded database engine
  – general-purpose toolkit
  – high performance: scalable throughput with low latency
• Key-value store (NoSQL)
• Schema layer
  – data typing, indexes
• Single-node
• OO APIs
  – Python, C, C++, Java
• Open source
Deployments
• Amazon AWS
• ORC/Tbricks: financial trading solution
And, most important of all:
• MongoDB: next-generation document store
MongoDB’s Storage Engine API
• Allows different storage engines to "plug in"
  – different workloads have different performance characteristics
  – mmapV1 is not ideal for all workloads
  – more flexibility
  – mix storage engines in the same replica set/sharded cluster
• Opportunity to innovate further
  – HDFS, encrypted, other workloads
• WiredTiger is MongoDB’s general-purpose workhorse
Topics
➤ WiredTiger Architecture
• In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
Motivation for WiredTiger
• Traditional engines struggle with modern hardware:
  – lots of CPU cores
  – lots of RAM
• Avoid thread contention for resources
  – lock-free algorithms, for example, hazard pointers
  – concurrency control without blocking
• Hotter cache, more work per I/O
  – big blocks
  – compact file formats
  – compression
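The hazard-pointer idea mentioned above can be sketched roughly as follows. This is an illustrative, single-threaded Python sketch, not WiredTiger's lock-free C implementation; the class and method names are invented for the example.

```python
# A minimal hazard-pointer sketch (illustrative only). A reader "publishes"
# the page it is about to dereference; the eviction server may only free
# pages that no reader has published.

class HazardTable:
    def __init__(self):
        self.slots = {}          # reader id -> page currently in use

    def acquire(self, reader_id, page):
        # Publish the hazard before dereferencing the page. In the real
        # engine this store is followed by a memory flush and a re-check
        # that the page was not evicted in the meantime.
        self.slots[reader_id] = page
        return page

    def release(self, reader_id):
        self.slots.pop(reader_id, None)

    def safe_to_evict(self, page):
        # The evictor scans every hazard slot; a page may be freed only
        # if no reader holds a hazard pointer to it.
        return all(p is not page for p in self.slots.values())

table = HazardTable()
page = {"id": 42, "data": "leaf page"}

table.acquire("reader-1", page)
print(table.safe_to_evict(page))   # False: reader-1 is still using it
table.release("reader-1")
print(table.safe_to_evict(page))   # True: eviction may proceed
```

The point of the scheme is that readers never take locks: publishing a hazard pointer is a single store, so tree traversal stays contention-free.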
WiredTiger Architecture
[Architecture diagram: the Python, C, and Java APIs sit above a schema and cursor layer; the WiredTiger engine beneath provides transactions, snapshots, row and column storage, page read/write, block management, the cache, and logging; data persists in database files and log files.]
Column-store, LSM
• Column-store
  – implemented inside the B+tree
  – 64-bit record-number keys
  – a record’s key is implied by its position in the tree
  – variable-length or fixed-length values
• LSM
  – forest of B+trees (row-store or column-store)
  – Bloom filters (fixed-length column-store)
• Mix-and-match
  – sparse, wide table: column-store primary, LSM indexes
Topics
✓ WiredTiger Architecture
➤ In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
Trees in cache
[Diagram: a B+tree in cache, with a root page, internal pages, and leaf pages linked by ordinary pointers; some child pages are non-resident, i.e., not yet read into cache.]
Hazard Pointers
[Diagram: the same B+tree; before dereferencing a pointer to a page, a reader publishes a hazard pointer and issues a memory flush, so the eviction server will not free the page out from under it.]
Pages in cache
[Diagram: data files hold on-disk page images; in cache, a clean page is just the on-disk page image plus an index, while a dirty page additionally carries a list of updates.]
Skiplists
• Updates stored in skiplists
  – ordered linked lists with forward “skip” pointers
• William Pugh, 1989
  – simpler than balanced trees, about as fast as binary search, less space
  – likely binary-search performance plus cache prefetch
  – more space for an existing data set
• Implementation
  – insert without locking
  – forward/backward traversal without locking, concurrent with inserts
  – removal requires locking
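A skiplist of the kind described can be sketched as below. This is a simplified, single-threaded Python sketch; WiredTiger's C implementation additionally makes insert and traversal lock-free, which this sketch does not attempt.

```python
import random

# A minimal skiplist: each node carries one forward pointer per level,
# so searches "skip" ahead at the top levels and narrow down.
MAX_LEVEL = 4

class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level   # one "skip" pointer per level

class SkipList:
    def __init__(self):
        self.head = Node(None, None, MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, key, value):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        new = Node(key, value, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            # In a lock-free version, this single pointer store is the
            # atomic step that makes the insert visible to readers.
            update[lvl].forward[lvl] = new

    def search(self, key):
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k in [30, 10, 20]:
    sl.insert(k, f"value-{k}")
print(sl.search(20))   # value-20
print(sl.search(25))   # None
```

Because an insert becomes visible through a single pointer store, concurrent readers either see the new node or they don't; they never see a half-linked structure, which is what makes the lock-free variant possible.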
In-memory performance
• Cache trees/pages optimized for in-memory access
• Follow pointers to traverse a tree
• No locking to read or write
• Keep updates separate from the initial data
  – updates are stored in skiplists
  – updates are atomic in almost all cases
• Do structural changes (eviction, splits) in background threads
Topics ü WiredTiger Architecture ü In-memory performance Ø Record-level concurrency • Compression • Durability and the journal • Future features
Multiversion Concurrency Control (MVCC)
• Multiple versions of records maintained in cache
• Readers see the most recently committed version
  – read-uncommitted or snapshot isolation also available
  – configurable per-transaction or per-handle
• Writers can create new versions concurrently with readers
• Concurrent updates to a single record cause write conflicts
  – one of the updates wins
  – the other generally retries with back-off
• No locking, no lock manager
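The version-chain idea behind MVCC can be sketched as follows. This is an illustrative Python sketch: the global transaction-id counter and the `Record` class stand in for WiredTiger's real snapshot machinery, and write-conflict detection is omitted for brevity.

```python
import itertools

# A minimal MVCC sketch: each record keeps a chain of versions tagged with
# the committing transaction's id; readers pick the newest version visible
# to their snapshot.
txn_ids = itertools.count(1)

class Record:
    def __init__(self):
        self.versions = []          # (commit_txn_id, value), newest last

    def write(self, txn_id, value):
        # A concurrent uncommitted update to the same record would be a
        # write conflict (one update wins, the other retries); omitted here.
        self.versions.append((txn_id, value))

    def read(self, snapshot_id):
        # Snapshot isolation: see the newest version committed at or
        # before the reader's snapshot, ignoring anything later.
        for commit_id, value in reversed(self.versions):
            if commit_id <= snapshot_id:
                return value
        return None

rec = Record()
rec.write(next(txn_ids), "v1")      # committed as txn 1
snapshot = 1                        # a reader takes a snapshot now
rec.write(next(txn_ids), "v2")      # txn 2 commits afterwards

print(rec.read(snapshot))           # v1: txn 2 is invisible to the snapshot
print(rec.read(2))                  # v2: a later snapshot sees the new version
```

Readers never block writers and writers never block readers; the only coordination point is between two writers touching the same record, which is exactly the write-conflict case above.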
Pages in cache
[Diagram: as on the earlier “Pages in cache” slide, but the dirty page’s updates are now shown stored in a skiplist.]
MVCC In Action
[Diagram: starting from an on-disk page image and its index, each update(txn, value) call prepends a new version to the record’s update chain, so readers find the newest visible version in front of the on-disk image.]
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
➤ Compression
• Durability and the journal
• Future features
Block manager
• Block allocation
  – fragmentation
  – allocation policy
• Checksums
  – block compression is done at a higher level
• Checkpoints
  – involved in durability guarantees
• Opaque address cookie
  – stored as the internal page key’s “value”
• Pluggable
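The checksum and address-cookie ideas can be sketched together. The block layout, cookie contents, and choice of CRC32 here are assumptions for illustration, not WiredTiger's actual on-disk format.

```python
import zlib

# A minimal block-write/read sketch: the writer returns an opaque cookie
# (offset, size, checksum); the reader uses it to fetch and verify the block.
def write_block(storage, data):
    checksum = zlib.crc32(data)
    offset = len(storage)
    storage.extend(data)
    # The "address cookie" is all a caller needs to get the block back:
    # where it is, how big it is, and how to verify it.
    return (offset, len(data), checksum)

def read_block(storage, cookie):
    offset, size, checksum = cookie
    data = bytes(storage[offset:offset + size])
    if zlib.crc32(data) != checksum:
        raise IOError("block checksum mismatch: corrupt or misdirected read")
    return data

storage = bytearray()
cookie = write_block(storage, b"leaf page contents")
print(read_block(storage, cookie))   # b'leaf page contents'
```

Because the cookie is opaque to the rest of the engine, the block manager can be swapped out ("pluggable") without the B+tree layer knowing anything about the addressing scheme.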
Write path
[Diagram: when a dirty page is written back to the data file, its on-disk page image and its update list are reconciled during the write into a new page image.]
In-memory Compression
• Prefix compression
  – index keys usually share a common prefix
  – rolling, per-block; requires key instantiation for performance
• Huffman/static encoding
  – burns CPU
• Dictionary lookup
  – a single copy of each repeated value per page
• Run-length encoding
  – column-store values
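Prefix compression, the first item above, can be sketched as follows. The key layout is an assumption for illustration, not WiredTiger's page format: each key stores only the byte count it shares with the previous key, plus its unique suffix.

```python
# A minimal rolling prefix-compression sketch over sorted keys.
def compress_keys(keys):
    out, prev = [], b""
    for key in keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        out.append((shared, key[shared:]))   # (shared prefix length, suffix)
        prev = key
    return out

def decompress_keys(entries):
    keys, prev = [], b""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = [b"customer/0001", b"customer/0002", b"customer/0010"]
packed = compress_keys(keys)
print(packed)   # [(0, b'customer/0001'), (12, b'2'), (11, b'10')]
print(decompress_keys(packed) == keys)   # True
```

The trade-off the slide mentions: reconstructing a key requires walking back through earlier entries, so for fast repeated searches the engine "instantiates" the full keys in memory.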
On-disk Compression
• Compression algorithms
  – snappy [default]: good compression, low overhead
  – LZ4: good compression, low overhead, better page layout
  – zlib: better compression, higher overhead
  – pluggable
• Optional
  – use a compressing filesystem instead
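The compression-versus-CPU trade-off can be seen directly with zlib, which ships with Python (snappy and LZ4 need third-party bindings, so they are not shown); the page contents here are invented sample data.

```python
import zlib

# Repetitive page-like data compresses well; higher levels spend more CPU
# searching for matches in exchange for smaller output.
page = b"key0001:value,key0002:value,key0003:value," * 50

fast = zlib.compress(page, level=1)    # cheaper CPU, larger output
best = zlib.compress(page, level=9)    # more CPU, smaller output

print(len(page), len(fast), len(best))
assert zlib.decompress(best) == page   # compression is lossless
```

This is the same knob the engine exposes: snappy/LZ4 sit at the "fast" end of this spectrum, zlib at the "better compression, higher overhead" end.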
Compression in Action
[Chart: compression results for the Flights database.]
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
➤ Durability and the journal
• Future features
Journal and Recovery
• Write-ahead logging (aka the journal), enabled by default
• Only written at transaction commit
  – only redo records are written
• Log records are compressed
• Group commit for concurrency
• Automatic log archival/removal
  – bounded by checkpoint frequency
• On startup, find a consistent checkpoint in the metadata
  – use the checkpoint to determine how far to roll forward
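The commit-then-roll-forward flow above can be sketched as follows. The record format, LSN handling, and recovery logic are assumptions for illustration, not WiredTiger's journal format.

```python
# A minimal redo-log sketch: committed transactions append redo records;
# recovery starts from the last consistent checkpoint and replays
# everything logged after it.
def commit(log, txn_updates):
    # One redo record per update, appended only at commit time.
    log.extend(("redo", key, value) for key, value in txn_updates)

def recover(checkpoint_data, log, checkpoint_lsn):
    # Start from the checkpointed state, then roll the log forward.
    data = dict(checkpoint_data)
    for _, key, value in log[checkpoint_lsn:]:
        data[key] = value
    return data

log = []
commit(log, [("a", 1)])
checkpoint = {"a": 1}               # checkpoint captures state at this point
checkpoint_lsn = len(log)
commit(log, [("a", 2), ("b", 3)])   # committed after the checkpoint

print(recover(checkpoint, log, checkpoint_lsn))   # {'a': 2, 'b': 3}
```

This is also why log removal is "bounded by checkpoint frequency": once a checkpoint is durable, records before its LSN are never needed again and can be archived or deleted.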
Durability without Journaling
• MongoDB’s MMAP storage engine requires the journal for consistency
  – running with “nojournal” is unsafe
• WiredTiger is a no-overwrite data store
  – with “nojournal”, updates since the last checkpoint may be lost
  – data will still be consistent
  – checkpoints run every N seconds by default
• Replication can guarantee durability
  – the network is generally faster than disk I/O
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
✓ Durability and the journal
➤ Future features
What’s next for WiredTiger?
• Our Big Year of Tuning
  – applications doing “interesting” things
  – stalls during checkpoints with 100GB+ caches
  – MongoDB capped collections
• Encryption
• Advanced transactional semantics
  – updates not stable until confirmed by a replica majority
WiredTiger LSM support
• Random insert workloads
• Data set much larger than cache
• Query performance less important
• Background maintenance overhead acceptable
• Bloom filters
Benchmarks
Mark Callaghan at Facebook: http://smalldatum.blogspot.com/