A Technical Introduction to WiredTiger
TRANSCRIPT
Presenter: Keith Bostic, co-architect of WiredTiger and Senior Staff Engineer at MongoDB
Ask questions as we go, or email [email protected]
This presentation is not…
• How to write stand-alone WiredTiger apps – contact [email protected]
• How to configure MongoDB with WiredTiger for your workload
WiredTiger
• Embedded database engine
  – general-purpose toolkit
  – high performance: scalable throughput with low latency
• Key-value store (NoSQL)
• Schema layer
  – data typing, indexes
• Single-node
• OO APIs
  – Python, C, C++, Java
• Open source
Deployments
• Amazon AWS
• ORC/Tbricks: financial trading solution
And, most important of all:
• MongoDB: next-generation document store
MongoDB’s Storage Engine API
• Allows different storage engines to "plug in"
  – different workloads have different performance characteristics
  – mmapV1 is not ideal for all workloads
  – more flexibility
  – mix storage engines in the same replica set/sharded cluster
• Opportunity to innovate further
  – HDFS, encrypted, other workloads
• WiredTiger is MongoDB’s general-purpose workhorse
Topics
➤ WiredTiger Architecture
• In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
Motivation for WiredTiger
• Traditional engines struggle with modern hardware:
  – lots of CPU cores
  – lots of RAM
• Avoid thread contention for resources
  – lock-free algorithms, for example, hazard pointers
  – concurrency control without blocking
• Hotter cache, more work per I/O
  – big blocks
  – compact file formats
  – compression
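The hazard-pointer idea mentioned above can be sketched roughly as follows. This is an illustrative, single-threaded Python sketch, not WiredTiger's lock-free C implementation; the class and method names are invented for the example.

```python
# A minimal hazard-pointer sketch (illustrative only). A reader "publishes"
# the page it is about to dereference; the eviction server may only free
# pages that no reader has published.

class HazardTable:
    def __init__(self):
        self.slots = {}          # reader id -> page currently in use

    def acquire(self, reader_id, page):
        # Publish the hazard before dereferencing the page. In the real
        # engine this store is followed by a memory flush and a re-check
        # that the page was not evicted in the meantime.
        self.slots[reader_id] = page
        return page

    def release(self, reader_id):
        self.slots.pop(reader_id, None)

    def safe_to_evict(self, page):
        # The evictor scans every hazard slot; a page may be freed only
        # if no reader holds a hazard pointer to it.
        return all(p is not page for p in self.slots.values())

table = HazardTable()
page = {"id": 42, "data": "leaf page"}

table.acquire("reader-1", page)
print(table.safe_to_evict(page))   # False: reader-1 is still using it
table.release("reader-1")
print(table.safe_to_evict(page))   # True: eviction may proceed
```

The point of the scheme is that readers never take locks: publishing a hazard pointer is a single store, so tree traversal stays contention-free.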
WiredTiger Architecture
[Architecture diagram: the Python, C, and Java APIs sit above a schema and cursor layer; the WiredTiger engine beneath provides transactions, snapshots, row and column storage, page read/write, block management, the cache, and logging; data persists in database files and log files.]
Column-store, LSM
• Column-store
  – implemented inside the B+tree
  – 64-bit record-number keys
  – a record’s key is implied by its position in the tree
  – variable-length or fixed-length values
• LSM
  – forest of B+trees (row-store or column-store)
  – Bloom filters (fixed-length column-store)
• Mix-and-match
  – sparse, wide table: column-store primary, LSM indexes
Topics
✓ WiredTiger Architecture
➤ In-memory performance
• Record-level concurrency
• Compression
• Durability and the journal
• Future features
Trees in cache
[Diagram: a B+tree in cache, with a root page, internal pages, and leaf pages linked by ordinary pointers; some child pages are non-resident, i.e., not yet read into cache.]
Hazard Pointers
[Diagram: the same B+tree; before dereferencing a pointer to a page, a reader publishes a hazard pointer and issues a memory flush, so the eviction server will not free the page out from under it.]
Pages in cache
[Diagram: data files hold on-disk page images; in cache, a clean page is just the on-disk page image plus an index, while a dirty page additionally carries a list of updates.]
Skiplists
• Updates stored in skiplists
  – ordered linked lists with forward “skip” pointers
• William Pugh, 1989
  – simpler than balanced trees, about as fast as binary search, less space
  – likely binary-search performance plus cache prefetch
  – more space for an existing data set
• Implementation
  – insert without locking
  – forward/backward traversal without locking, concurrent with inserts
  – removal requires locking
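A skiplist of the kind described can be sketched as below. This is a simplified, single-threaded Python sketch; WiredTiger's C implementation additionally makes insert and traversal lock-free, which this sketch does not attempt.

```python
import random

# A minimal skiplist: each node carries one forward pointer per level,
# so searches "skip" ahead at the top levels and narrow down.
MAX_LEVEL = 4

class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level   # one "skip" pointer per level

class SkipList:
    def __init__(self):
        self.head = Node(None, None, MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, key, value):
        update = [self.head] * MAX_LEVEL
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        new = Node(key, value, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            # In a lock-free version, this single pointer store is the
            # atomic step that makes the insert visible to readers.
            update[lvl].forward[lvl] = new

    def search(self, key):
        node = self.head
        for lvl in range(MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node.value if node and node.key == key else None

sl = SkipList()
for k in [30, 10, 20]:
    sl.insert(k, f"value-{k}")
print(sl.search(20))   # value-20
print(sl.search(25))   # None
```

Because an insert becomes visible through a single pointer store, concurrent readers either see the new node or they don't; they never see a half-linked structure, which is what makes the lock-free variant possible.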
In-memory performance
• Cache trees/pages optimized for in-memory access
• Follow pointers to traverse a tree
• No locking to read or write
• Keep updates separate from the initial data
  – updates are stored in skiplists
  – updates are atomic in almost all cases
• Do structural changes (eviction, splits) in background threads
Topics ü WiredTiger Architecture ü In-memory performance Ø Record-level concurrency • Compression • Durability and the journal • Future features
Multiversion Concurrency Control (MVCC)
• Multiple versions of records maintained in cache
• Readers see the most recently committed version
  – read-uncommitted or snapshot isolation also available
  – configurable per-transaction or per-handle
• Writers can create new versions concurrently with readers
• Concurrent updates to a single record cause write conflicts
  – one of the updates wins
  – the other generally retries with back-off
• No locking, no lock manager
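The version-chain idea behind MVCC can be sketched as follows. This is an illustrative Python sketch: the global transaction-id counter and the `Record` class stand in for WiredTiger's real snapshot machinery, and write-conflict detection is omitted for brevity.

```python
import itertools

# A minimal MVCC sketch: each record keeps a chain of versions tagged with
# the committing transaction's id; readers pick the newest version visible
# to their snapshot.
txn_ids = itertools.count(1)

class Record:
    def __init__(self):
        self.versions = []          # (commit_txn_id, value), newest last

    def write(self, txn_id, value):
        # A concurrent uncommitted update to the same record would be a
        # write conflict (one update wins, the other retries); omitted here.
        self.versions.append((txn_id, value))

    def read(self, snapshot_id):
        # Snapshot isolation: see the newest version committed at or
        # before the reader's snapshot, ignoring anything later.
        for commit_id, value in reversed(self.versions):
            if commit_id <= snapshot_id:
                return value
        return None

rec = Record()
rec.write(next(txn_ids), "v1")      # committed as txn 1
snapshot = 1                        # a reader takes a snapshot now
rec.write(next(txn_ids), "v2")      # txn 2 commits afterwards

print(rec.read(snapshot))           # v1: txn 2 is invisible to the snapshot
print(rec.read(2))                  # v2: a later snapshot sees the new version
```

Readers never block writers and writers never block readers; the only coordination point is between two writers touching the same record, which is exactly the write-conflict case above.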
Pages in cache
[Diagram: as on the earlier “Pages in cache” slide, but the dirty page’s updates are now shown stored in a skiplist.]
MVCC In Action
[Diagram: starting from an on-disk page image and its index, each update(txn, value) call prepends a new version to the record’s update chain, so readers find the newest visible version in front of the on-disk image.]
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
➤ Compression
• Durability and the journal
• Future features
Block manager
• Block allocation
  – fragmentation
  – allocation policy
• Checksums
  – block compression is done at a higher level
• Checkpoints
  – involved in durability guarantees
• Opaque address cookie
  – stored as the internal page key’s “value”
• Pluggable
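The checksum and address-cookie ideas can be sketched together. The block layout, cookie contents, and choice of CRC32 here are assumptions for illustration, not WiredTiger's actual on-disk format.

```python
import zlib

# A minimal block-write/read sketch: the writer returns an opaque cookie
# (offset, size, checksum); the reader uses it to fetch and verify the block.
def write_block(storage, data):
    checksum = zlib.crc32(data)
    offset = len(storage)
    storage.extend(data)
    # The "address cookie" is all a caller needs to get the block back:
    # where it is, how big it is, and how to verify it.
    return (offset, len(data), checksum)

def read_block(storage, cookie):
    offset, size, checksum = cookie
    data = bytes(storage[offset:offset + size])
    if zlib.crc32(data) != checksum:
        raise IOError("block checksum mismatch: corrupt or misdirected read")
    return data

storage = bytearray()
cookie = write_block(storage, b"leaf page contents")
print(read_block(storage, cookie))   # b'leaf page contents'
```

Because the cookie is opaque to the rest of the engine, the block manager can be swapped out ("pluggable") without the B+tree layer knowing anything about the addressing scheme.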
Write path
[Diagram: when a dirty page is written back to the data file, its on-disk page image and its update list are reconciled during the write into a new page image.]
In-memory Compression
• Prefix compression
  – index keys usually share a common prefix
  – rolling, per-block; requires key instantiation for performance
• Huffman/static encoding
  – burns CPU
• Dictionary lookup
  – a single copy of each repeated value per page
• Run-length encoding
  – column-store values
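Prefix compression, the first item above, can be sketched as follows. The key layout is an assumption for illustration, not WiredTiger's page format: each key stores only the byte count it shares with the previous key, plus its unique suffix.

```python
# A minimal rolling prefix-compression sketch over sorted keys.
def compress_keys(keys):
    out, prev = [], b""
    for key in keys:
        shared = 0
        while shared < min(len(prev), len(key)) and prev[shared] == key[shared]:
            shared += 1
        out.append((shared, key[shared:]))   # (shared prefix length, suffix)
        prev = key
    return out

def decompress_keys(entries):
    keys, prev = [], b""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys

keys = [b"customer/0001", b"customer/0002", b"customer/0010"]
packed = compress_keys(keys)
print(packed)   # [(0, b'customer/0001'), (12, b'2'), (11, b'10')]
print(decompress_keys(packed) == keys)   # True
```

The trade-off the slide mentions: reconstructing a key requires walking back through earlier entries, so for fast repeated searches the engine "instantiates" the full keys in memory.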
On-disk Compression
• Compression algorithms
  – snappy [default]: good compression, low overhead
  – LZ4: good compression, low overhead, better page layout
  – zlib: better compression, higher overhead
  – pluggable
• Optional
  – use a compressing filesystem instead
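The compression-versus-CPU trade-off can be seen directly with zlib, which ships with Python (snappy and LZ4 need third-party bindings, so they are not shown); the page contents here are invented sample data.

```python
import zlib

# Repetitive page-like data compresses well; higher levels spend more CPU
# searching for matches in exchange for smaller output.
page = b"key0001:value,key0002:value,key0003:value," * 50

fast = zlib.compress(page, level=1)    # cheaper CPU, larger output
best = zlib.compress(page, level=9)    # more CPU, smaller output

print(len(page), len(fast), len(best))
assert zlib.decompress(best) == page   # compression is lossless
```

This is the same knob the engine exposes: snappy/LZ4 sit at the "fast" end of this spectrum, zlib at the "better compression, higher overhead" end.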
Compression in Action
[Chart: compression results for the Flights database.]
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
➤ Durability and the journal
• Future features
Journal and Recovery
• Write-ahead logging (aka the journal), enabled by default
• Only written at transaction commit
  – only redo records are written
• Log records are compressed
• Group commit for concurrency
• Automatic log archival/removal
  – bounded by checkpoint frequency
• On startup, find a consistent checkpoint in the metadata
  – use the checkpoint to determine how far to roll forward
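The commit-then-roll-forward flow above can be sketched as follows. The record format, LSN handling, and recovery logic are assumptions for illustration, not WiredTiger's journal format.

```python
# A minimal redo-log sketch: committed transactions append redo records;
# recovery starts from the last consistent checkpoint and replays
# everything logged after it.
def commit(log, txn_updates):
    # One redo record per update, appended only at commit time.
    log.extend(("redo", key, value) for key, value in txn_updates)

def recover(checkpoint_data, log, checkpoint_lsn):
    # Start from the checkpointed state, then roll the log forward.
    data = dict(checkpoint_data)
    for _, key, value in log[checkpoint_lsn:]:
        data[key] = value
    return data

log = []
commit(log, [("a", 1)])
checkpoint = {"a": 1}               # checkpoint captures state at this point
checkpoint_lsn = len(log)
commit(log, [("a", 2), ("b", 3)])   # committed after the checkpoint

print(recover(checkpoint, log, checkpoint_lsn))   # {'a': 2, 'b': 3}
```

This is also why log removal is "bounded by checkpoint frequency": once a checkpoint is durable, records before its LSN are never needed again and can be archived or deleted.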
Durability without Journaling
• MongoDB’s MMAP storage engine requires the journal for consistency
  – running with “nojournal” is unsafe
• WiredTiger is a no-overwrite data store
  – with “nojournal”, updates since the last checkpoint may be lost
  – data will still be consistent
  – checkpoints run every N seconds by default
• Replication can guarantee durability
  – the network is generally faster than disk I/O
Topics
✓ WiredTiger Architecture
✓ In-memory performance
✓ Record-level concurrency
✓ Compression
✓ Durability and the journal
➤ Future features
What’s next for WiredTiger?
• Our Big Year of Tuning
  – applications doing “interesting” things
  – stalls during checkpoints with 100GB+ caches
  – MongoDB capped collections
• Encryption
• Advanced transactional semantics
  – updates not stable until confirmed by a replica majority
WiredTiger LSM support
• Random insert workloads
• Data set much larger than cache
• Query performance less important
• Background maintenance overhead acceptable
• Bloom filters
Benchmarks
Mark Callaghan at Facebook: http://smalldatum.blogspot.com/