a technical introduction to wiredtiger

38
A Technical Introduction to WiredTiger [email protected]

Upload: mongodb

Post on 06-Jan-2017

1.406 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: A Technical Introduction to WiredTiger

A Technical Introduction to WiredTiger

[email protected]

Page 2: A Technical Introduction to WiredTiger

Presenter •  Keith Bostic •  Co-architect WiredTiger •  Senior Staff Engineer MongoDB

Ask questions as we go, or [email protected]

Page 3: A Technical Introduction to WiredTiger

3

This presentation is not…

•  How to write stand-alone WiredTiger apps – contact [email protected]

•  How to configure MongoDB with WiredTiger for your workload

Page 4: A Technical Introduction to WiredTiger

4

WiredTiger

•  Embedded database engine – general purpose toolkit – high performing: scalable throughput with low latency

•  Key-value store (NoSQL) •  Schema layer

– data typing, indexes •  Single-node •  OO APIs

– Python, C, C++, Java •  Open Source

Page 5: A Technical Introduction to WiredTiger

5

Deployments

•  Amazon AWS •  ORC/Tbricks: financial trading solution And, most important of all: •  MongoDB: next-generation document store

Page 6: A Technical Introduction to WiredTiger

You may have seen this:

Page 7: A Technical Introduction to WiredTiger

or this…

Page 8: A Technical Introduction to WiredTiger

8

MongoDB’s Storage Engine API

•  Allows different storage engines to "plug-in" – different workloads have different performance characteristics – mmapV1 is not ideal for all workloads – more flexibility

• mix storage engines on same replica set/sharded cluster •  Opportunity to innovate further

– HDFS, encrypted, other workloads •  WiredTiger is MongoDB’s general-purpose workhorse

Page 9: A Technical Introduction to WiredTiger

Topics Ø  WiredTiger Architecture •  In-memory performance •  Record-level concurrency •  Compression •  Durability and the journal •  Future features

Page 10: A Technical Introduction to WiredTiger

10

Motivation for WiredTiger

•  Traditional engines struggle with modern hardware: –  lots of CPU cores –  lots of RAM

•  Avoid thread contention for resources –  lock-free algorithms, for example, hazard pointers – concurrency control without blocking

•  Hotter cache, more work per I/O – big blocks – compact file formats

– compression – big blocks

Page 11: A Technical Introduction to WiredTiger

11

WiredTiger Architecture

WiredTiger Engine

Schema &Cursors

Python API C API Java API

Database Files

Transactions

Pageread/write

Logging

Column storage

Block management

Rowstorage Snapshots

Log Files

Cache

Page 12: A Technical Introduction to WiredTiger

12

Column-store, LSM

•  Column-store –  implemented inside the B+tree – 64-bit record number keys – valued by the key’s position in the tree – variable-length or fixed-length

•  LSM –  forest of B+trees (row-store or column-store) – bloom filters (fixed-length column-store)

•  Mix-and-match – sparse, wide table: column-store primary, LSM indexes

Page 13: A Technical Introduction to WiredTiger

Topics ü  WiredTiger Architecture Ø  In-memory performance •  Record-level concurrency •  Compression •  Durability and the journal •  Future features

Page 14: A Technical Introduction to WiredTiger

14

Trees in cache

non-residentchild

ordinary pointerroot page

internal page

internal page

root page

leaf page

leaf page leaf page leaf page

Page 15: A Technical Introduction to WiredTiger

15

Hazard Pointers

non-residentchild

ordinary pointerroot page

internal page

internal page

root page

leaf page

leaf page leaf page leaf page

1 memory flush

Page 16: A Technical Introduction to WiredTiger

16

Pages in cache cache

data files

page images

on-disk pageimage

index

cleanpage on-disk

pageimage

indexdirtypage

updates

Page 17: A Technical Introduction to WiredTiger

17

Skiplists

•  Updates stored in skiplists – ordered linked lists with forward “skip” pointers

•  William Pugh, 1989 – simpler, as fast as binary-search, less space –  likely binary-search performance plus cache prefetch – more space for an existing data set

•  Implementation –  insert without locking –  forward/backward traversal without locking, while inserting –  removal requires locking

Page 18: A Technical Introduction to WiredTiger

18

In-memory performance

•  Cache trees/pages optimized for in-memory access •  Follow pointers to traverse a tree •  No locking to read or write •  Keep updates separate from initial data

– updates are stored in skiplists – updates are atomic in almost all cases

•  Do structural changes (eviction, splits) in background threads

Page 19: A Technical Introduction to WiredTiger

Topics ü  WiredTiger Architecture ü  In-memory performance Ø  Record-level concurrency •  Compression •  Durability and the journal •  Future features

Page 20: A Technical Introduction to WiredTiger

20

Multiversion Concurrency Control (MVCC)

•  Multiple versions of records maintained in cache •  Readers see most recently committed version

–  read-uncommitted or snapshot isolation available – configurable per-transaction or per-handle

•  Writers can create new versions concurrent with readers •  Concurrent updates to a single record cause write conflicts

– one of the updates wins – other generally retries with back-off

•  No locking, no lock manager

Page 21: A Technical Introduction to WiredTiger

21

Pages in cache cache

data files

page images

on-disk pageimage

index

cleanpage on-disk

pageimage

indexdirtypage

updates

skiplist

Page 22: A Technical Introduction to WiredTiger

22

MVCC In Action

on-diskpageimage

index

update1(txn, value)

on-diskpageimage

index

update2(txn, value)

update1(txn, value)

update

on-diskpageimage

indexupdate

Page 23: A Technical Introduction to WiredTiger

Topics ü  WiredTiger Architecture ü  In-memory performance ü  Record-level concurrency Ø  Compression •  Durability and the journal •  Future features

Page 24: A Technical Introduction to WiredTiger

24

Block manager

•  Block allocation –  fragmentation – allocation policy

•  Checksums – block compression is at a higher level

•  Checkpoints –  involved in durability guarantees

•  Opaque address cookie – stored as internal page key’s “value”

•  Pluggable

Page 25: A Technical Introduction to WiredTiger

25

Write path cache

data files

page images

on-disk pageimage

index

cleanpage on-disk

pageimage

indexdirtypage

updates

reconciled during write

Page 26: A Technical Introduction to WiredTiger

26

In-memory Compression

•  Prefix compression –  index keys usually have a common prefix –  rolling, per-block, requires instantiation for performance

•  Huffman/static encoding – burns CPU

•  Dictionary lookup – single value per page

•  Run-length encoding – column-store values

Page 27: A Technical Introduction to WiredTiger

27

On-disk Compression

•  Compression algorithms: – snappy [default]: good compression, low overhead – LZ4: good compression, low overhead, better page layout – zlib: better compression, high overhead – pluggable

•  Optional – compressing filesystem instead

Page 28: A Technical Introduction to WiredTiger

28

Compression in Action

Flights database

Page 29: A Technical Introduction to WiredTiger

Topics ü  WiredTiger Architecture ü  In-memory performance ü  Record-level concurrency ü  Compression Ø  Durability and the journal •  Future features

Page 30: A Technical Introduction to WiredTiger

30

Journal and Recovery

•  Write-ahead logging (aka journal) enabled by default •  Only written at transaction commit

– only write redo records •  Log records are compressed •  Group commit for concurrency •  Automatic log archival / removal

– bounded by checkpoint frequency •  On startup, find a consistent checkpoint in the metadata

– use the checkpoint to figure out how much to roll forward

Page 31: A Technical Introduction to WiredTiger

31

Durability without Journaling

•  MongoDB’s MMAP storage requires the journal for consistency –  running with “nojournal” is unsafe

•  WiredTiger is a no-overwrite data store – with “nojournal”, updates since the last checkpoint may be lost – data will still be consistent – checkpoints every N seconds by default

•  Replication can guarantee durability –  the network is generally faster than disk I/O

Page 32: A Technical Introduction to WiredTiger

Topics ü  WiredTiger Architecture ü  In-memory performance ü  Record-level concurrency ü  Compression ü  Durability and the journal Ø  Future features

Page 33: A Technical Introduction to WiredTiger

33

Page 34: A Technical Introduction to WiredTiger

34

What’s next for WiredTiger?

•  Our Big Year of Tuning – applications doing “interesting” things – stalls during checkpoints with 100GB+ caches – MongoDB capped collections

•  Encryption •  Advanced transactional semantics

– updates not stable until confirmed by replica majority

Page 35: A Technical Introduction to WiredTiger

35

WiredTiger LSM support

•  Random insert workloads •  Data set much larger than cache •  Query performance less important •  Background maintenance overhead acceptable •  Bloom filters

Page 36: A Technical Introduction to WiredTiger

36

Page 37: A Technical Introduction to WiredTiger

37

Benchmarks

Mark Callaghan at Facebook: http://smalldatum.blogspot.com/

Page 38: A Technical Introduction to WiredTiger

Thanks!

Questions?

Keith Bostic [email protected]