monday, november 26, 12 - archive.apachecon.comarchive.apachecon.com/eu2012/presentations/08... ·...
Monday, November 26, 12
The CouchDB Implementation
Jan [email protected]
@janl
CouchDB PMC Chair
Committer #2
Thanks for the invite. Glad to be here.
JSConf EU / BBuzz / JSFAB
Any Database
- FS integration / raw storage
- core data structures
- core features
- API
FILE SYSTEM
FILE SYSTEM ACCESS
CORE DATA STRUCTURES
CORE FEATURES
API
CouchDB is no different
- couch_file
- couch_btree
- couch_db / couch_doc / couch_mr / couch_replicator / etc.
- couch_httpd*
FILE SYSTEM
COUCH_FILE
COUCH_BTREE
COUCH_DB + COUCH_DOC, COUCH_MR, COUCH_REPL…
COUCH_HTTP*
Core Data Structures
- 2 x b+tree & data interleaved
- append only, MVCC
- full fsync control
- Can answer:
  - Data for $key
  - What happened $since
- Used for core data storage
- As well as indexes
- Everything else is built on top
Behold the b+tree
by-id
A B C D E F G H I J K L …
A B C D E F DOC_G H I J K L …
A B C DOC_D E F DOC_G H I J K L …
A B C DOC_D E F DOC_G H I J DOC_K L …
DOC_D DOC_G DOC_K
by-seq
or “what happened since?”
1. DOC_G
2. DOC_D
3. DOC_K
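The two lookups on these slides ("data for $key" and "what happened $since") can be sketched in a few lines of Python. This is a toy model only: a dict and a sorted list stand in for the two b+trees, and the class and method names (`TwoIndexStore`, `put`, `changes_since`) are made up for illustration, not CouchDB's.

```python
from bisect import bisect_left


class TwoIndexStore:
    """Toy stand-in for the two trees: one keyed by doc id, one by update sequence."""

    def __init__(self):
        self.by_id = {}      # doc id -> (seq, body)
        self.by_seq = []     # sorted list of (seq, doc id), one entry per doc
        self.update_seq = 0

    def put(self, doc_id, body):
        self.update_seq += 1
        old = self.by_id.get(doc_id)
        if old is not None:
            # A doc shows up only once in by-seq: drop its previous entry.
            self.by_seq.remove((old[0], doc_id))
        self.by_id[doc_id] = (self.update_seq, body)
        # Sequence numbers only grow, so appending keeps the list sorted.
        self.by_seq.append((self.update_seq, doc_id))
        return self.update_seq

    def get(self, doc_id):
        """Answer 'data for $key'."""
        return self.by_id[doc_id][1]

    def changes_since(self, since):
        """Answer 'what happened $since'."""
        start = bisect_left(self.by_seq, (since + 1,))
        return self.by_seq[start:]


store = TwoIndexStore()
for name in ["DOC_G", "DOC_D", "DOC_K"]:   # same write order as the by-seq slide
    store.put(name, {"name": name})
```

After these three writes, `sorted(store.by_id)` gives the by-id order (DOC_D, DOC_G, DOC_K) while `store.changes_since(0)` gives the write order (DOC_G, DOC_D, DOC_K), which is the difference the two slides illustrate.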
The CouchDB File Format
[Diagram, built up one block per slide: HEADER, DOC_A, BY ID IDX A, BY SEQ IDX A, FOOTER, DOC_B, BY ID IDX B, BY SEQ IDX B, FOOTER]
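A rough sketch of the append-only idea in the diagram: every write goes to the end of the file, and a footer written last points at the current index roots. The byte layout below (a `FOOT` marker, length prefixes, JSON bodies) is invented for this example and is not CouchDB's real on-disk format; the function names are hypothetical too.

```python
import io
import json
import struct

MAGIC = b"FOOT"  # hypothetical marker, chosen for this sketch only


def append_block(f, payload: bytes) -> int:
    """Append a length-prefixed block at the end of the file; return its offset."""
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    f.write(struct.pack(">I", len(payload)) + payload)
    return pos


def read_block(f, pos: int) -> bytes:
    f.seek(pos)
    (size,) = struct.unpack(">I", f.read(4))
    return f.read(size)


def commit(f, by_id_root: int, by_seq_root: int) -> None:
    """Append a footer pointing at the current roots of both indexes."""
    body = json.dumps({"by_id": by_id_root, "by_seq": by_seq_root}).encode()
    f.seek(0, io.SEEK_END)
    f.write(MAGIC + struct.pack(">I", len(body)) + body)
    f.flush()  # a real store would also fsync here


def last_footer(f):
    """Scan backwards for the most recent complete footer, or None."""
    f.seek(0)
    data = f.read()
    pos = data.rfind(MAGIC)
    while pos != -1:
        start = pos + len(MAGIC)
        if start + 4 <= len(data):
            (size,) = struct.unpack(">I", data[start:start + 4])
            body = data[start + 4:start + 4 + size]
            if len(body) == size:
                try:
                    return json.loads(body)
                except ValueError:
                    pass  # damaged footer: keep looking further back
        pos = data.rfind(MAGIC, 0, pos)
    return None


f = io.BytesIO()
doc_a = append_block(f, b'{"_id": "DOC_A"}')
by_id = append_block(f, json.dumps({"DOC_A": doc_a}).encode())
by_seq = append_block(f, json.dumps({"1": doc_a}).encode())
commit(f, by_id, by_seq)
append_block(f, b'{"_id": "DOC_B"}')  # simulate a crash before the next footer
footer = last_footer(f)
```

Because nothing is overwritten, the DOC_B block that never got its footer is simply ignored on reopen: `footer` still points at the state that contained only DOC_A.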
Bulk add + Delete
[Diagram, built up one block per slide: DOC_A, DOC_B, BY ID IDX A, BY ID IDX B, BY SEQ IDX A, BY SEQ IDX B, FOOTER, DEL DOC_A, BY ID IDX A, BY SEQ IDX A, FOOTER]
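The write order in this diagram can be replayed with a small simulation. It is a sketch under the assumption that the slide shows the literal append order; the class `AppendOnlyDb` and its string labels exist only to mirror the boxes in the diagram.

```python
class AppendOnlyDb:
    """Records what gets appended, in order; nothing is ever removed."""

    def __init__(self):
        self.log = []    # stand-in for the database file
        self.live = {}   # doc name -> True if the latest revision is not a deletion

    def _commit(self, names, deleted):
        # 1. document bodies (or deletion stubs)
        for name in names:
            self.log.append(("DEL " if deleted else "") + "DOC_" + name)
        # 2. by-id index updates
        for name in names:
            self.live[name] = not deleted
            self.log.append("BY ID IDX " + name)
        # 3. by-seq index updates
        for name in names:
            self.log.append("BY SEQ IDX " + name)
        # 4. a single footer makes the whole batch visible
        self.log.append("FOOTER")

    def bulk_add(self, names):
        self._commit(names, deleted=False)

    def delete(self, name):
        self._commit([name], deleted=True)


db = AppendOnlyDb()
db.bulk_add(["A", "B"])
db.delete("A")
```

Two things stand out in `db.log`: the bulk add pays for one footer for both documents, and the delete makes the file longer rather than shorter, since the old DOC_A stays where it is until compaction.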
Operational Consequences
- efficient on spinning disk, “tape”
- btree = wide, upper layers in disk cache
- backup with cp $a $b
- compaction hurts
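To put rough numbers on "btree = wide, upper layers in disk cache", here is a small calculation. The fanout of 100 keys per node is an assumption picked for round numbers, not a measured CouchDB value.

```python
def btree_depth(n_keys: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** depth >= n_keys."""
    depth, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        depth += 1
    return depth


def upper_nodes(depth: int, fanout: int) -> int:
    """Number of nodes above the leaf level in a full tree of this depth."""
    return sum(fanout ** level for level in range(depth - 1))


depth = btree_depth(1_000_000, fanout=100)   # assumed fanout
cached = upper_nodes(depth, fanout=100)
```

With these assumptions a million keys need 3 levels, and only 101 nodes sit above the leaves. Keeping that handful of nodes in the disk cache leaves roughly one real disk read per lookup.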
Core Features (using by-seq)
- Replication
- Indexing / Views / GeoCouch / Lucene / ES etc.
- /_changes
- Compaction
Replication
[Diagram, built up one step per slide: DATABASE A receives updates 1 to 4; DATABASE B is added and receives 1 to 4; DATABASE A receives 5 to 8; DATABASE B receives 5 to 8]
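The replication slides boil down to "read by-seq from where you left off". A minimal sketch of that loop follows, assuming a toy in-memory `Db` class; real replication also compares revisions and stores its checkpoint durably, both of which are left out here.

```python
class Db:
    """Toy database: latest body per doc plus the seq of its last update."""

    def __init__(self):
        self.docs = {}        # doc id -> body
        self.seqs = {}        # doc id -> seq of latest update
        self.update_seq = 0

    def put(self, doc_id, body):
        self.update_seq += 1
        self.docs[doc_id] = body
        self.seqs[doc_id] = self.update_seq

    def changes_since(self, since):
        rows = [(seq, doc_id, self.docs[doc_id])
                for doc_id, seq in self.seqs.items() if seq > since]
        return sorted(rows, key=lambda row: row[0])


def replicate(source, target, checkpoint=0):
    """Copy every change after `checkpoint` to target; return the new checkpoint."""
    for seq, doc_id, body in source.changes_since(checkpoint):
        target.put(doc_id, body)
        checkpoint = seq
    return checkpoint


a, b = Db(), Db()
for n in range(1, 5):
    a.put(f"doc{n}", {"n": n})
checkpoint = replicate(a, b)                 # B catches up on 1 to 4
for n in range(5, 9):
    a.put(f"doc{n}", {"n": n})
checkpoint = replicate(a, b, checkpoint)     # second run only moves 5 to 8
```

The second call never looks at the first four documents again, which is the point of keeping a by-seq index.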
Indexing
[Diagram, built up one step per slide: DATABASE A receives updates 1 to 4; INDEX A is added and receives 1 to 4; DATABASE A receives 5 to 8; INDEX A receives 5 to 8]
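Indexing follows the same pattern as replication: the index remembers the last sequence it has seen and only processes newer changes. The sketch below assumes changes arrive as `(seq, doc_id, doc)` tuples and uses a made-up `ViewIndex` class with a Python generator as the map function. It illustrates the idea and is not CouchDB's view engine.

```python
class ViewIndex:
    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.rows = {}       # doc id -> list of (key, value) emitted for that doc
        self.last_seq = 0    # highest by-seq entry already indexed

    def update(self, changes):
        """changes: iterable of (seq, doc_id, doc) in seq order."""
        for seq, doc_id, doc in changes:
            if seq <= self.last_seq:
                continue  # already indexed in an earlier run
            # Re-emitting replaces whatever an older revision produced.
            self.rows[doc_id] = list(self.map_fn(doc))
            self.last_seq = seq

    def query(self):
        return sorted((key, doc_id, value)
                      for doc_id, rows in self.rows.items()
                      for key, value in rows)


def by_tag(doc):
    if "tag" in doc:
        yield doc["tag"], 1


index = ViewIndex(by_tag)
changes = [(1, "a", {"tag": "x"}), (2, "b", {"tag": "y"})]
index.update(changes)
first = index.query()

changes.append((3, "a", {"tag": "z"}))   # doc "a" is updated
index.update(changes)                    # only seq 3 is processed
second = index.query()
```

On the second `update` the index skips seq 1 and 2, re-runs the map function for doc "a" alone, and its old row under key "x" disappears.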
/_changes
[Diagram, built up one step per slide: DATABASE A receives updates 1 to 8]
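From the client side, `/_changes` is the by-seq index exposed over HTTP. The helper below builds a request URL with the `since` and `feed` query parameters and folds a response into local state. The host, database name and sample response body are hand-written for illustration, following the shape CouchDB 1.x returns (`results` rows with `seq` and `id`, plus `last_seq`); nothing here talks to a real server.

```python
import json
from urllib.parse import urlencode


def changes_url(base, db, since=0, feed="normal"):
    """Build a /_changes URL that resumes after sequence `since`."""
    return f"{base}/{db}/_changes?" + urlencode({"since": since, "feed": feed})


def apply_changes(response_text, seen):
    """Record the latest seq per doc id; return the seq to resume from."""
    payload = json.loads(response_text)
    for row in payload["results"]:
        seen[row["id"]] = row["seq"]
    return payload["last_seq"]


# Hand-written sample response, not captured from a server.
sample = json.dumps({
    "results": [
        {"seq": 5, "id": "doc5", "changes": [{"rev": "1-aaa"}]},
        {"seq": 6, "id": "doc6", "changes": [{"rev": "1-bbb"}]},
    ],
    "last_seq": 6,
})

seen = {}
since = apply_changes(sample, seen)
next_url = changes_url("http://127.0.0.1:5984", "database_a", since, "longpoll")
```

A consumer would loop: fetch `next_url`, apply the rows, and use the returned `last_seq` as the next `since`.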
Compaction
[Diagram, built up one step per slide: DATABASE A holds 1. DOC_A, 2. DOC_B, 3. DOC_C, 4. DOC_A, 5. DOC_D, 6. DOC_B, 7. DOC_F, 8. DOC_G; COMPACT A receives 3. DOC_C, 4. DOC_A, 5. DOC_D, 6. DOC_B, 7. DOC_F, 8. DOC_G]
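The compaction diagram can be reproduced with a short function: walk the writes in sequence order and copy only the newest entry for each document into the new file. This is a simplified sketch of the idea, assuming the whole write history fits in a list; the function name `compact` is ours.

```python
def compact(write_log):
    """write_log: list of (seq, doc_id) in write order, superseded writes included.

    Returns the entries that survive compaction, still in seq order.
    """
    latest = {}
    for seq, doc_id in write_log:
        latest[doc_id] = seq          # later writes overwrite earlier ones
    return [(seq, doc_id) for seq, doc_id in write_log if latest[doc_id] == seq]


database_a = [
    (1, "DOC_A"), (2, "DOC_B"), (3, "DOC_C"), (4, "DOC_A"),
    (5, "DOC_D"), (6, "DOC_B"), (7, "DOC_F"), (8, "DOC_G"),
]
compact_a = compact(database_a)
```

Entries 1 and 2 are dropped because DOC_A and DOC_B were rewritten at 4 and 6, so `compact_a` starts at 3. DOC_C, as on the slides. Sequence numbers are kept as they were, so replication and index checkpoints stay valid after the swap.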
Erlang
- Small codebase
- Efficient in small teams
- Supervision tree
- Isolated processes
- Concurrency
- Portable runtime
- Hard to recruit for
- Steep ramp-up
- Bit of an operational black box (nine nines story)
- Bit hard as an install dependency
COUCH_SERVER
COUCH_HTTP
COUCH_QUERY_SERVER
COUCH_LOG
COUCH_DB
COUCH_DB_UPDATER
COUCH_FILE
Potential Improvements
- Smarter compactor
- Smarter file-storage
- Less custom HTTP handling
- More indexers
The End
Thanks!