Cassandra Summit 2014: Huge Online Genealogical Database Driven by Cassandra

1 © 2014 by Intellectual Reserve, Inc. All rights reserved.

Huge Online Genealogical Database

Driven By Cassandra

Cassandra Summit 2014

John Sumsion

2

Outline

• Introduction to FamilySearch Family Tree

• Outline of Cassandra reimplementation

• Journal-based Consistency Model

• Experience with Cassandra

3

What is FamilySearch?

FamilySearch.org website

Very large single pedigree (Family Tree)

Largest collection of free genealogical records

Largest genealogical library

Family History Department of The Church of Jesus Christ of Latter-day Saints (known as Mormons)

4

Why does FamilySearch exist?

Visit http://mormon.org/family-history/




5

Family Tree

Records Indexing Family Tree

Memories

Community

Where it fits

6

Record Preservation

Neglect

Time

Disasters (e.g. WWII)

7

Record Preservation (continued)

• 100 million images published online / year

8

Indexing

3.5 billion indexed records – 35M / month

Turns this… …into this!

9

Memories

10

Community

11

Family Tree

Records Indexing Family Tree

Memories

Community

Where it fits

12

Family Tree Data

Family Tree:

• 900M+ person records, open-edit

• 500M+ relationships, open-edit

• 8.4B change log entries, 100M+ per quarter

• Dynamic OLTP system

• Data-dependent performance issues

13

Family Tree: Example 9 Gen Pedigree

up to 511 person slots Dynamic content!
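The 511 figure follows from the shape of a full pedigree: each generation doubles the number of ancestors. A minimal sketch of that arithmetic (the function name is illustrative, not from the talk):

```python
def pedigree_slots(generations: int) -> int:
    """Person slots in a full binary pedigree of the given depth."""
    # Generation g (1-based, starting with the root person) holds 2**(g-1)
    # people, so n generations hold 1 + 2 + ... + 2**(n-1) = 2**n - 1 slots.
    return 2 ** generations - 1

print(pedigree_slots(9))  # 511
```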

14

Family Tree: Example Pedigree App

31+ persons per section Dynamic content!

15

Family Tree: Example Ancestor Page

10+ persons in families 100-1000+ changes Dynamic content!

16

Family Tree: Example Change History

100-1000+ changes Dynamic content!

17

Contents





18

Performance & Scale

• Slow page views
  • pedigree (500-3000ms for 3 generations)
  • change history (2000+ms for first page of changes)
  • large family view
• Query problems
  • relationships connect persons, range scan by person id
  • every person => person traversal is a 200-300M btree scan (global index)
  • change history queries traverse an 8+B btree scan (global index)

19

Performance & Scale

• Query performance problems

[Diagram: Pedigree and Change History queries. Labels: Person, Relationship, Person (wide range scan); Change History, Change History (wide range scan)]

20

Cassandra Reimplementation

• selected Cassandra after extensive testing

• full data scale proof-of-concept & tests

• required: new data model (performance)

• required: new consistency model (critical!)

21


• event-sourced data model – journal / views

• new data model – no indexes

• new consistency model – satisfies consistency

[Diagram: person P1 with journal entry JE #8 and views A and B; person P2 with journal entry JE #6 and views A and B]

22


• denormalized relationships

[Diagram: persons P1 and P2 with relationships R1–R5]

23



[Diagram: persons P1 and P2 with relationships R1–R5; R2 and R3 appear twice]

24



• exact duplication allows bidirectional traversal

[Diagram: persons P1 and P2 with relationships R1–R5; R2 and R3 appear under both persons. Labels: Person/Rels, Person/Rels, Person, Relationship, Person, wide query]
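The duplication idea can be sketched in a few lines. This is an in-memory illustration only: the `person_rels` structure, the function names, and the JSON encoding are assumptions made for the example, not the FamilySearch schema.

```python
import json

# Hypothetical stand-in for a table partitioned by person id:
# person id -> {relationship id -> serialized relationship}
person_rels = {}

def write_relationship(rel_id, person_a, person_b, rel_type):
    """Store byte-identical copies of one relationship under both persons."""
    blob = json.dumps(
        {"id": rel_id, "a": person_a, "b": person_b, "type": rel_type},
        sort_keys=True,
    ).encode("utf-8")
    for person_id in (person_a, person_b):
        person_rels.setdefault(person_id, {})[rel_id] = blob

def related_persons(person_id):
    """Traverse from one person using only that person's own partition."""
    others = set()
    for blob in person_rels.get(person_id, {}).values():
        rel = json.loads(blob)
        others.add(rel["b"] if rel["a"] == person_id else rel["a"])
    return others

write_relationship("R2", "P1", "P2", "couple")
write_relationship("R3", "P1", "P2", "parent-child")
print(related_persons("P1"))  # {'P2'}
print(related_persons("P2"))  # {'P1'}
```

Because each person's partition holds its own copy, traversal in either direction is a single-partition lookup rather than a scan over a global index.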

25


• change history is a core feature

• denormalized change history

• optimizes for displaying recent changes

JE #8

P1 P1 Change History View

1000s of changes (spread over multiple Cassandra cells)

Last 100-1000 changes (local to a single Cassandra cell)

26

Contents





27

Journal-based Consistency Model

Command Journal View View

View

Rough Process Flow

captures edits safely

stores edits canonically

view-optimized summations

28


Command

• write-once with quorum

• application to journal requires 3 tables: pending / completed / aborted

• idempotent application to journal
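One way to picture idempotent application is the sketch below. The slides name the three tables but not how they are used, so the dictionaries, function names, and state transitions here are assumptions for illustration.

```python
# Hypothetical in-memory stand-ins for the three command tables and the journal.
pending, completed, aborted = {}, {}, {}
journal = {}  # (person id, command id) -> journal entry

def submit(command_id, command):
    """Capture the command once; a retry with the same id changes nothing."""
    pending.setdefault(command_id, command)

def apply_to_journal(command_id):
    """Apply a pending command to every affected person's journal."""
    if command_id in completed or command_id in aborted:
        return  # already handled, nothing to do
    command = pending[command_id]
    for person_id in command["persons"]:
        # Same key and same value on every retry, so re-running is harmless.
        journal[(person_id, command_id)] = command["change"]
    completed[command_id] = command

def abort(command_id):
    """Mark a command that has not completed as aborted."""
    if command_id not in completed:
        aborted[command_id] = pending[command_id]

submit("c1", {"persons": ["P1", "P2"], "change": {"name": "John"}})
apply_to_journal("c1")
apply_to_journal("c1")  # an interrupted worker can safely retry
print(len(journal))  # 2
```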



29


Command Schema

• key: command v1 uuid (as text)

• value: blob (binary json)
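A command row of this shape could be built as follows. The slides say "binary json" without naming the encoding, so UTF-8 encoded JSON is used here purely as a stand-in, and `make_command_row` is a hypothetical helper.

```python
import json
import uuid

def make_command_row(action):
    """Return (key, value) for a command: v1 UUID as text, payload as bytes."""
    key = str(uuid.uuid1())  # time-based (version 1) UUID rendered as text
    # Assumption: plain UTF-8 JSON stands in for the unspecified binary JSON.
    value = json.dumps(action, sort_keys=True).encode("utf-8")
    return key, value

key, value = make_command_row({"attribution": {}, "change": "example"})
print(uuid.UUID(key).version)  # 1
```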



30


Journal

• write-once with quorum & C* batch

• denormalized byte-exact across affected persons & relationships

• each entry stored in separate cell (compaction required for fast journal reads)



31


Journal

• CmRDT (commutative replicated data type)

• partitions converge without conflict because each entry has a unique uuid
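The convergence argument can be illustrated with a toy merge. Each replica is modeled as a dictionary keyed by (partition key, command uuid); the keys, values, and `merge` function are illustrative assumptions, not the production code.

```python
def merge(replica_a, replica_b):
    """Union two journal replicas keyed by (partition key, command uuid)."""
    # Entries are write-once and uniquely keyed, so the same key always
    # carries the same value and the order of merging cannot matter.
    merged = dict(replica_a)
    merged.update(replica_b)
    return merged

a = {("KWZ3-P71", "uuid-1"): b"entry-1"}
b = {("KWZ3-P71", "uuid-1"): b"entry-1", ("KWZ3-P71", "uuid-2"): b"entry-2"}

print(merge(a, b) == merge(b, a))  # True
```

Merging is commutative and idempotent here only because no two different entries ever share a key, which is the role the unique uuid plays.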



32



View

Partition Key   Command UUID     Content (blob)
KWZ3-P71        eda6f967-0955…   { "attribution": {}, … } (binary json)
KWZ3-P71        6af8d90c-8f3a…   { "attribution": {}, … } (binary json)
KCDT-J59        fd35ac61-7def…   { "attribution": {}, … } (binary json)
KCDT-J59        b2db2fa5-da5f…   { "attribution": {}, … } (binary json)

33


View

• multiple views for multiple uses (person, person card, change history)

• populated by applying journal entries

• incrementally updated in steady state

• not canonical data, can be recalculated
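The relationship between journal and view can be sketched as a fold over journal entries. The entry shape (`seq`, `set`) and the function names are assumptions for this example; the point is only that an incrementally updated view and a view rebuilt from scratch should agree.

```python
def apply_entry(view, entry):
    """Return a new view with one journal entry applied."""
    return {
        "fields": {**view["fields"], **entry["set"]},
        "applied": entry["seq"],
    }

def rebuild_view(journal_entries):
    """Recalculate a view from scratch by replaying the journal in order."""
    view = {"fields": {}, "applied": 0}
    for entry in sorted(journal_entries, key=lambda e: e["seq"]):
        view = apply_entry(view, entry)
    return view

journal_p1 = [
    {"seq": 6, "set": {"name": "Jon Doe"}},
    {"seq": 8, "set": {"name": "John Doe", "birth": "1850"}},
]

# Steady state: the stored view already reflects entry 6, then entry 8 arrives.
stored = rebuild_view(journal_p1[:1])
incremental = apply_entry(stored, journal_p1[1])

print(incremental == rebuild_view(journal_p1))  # True
```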



34



View

P1 P1 Views

A B

35



View

[Diagram: journal entry JE #8 for person P1; JE #8 also appears alongside P1's views A and B]

36



View

[Diagram: person P1 with journal entry JE #8; views A and B, each with JE #8, alongside A (new) and B (new)]

37



View

P1 P1 Views

A B

38


View

• views have same schema as journal

• journal entries are written to view for incremental refresh

• core of the consistency model



39


View

• CvRDT (convergent replicated data type)

• partitions converge with conflict; resolved by full view refresh from canonical journal

• steady state: one view of a given type per entity
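A reader could apply this rule roughly as sketched below. The sketch assumes a conflict shows up as anything other than exactly one stored view of a type for an entity; that reading, the data shapes, and the function names are assumptions for illustration.

```python
def rebuild_view(journal_entries):
    """Recalculate view fields by replaying the canonical journal in order."""
    fields = {}
    for entry in sorted(journal_entries, key=lambda e: e["seq"]):
        fields.update(entry["set"])
    return fields

def read_view(stored_views, journal_entries):
    """Return (view, refreshed): refresh fully unless exactly one view exists."""
    if len(stored_views) == 1:
        return stored_views[0], False  # steady state, cheapest read
    # Zero or several competing views: discard them and rebuild from the journal.
    return rebuild_view(journal_entries), True

journal_p1 = [
    {"seq": 6, "set": {"name": "Jon"}},
    {"seq": 8, "set": {"name": "John"}},
]
conflicting = [{"name": "Jon"}, {"name": "John"}]

print(read_view(conflicting, journal_p1))  # ({'name': 'John'}, True)
```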



40



View

[Diagram: person P1 with journal entry JE #8; views A and B, each with JE #8, alongside A (new) and B (new)]

41


• Performance & Scale
  • lookup by partition key only, no indexes
  • any cross-entity change happens in duplicate on all
  • stored “current-state” views – cheapest possible read
  • custom views – tunable to different use cases
  • disposable views – able to tweak view over time

42


• Business Rule Enforcement
  • Read / Write / Read & Revert
  • pre-command checks prevent invalid changes
  • write with appropriate quorum ensures consistent write
  • post-command checks prevent business-rules conflicts
  • administrative revert marks command as “not applicable” and thereby causes full refresh which ignores changes
  • appropriate quorum: depending on the change, either LOCAL_QUORUM or EACH_QUORUM
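The difference between the two quorum levels comes down to which data centers must acknowledge a write. A small sketch of that arithmetic follows; the data center names and replication factors are made-up values, and `required_acks` is a hypothetical helper rather than a driver API.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def required_acks(level, rf_per_dc, local_dc):
    """Replica acknowledgements needed per data center for a write."""
    if level == "LOCAL_QUORUM":
        # Only the coordinator's own data center must reach a majority.
        return {local_dc: quorum(rf_per_dc[local_dc])}
    if level == "EACH_QUORUM":
        # Every data center must reach its own majority.
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    raise ValueError(f"unsupported level: {level}")

rf = {"dc-east": 3, "dc-west": 3}  # assumed topology for illustration
print(required_acks("LOCAL_QUORUM", rf, "dc-east"))  # {'dc-east': 2}
print(required_acks("EACH_QUORUM", rf, "dc-east"))   # {'dc-east': 2, 'dc-west': 2}
```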

43


• Strong consistency
  • command store – atomic capture of a single user action
  • command handling – idempotent writes to journal, picked up later even if interrupted
  • no global lock needed for optimistic concurrency
• Read after write
  • consistency ONE for normal reads
  • quorum when the client knows it’s refreshing after write
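The choice between ONE and quorum reads follows the usual overlap rule: a read is guaranteed to touch at least one replica that acknowledged the write when write acknowledgements plus read replicas exceed the replication factor. A sketch, assuming a replication factor of 3 (the slides do not state one):

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def read_overlaps_write(replication_factor, write_acks, read_replicas):
    """True when every read set must intersect every acknowledged write set."""
    return write_acks + read_replicas > replication_factor

RF = 3  # assumed for illustration
print(read_overlaps_write(RF, quorum(RF), 1))           # False: ONE may be stale
print(read_overlaps_write(RF, quorum(RF), quorum(RF)))  # True: quorum read after write
```

This matches the split on the slide: cheap ONE reads when slight staleness is acceptable, quorum reads when the client has just written and needs to see its own change.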

44


• Journal / View Concerns
  • native support for change history
  • no journal tombstones in steady state – write-once
  • blob schema implementable on any db engine that supports two-level keys (partition, composite)
  • consistency model implementable on any db engine that supports batches & quorum writes/reads
  • view tombstones on every write, biggest concern
  • leveled compaction?
  • WISH: size-tiered compaction with data locality hoisting

45

Contents





46

Experience with Cassandra

• tested Community 1.2 and 2.0

• fantastic performance

• easy cloud setup

• great developer response

• easy to bulk load through CQL3

• harder to get running inside AWS VPC

47


• Bulk import experience
  • 8.4B change log records => 5.8B journal entries (2.5TB lzo)
  • ‘hi1.4xlarge’ cluster (2x 1TB SSDs)
  • import through CQL was fast enough
  • 11h to import on 5-node cluster (5h on 30-node cluster)
  • 140k writes / sec, fed from 128 writer threads
  • 20 records / unlogged batch write, 1-2k record size
  • minimal post-import compaction (size-tiered)
  • ended up with 3.5-4TB on C* disk after import
  • OpsCenter – great visibility for tuning
  • Community – harder to automate repairs, etc.
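The import figures can be cross-checked with simple arithmetic. The sketch below assumes the 140k writes / sec rate was sustained for the whole import and applies to the 11-hour run; the slides do not say which cluster size that rate was measured on.

```python
journal_entries = 5.8e9      # from the slide
writes_per_second = 140_000  # from the slide
records_per_batch = 20       # from the slide
writer_threads = 128         # from the slide

# Time to write every journal entry at the quoted sustained rate.
import_hours = journal_entries / writes_per_second / 3600

# Derived rates implied by the quoted numbers.
batches_per_second = writes_per_second / records_per_batch
writes_per_thread = writes_per_second / writer_threads

print(round(import_hours, 1))  # 11.5
print(batches_per_second)      # 7000.0
print(writes_per_thread)       # 1093.75
```

At that rate the full journal takes about 11.5 hours, which is consistent with the quoted 11h import.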

48


• Full-scale load test experience
  • got to 25x our peak hourly load on 25-28-node cluster
  • production peak load included significant write load
  • working-set size was about 2M persons in a month
  • enabled row cache, ran almost entirely without disk access
  • bottlenecked on interconnect socket w/ round robin client
  • got 50% boost from token-aware, round robin client
  • OpsCenter – great visibility for tuning
  • Large SSD cluster – able to handle repair during scale tests
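Why token-aware routing relieves the interconnect can be shown with a toy ring. This sketch is a simplification under stated assumptions: node names are made up, MD5 stands in for the cluster's real partitioner, and replication is ignored so each key has a single owner.

```python
import bisect
import hashlib
import itertools

NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical 5-node cluster

def token(text):
    """Stand-in hash; a real cluster uses its configured partitioner."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

RING = sorted((token(node), node) for node in NODES)
TOKENS = [t for t, _ in RING]

def owner(partition_key):
    """First node clockwise from the key's token (single owner, no replicas)."""
    index = bisect.bisect_left(TOKENS, token(partition_key)) % len(RING)
    return RING[index][1]

def extra_hops(keys, pick_coordinator):
    """Requests whose coordinator must forward to another node."""
    return sum(1 for key in keys if pick_coordinator(key) != owner(key))

keys = [f"person-{i}" for i in range(200)]
cycle = itertools.cycle(NODES)

hops_round_robin = extra_hops(keys, lambda key: next(cycle))
hops_token_aware = extra_hops(keys, owner)  # client sends straight to the owner

print(hops_token_aware)  # 0
```

With plain round robin most requests land on a node that does not own the key and must be forwarded over the inter-node connection; a token-aware client avoids that forwarding.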

49


current system

cassandra impl (1x, 10x, 20x)

50


current system

cassandra impl (1x, 10x, 20x)

LOG SCALE!

51

Current Status

• still working on implementation & rollout

• migration, reconciliation, integration…

• consistency model code separate

52

Contents





Questions?

53

Contact Info

John Sumsion

Sr. Software Engineer

[email protected]

@jdsumsion

Thanks to the team at FamilySearch! esp. Randy & James for doing the model

Thanks to the awesome presenters & organizers at #CassandraSummit!
