cassandra summit 2014: huge online genealogical database driven by cassandra

Huge Online Genealogical Database Driven By Cassandra

Cassandra Summit 2014

John Sumsion

Outline

• Introduction to FamilySearch Family Tree

• Outline of Cassandra reimplementation

• Journal-based Consistency Model

• Experience with Cassandra

What is FamilySearch?

FamilySearch.org website

Very large single pedigree (Family Tree)

Largest collection of free genealogical records

Largest genealogical library

Family History Department of the Church of Jesus Christ of Latter-day Saints (known as Mormons)

Why does FamilySearch exist?

Visit http://mormon.org/family-history/

Family Tree: Where it fits

[Diagram: Records, Indexing, Family Tree, Memories, Community]

Record Preservation

Neglect

Disasters (e.g. WWII)

Record Preservation (continued)

• 100 million images published online / year

Indexing

3.5 billion indexed records – 35M / month

Turns this… …into this!

Memories

Community

Family Tree Data

Family Tree:

• 900M+ person records, open-edit

• 500M+ relationships, open-edit

• 8.4B change log entries, 100M+ per quarter

• Dynamic OLTP system

• Data-dependent performance issues

Family Tree: Example 9 Gen Pedigree

up to 511 person slots. Dynamic content!
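The 511 figure is what a full binary pedigree gives: generation g holds 2^(g-1) slots, so nine generations (the root person plus eight generations of ancestors) add up to 2^9 - 1. A quick check, with an illustrative function name that does not come from the talk:

```python
def pedigree_slots(generations: int) -> int:
    """Number of person slots in a full pedigree chart of the given depth.

    Generation 1 is the root person, generation 2 the two parents, and so
    on, so generation g contributes 2 ** (g - 1) slots.
    """
    return sum(2 ** (g - 1) for g in range(1, generations + 1))


print(pedigree_slots(9))  # 511
```

The same formula gives 31 for five generations, which may be where the per-section figure in the pedigree app comes from.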

Family Tree: Example Pedigree App

31+ persons per section. Dynamic content!

Family Tree: Example Ancestor Page

10+ persons in families, 100-1000+ changes. Dynamic content!

Family Tree: Example Change History

100-1000+ changes. Dynamic content!

Performance & Scale

• Slow page views

  • pedigree (500-3000ms for 3 generations)

  • change history (2000+ms for first page of changes)

  • large family view

• Query problems

  • relationships connect persons, range scan by person id

  • every person => person traversal is a 200-300M btree scan (global index)

  • change history queries traverse an 8+B btree (global index)

Performance & Scale

• Query performance problems

[Diagram: Pedigree (Person, Relationship, Person) and Change History lookups each require a wide range scan]

Cassandra Reimplementation

• selected Cassandra after extensive testing

• full data scale proof-of-concept & tests

• required: new data model (performance)

• required: new consistency model (critical!)

• event-sourced data model – journal / views

• new data model – no indexes

• new consistency model – satisfies consistency

[Diagram: P1 → P1 Views; P2 → P2 Views]

• denormalized relationships

• exact duplication allows bidirectional traversal

[Diagram: Person, Relationship, Person stored as Person/Rels; wide query over P1, P2]
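A minimal sketch of the duplication idea, using an in-memory dict in place of Cassandra partitions. The record fields, the relationship type string and the helper names are hypothetical. The point it illustrates is that writing a byte-identical copy of the relationship under both persons lets either side be traversed with a single partition lookup and no index.

```python
import json
from collections import defaultdict

# Stand-in for a table keyed by person id (the partition key).
# Each partition holds the relationships that touch that person.
partitions = defaultdict(list)


def write_relationship(person_a, person_b, rel_type):
    """Store one byte-identical copy of the relationship under each person."""
    record = json.dumps(
        {"a": person_a, "b": person_b, "type": rel_type}, sort_keys=True
    ).encode("utf-8")
    for person_id in (person_a, person_b):
        partitions[person_id].append(record)


def related_persons(person_id):
    """Traverse from one person using only that person's partition."""
    others = []
    for raw in partitions[person_id]:
        rel = json.loads(raw)
        others.append(rel["b"] if rel["a"] == person_id else rel["a"])
    return others


write_relationship("P1", "P2", "parent-child")
print(related_persons("P1"))  # ['P2']
print(related_persons("P2"))  # ['P1']
```

The cost is a second write per relationship; the benefit is that no read ever needs a global index.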

• change history is a core feature

• denormalized change history

• optimizes for displaying recent changes

[Diagram: P1 → P1 Change History View]

1000s of changes (spread over multiple Cassandra cells)

Last 100-1000 changes (local to a single Cassandra cell)

Journal-based Consistency Model

Rough Process Flow: Command → Journal → Views

• Command: captures edits safely

• Journal: stores edits canonically

• Views: view-optimized summations

Command

• write-once with quorum

• application to journal requires 3 tables: pending / completed / aborted

• idempotent application to journal
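One way to read the bullets above, sketched with dicts standing in for the three command tables and the journal. The command shape and the function names are assumptions made for illustration. Because journal rows are keyed by (person id, command uuid), repeating the application after an interruption rewrites the same rows rather than adding new ones.

```python
pending = {}    # command uuid -> command, captured but not yet applied
completed = {}  # command uuid -> command, fully applied to the journal
aborted = {}    # command uuid -> command, rejected and never applied
journal = {}    # (person id, command uuid) -> journal entry content


def capture(command_id, command):
    """Write-once capture of a user action."""
    pending.setdefault(command_id, command)


def apply_to_journal(command_id, fail_after=None):
    """Apply a pending command to the journal.

    `fail_after` simulates a crash after that many journal writes.
    """
    if command_id in completed or command_id in aborted:
        return  # already resolved, nothing to do
    command = pending[command_id]
    for count, person_id in enumerate(command["persons"]):
        if fail_after is not None and count >= fail_after:
            raise RuntimeError("simulated crash")
        # Same key on every attempt, so a retry overwrites with equal content.
        journal[(person_id, command_id)] = command["change"]
    completed[command_id] = pending.pop(command_id)


capture("cmd-1", {"persons": ["P1", "P2"], "change": "add relationship"})
try:
    apply_to_journal("cmd-1", fail_after=1)  # crashes after the first write
except RuntimeError:
    pass
apply_to_journal("cmd-1")  # retry picks the command up from `pending`
apply_to_journal("cmd-1")  # a further retry is a no-op
print(len(journal))        # 2
```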

Command Schema

• key: command v1 uuid (as text)

• value: blob (binary json)
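A small sketch of building a row in that shape. The slides do not name the binary JSON encoding, so UTF-8 encoded JSON is used here as a stand-in, and the function name and the payload fields other than `attribution` are hypothetical.

```python
import json
import uuid


def new_command_row(payload):
    """Build a (key, value) pair shaped like the command schema.

    key:   a version 1 (time-based) UUID rendered as text
    value: the serialized command as an opaque blob
    """
    key = str(uuid.uuid1())
    # Assumption: UTF-8 JSON stands in for the unspecified binary JSON format.
    value = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return key, value


key, value = new_command_row({"attribution": {}, "action": "example"})
print(uuid.UUID(key).version)  # 1
print(type(value).__name__)    # bytes
```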

Journal

• write-once with quorum & C* batch

• denormalized byte-exact across affected persons & relationships

• each entry stored in separate cell (compaction required for fast journal reads)

Journal

• CmRDT (commutative replicated data type)

• partitions converge without conflict because of unique uuid

Partition Key   Command UUID     Content (blob)
KWZ3-P71        eda6f967-0955…   { "attribution": {}, … } (binary json)
KWZ3-P71        6af8d90c-8f3a…   { "attribution": {}, … } (binary json)
KCDT-J59        fd35ac61-7def…   { "attribution": {}, … } (binary json)
KCDT-J59        b2db2fa5-da5f…   { "attribution": {}, … } (binary json)
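The convergence claim can be illustrated with a toy replica: if every journal write is an insert under a unique command uuid and rows are never modified, the writes commute, so delivery order and duplicate delivery do not affect the final partition. The uuids and contents below are placeholders, not values from the talk.

```python
from itertools import permutations


def apply_entry(partition, entry):
    """Apply one journal write (command uuid, content) to a replica.

    Entries are write-once and uniquely keyed, so applying one is an
    insert that never replaces existing content.
    """
    command_uuid, content = entry
    updated = dict(partition)
    updated.setdefault(command_uuid, content)
    return updated


entries = [("uuid-1", b"entry-1"), ("uuid-2", b"entry-2"), ("uuid-3", b"entry-3")]

# Deliver the same writes in every possible order, with one duplicate
# delivery each time, and collect the resulting replica states.
states = []
for order in permutations(entries):
    replica = {}
    for entry in order + (order[0],):
        replica = apply_entry(replica, entry)
    states.append(replica)

print(all(state == states[0] for state in states))  # True
```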

• multiple views for multiple uses (person, person card, change history)

• populated by applying journal entries

• incrementally updated in steady state

• not canonical data, can be recalculated

[Diagram: P1 → P1 Views; journal entry JE #8 applied; A (new), B (new)]

• views have same schema as journal

• journal entries are written to view for incremental refresh

• core of the consistency model

• CvRDT (convergent replicated data type)

• partitions converge with conflict; resolved by full view refresh from canonical journal

• steady state: one view of a given type per entity
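A rough sketch of how an incrementally maintained view and a full refresh could fit together. Everything specific here is assumed for illustration: the entry shape (timestamp, field, value), the latest-timestamp-wins summation rule, the sample data, and detecting divergence by comparing the uuids recorded in the view with those in the journal. The slides only state that views store journal entries, are updated incrementally, and can be rebuilt from the journal.

```python
def apply_entry(state, entry):
    """Apply one journal entry to a view state (latest timestamp wins)."""
    ts, field, value = entry
    if field not in state or ts >= state[field][0]:
        state[field] = (ts, value)


def full_refresh(journal_partition):
    """Recompute a view state from the canonical journal."""
    state = {}
    for entry in journal_partition.values():
        apply_entry(state, entry)
    return state


def apply_to_view(view, command_uuid, entry):
    """Incremental refresh: record the entry in the view, update the summary."""
    view["applied"][command_uuid] = entry
    apply_entry(view["state"], entry)


def reconcile(view, journal_partition):
    """Rebuild the view if it disagrees with the journal, then summarize it."""
    if view["applied"].keys() != journal_partition.keys():
        view["applied"] = dict(journal_partition)
        view["state"] = full_refresh(journal_partition)
    return {field: value for field, (_ts, value) in view["state"].items()}


journal = {  # canonical: command uuid -> (timestamp, field, value)
    "uuid-1": (1, "name", "John"),
    "uuid-2": (2, "birth", "1850"),
    "uuid-3": (3, "name", "John Smith"),
}

view = {"applied": {}, "state": {}}
apply_to_view(view, "uuid-1", journal["uuid-1"])
apply_to_view(view, "uuid-3", journal["uuid-3"])  # uuid-2 never reached the view

summary = reconcile(view, journal)
print(summary == {"name": "John Smith", "birth": "1850"})  # True
```

Since the view is not canonical, throwing it away and recomputing is always safe; the journal is the only data that has to be right.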

• Performance & Scale

  • lookup by partition key only, no indexes

  • any cross-entity change happens in duplicate on all affected entities

  • stored “current-state” views – cheapest possible read

  • custom views – tunable to different use cases

  • disposable views – able to tweak view over time

• Business Rule Enforcement

  • Read / Write / Read & Revert

  • pre-command checks prevent invalid changes

  • write with appropriate quorum ensures consistent write

  • post-command checks prevent business-rules conflicts

  • administrative revert marks command as “not applicable” and thereby causes full refresh which ignores changes

  • appropriate quorum: depending on the change, either LOCAL_QUORUM or EACH_QUORUM
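The Read / Write / Read & Revert sequence might look roughly like the sketch below. It is an interpretation, not the FamilySearch code: the store is a plain dict, the two-parent rule is an invented example of a business rule, the `interleaved` parameter simulates a concurrent writer, and marking the losing change `not_applicable` borrows the flag described for administrative reverts.

```python
def active(changes):
    """Changes that a full refresh would still take into account."""
    return [c for c in changes if not c.get("not_applicable")]


def submit(store, change, rule, interleaved=()):
    """Run one change through pre-check, write, and post-check.

    store:       person id -> list of changes (stand-in for the journal)
    rule:        callable(changes) -> True when the business rule holds
    interleaved: changes from a simulated concurrent writer that land
                 between the pre-check and this write
    """
    current = store.setdefault(change["person"], [])
    # Read: the pre-command check rejects obviously invalid changes.
    if not rule(active(current) + [change]):
        return "rejected"
    current.extend(interleaved)
    # Write: a real cluster would use LOCAL_QUORUM or EACH_QUORUM here.
    current.append(change)
    # Read & Revert: the post-command check catches conflicts with
    # writes that raced past the pre-check.
    if not rule(active(current)):
        change["not_applicable"] = True
        return "reverted"
    return "applied"


def at_most_two_parents(changes):
    return sum(c["kind"] == "add_parent" for c in changes) <= 2


store = {}
results = [
    submit(store, {"person": "P1", "kind": "add_parent"}, at_most_two_parents)
    for _ in range(3)
]
print(results)  # ['applied', 'applied', 'rejected']

store["P2"] = [{"person": "P2", "kind": "add_parent"}]
racing = {"person": "P2", "kind": "add_parent"}
concurrent = [{"person": "P2", "kind": "add_parent"}]
outcome = submit(store, racing, at_most_two_parents, interleaved=concurrent)
print(outcome)  # reverted
```

No lock is taken at any point; the second read is what turns an optimistic write into an enforced rule.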

• Strong consistency

  • command store – atomic capture of a single user action

  • command handling – idempotent writes to journal, picked up later even if interrupted

  • no global lock needed for optimistic concurrency

• Read after write

  • consistency ONE for normal reads

  • quorum when the client knows it’s refreshing after write
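The reason a quorum read is needed right after a quorum write is the usual overlap condition: a read is guaranteed to touch at least one replica that acknowledged the write only when write acknowledgements plus read replicas exceed the replication factor. A small check, assuming a replication factor of 3 (the slides do not state one):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_write(rf: int, write_acks: int, read_replicas: int) -> bool:
    """True when every read set must overlap every acknowledged write set."""
    return write_acks + read_replicas > rf


rf = 3  # assumed replication factor
print(read_sees_write(rf, quorum(rf), quorum(rf)))  # True:  2 + 2 > 3
print(read_sees_write(rf, quorum(rf), 1))           # False: 2 + 1 = 3
```

Reads at ONE are cheaper and are fine when slightly stale data is acceptable, which matches the split described above.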

• Journal / View Concerns

  • native support for change history

  • no journal tombstones in steady state – write-once

  • blob schema implementable on any db engine that supports two-level keys (partition, composite)

  • consistency model implementable on any db engine that supports batches & quorum writes/reads

  • view tombstones on every write, biggest concern

  • leveled compaction?

  • WISH: size-tiered compaction with data locality hoisting

Experience with Cassandra

• tested Community 1.2 and 2.0

• fantastic performance

• easy cloud setup

• great developer response

• easy to bulk load through CQL3

• harder to get running inside AWS VPC

• Bulk import experience

  • 8.4B change log records => 5.8B journal entries (2.5TB lzo)

  • ‘hi1.4xlarge’ cluster (2x 1TB SSDs)

  • import through CQL was fast enough

  • 11h to import on 5-node cluster (5h on 30-node cluster)

  • 140k writes / sec, fed from 128 writer threads

  • 20 records / unlogged batch write, 1-2k record size

  • minimal post-import compaction (size-tiered)

  • ended up with 3.5-4TB on C* disk after import

• OpsCenter – great visibility for tuning

• Community – harder to automate repairs, etc.
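The bulk import figures are roughly self-consistent, as a back-of-the-envelope check shows. It assumes the 140k writes / sec rate counts individual records and applies to the 11h run on the 5-node cluster, which the slides do not spell out.

```python
journal_entries = 5.8e9    # journal entries to import
writes_per_sec = 140_000   # reported write rate (assumed to count records)
records_per_batch = 20     # records per unlogged batch
writer_threads = 128

hours = journal_entries / writes_per_sec / 3600
batches_per_sec = writes_per_sec / records_per_batch
batches_per_thread = batches_per_sec / writer_threads

print(f"{hours:.1f} h")                           # 11.5 h
print(f"{batches_per_sec:.0f} batches/s")         # 7000 batches/s
print(f"{batches_per_thread:.1f} per thread/s")   # 54.7 per thread/s
```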

• Full-scale load test experience

  • got to 25x our peak hourly load on 25-28-node cluster

  • production peak load included significant write load

  • working-set size was about 2M persons in a month

  • enabled row cache, ran almost entirely without disk access

  • bottlenecked on interconnect socket w/ round robin client

  • got 50% boost from token-aware, round robin client

• OpsCenter – great visibility for tuning

• Large SSD cluster – able to handle repair during scale tests
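Why token-aware routing relieves the interconnect can be shown with a toy ring. This is a deliberate simplification and not how any particular driver is implemented: one replica per key, no virtual nodes, MD5 as an arbitrary stand-in for the partitioner, and hypothetical node names. A plain round-robin client often picks a coordinator that does not own the key and must forward the request; a token-aware client sends it straight to the owner.

```python
import hashlib
from bisect import bisect_left
from itertools import cycle

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def token(value):
    """Map a string to a ring position (MD5 is an arbitrary stand-in)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


RING = sorted((token(node), node) for node in NODES)
TOKENS = [t for t, _ in RING]


def owner(partition_key):
    """The first node clockwise from the key's token owns the key."""
    index = bisect_left(TOKENS, token(partition_key)) % len(RING)
    return RING[index][1]


def extra_hops(keys, pick_coordinator):
    """Count requests whose coordinator must forward to another node."""
    return sum(pick_coordinator(key) != owner(key) for key in keys)


keys = [f"person-{i}" for i in range(1000)]
round_robin = cycle(NODES)

hops_round_robin = extra_hops(keys, lambda key: next(round_robin))
hops_token_aware = extra_hops(keys, owner)
print(hops_round_robin, hops_token_aware)  # token-aware count is 0
```

With more than one replica per key, a token-aware client can still rotate among the replicas of each key, which is presumably what "token-aware, round robin" refers to.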

[Charts: current system vs. Cassandra implementation (1x, 10x, 20x); log scale]

Current Status

• still working on implementation & rollout

• migration, reconciliation, integration…

• consistency model code separate

Questions?

Contact Info

John Sumsion

Sr. Software Engineer

sumsionjg@familysearch.org

@jdsumsion

Thanks to the team at FamilySearch! esp. Randy & James for doing the model

Thanks to the awesome presenters & organizers at #CassandraSummit!
