MyHeritage Cassandra Meetup 2016


TRANSCRIPT

Page 1: MyHeritage Cassandra meetup 2016

PeopleStore - blazing-fast storage for 2.6 billion profiles, and other Cassandra use cases @MyHeritage

Ran Peled, Chief Architect. Tech Talk Teach, Dec 2016

Page 2: MyHeritage Cassandra meetup 2016

Who are we?

MyHeritage is a leading destination for discovering, preserving and sharing family history.

Recently added: DNA for genealogy

Page 3: MyHeritage Cassandra meetup 2016

Personal profiles in family trees


Page 7: MyHeritage Cassandra meetup 2016

Personal profiles in family trees – Sharding MySQL

Family trees: a complex network of people, each with personal info, life events, and connections to relatives. Many interconnected MySQL tables. Millions of daily updates.

Good response time for single family site access, using MySQL Database Sharding.

Over 650 shards, on >20 physical hosts, growing

[Diagram: 650+ shards across 20+ physical hosts; each shard holds whole family sites with their Individual, Family, ChildInFamily, Event, FamilyEvent, Tags, and Photos tables]

Page 8: MyHeritage Cassandra meetup 2016

The issue with RDBMS sharding

Problematic when multiple shards are needed at once. For example, to display search results and profile matches coming from many family trees.

Costly to scale for more readers

Options:
• Build a custom parallel-fetch aggregator service
• NoSQL

Page 9: MyHeritage Cassandra meetup 2016

Cassandra to the rescue

Cassandra recap:
• Key-value store
• Ring-based consistent-hashing cluster
• Support for clusters split between data centers
• Data redundancy and consistency at a user-controlled level
• Append-only storage, high write throughput

Page 10: MyHeritage Cassandra meetup 2016

PeopleStore

Page 11: MyHeritage Cassandra meetup 2016

PeopleStore: Overview

• Store 2.6 billion profiles (growing by over a million a day)
• Provide very fast read access
• Shadow the MySQL source of truth (at least for the foreseeable future)
• Data consistency is critical

• Store each person as one aggregated record in Cassandra, including ALL info for typical uses, to minimize nested/follow-up queries: get all the information needed at once

• Decision point: replicate relatives, or point to their records?


Page 13: MyHeritage Cassandra meetup 2016

PeopleStore: Architecture

[Diagram: Web Servers (PHP) call the PeopleStore microservices for multi-item fetches from the Cassandra cluster. The highly sharded MySQL RDBMS remains the source of truth; online flows push synchronous updates through the microservice, and a Hadoop cluster performs mass loading for the batch first load / reload.]

Page 14: MyHeritage Cassandra meetup 2016

PeopleStore: Schema

CREATE TABLE peoplestore.people (
    site_id int,
    tree_id int,
    individual_id int,
    adopted_child_in_family_id int,
    child_in_family_id int,
    foster_child_in_family_id int,
    gender text,
    is_alive boolean,
    privacy_level int,
    last_update int,
    loading_mode int,
    loading_time timestamp,
    thumbnail text,
    name text,
    events text,
    photos text,
    relatives text,
    PRIMARY KEY (site_id, tree_id, individual_id)
) WITH ... compaction = {'class': '...LeveledCompactionStrategy'};

6 hosts, RF=3

(The slide annotates the columns as three groups: ID, metadata, and JSON blobs.)

• JSON: flexibility of structure (stored as text, not using Cassandra 2.2's native JSON support)

• Split fields: flexibility to fetch only the fields needed (see the fetch sketch after this list)

• Not using a Collection for plural fields, due to a Cassandra limitation on using an IN clause on tables with Collection fields (a non-issue for us)

• Future: use User Defined Types
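Since the split JSON fields exist precisely so a flow can fetch only what it needs, a multi-item fetch might look like the following minimal sketch using the DataStax Java driver. The contact point, IDs, and field choice are illustrative, not from the deck:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.Arrays;
import java.util.List;

public class PeopleFetchExample {
    public static void main(String[] args) {
        // Hypothetical contact point; not from the deck.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("peoplestore")) {
            // Fetch only the JSON fields this flow needs (here: name and events).
            PreparedStatement ps = session.prepare(
                "SELECT individual_id, name, events FROM people "
                + "WHERE site_id = ? AND tree_id = ? AND individual_id IN ?");
            List<Integer> ids = Arrays.asList(17, 42, 99); // hypothetical profile IDs
            ResultSet rs = session.execute(ps.bind(1001, 1, ids));
            for (Row row : rs) {
                System.out.println(row.getInt("individual_id") + " -> " + row.getString("name"));
            }
        }
    }
}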

Page 15: MyHeritage Cassandra meetup 2016

PeopleStore: Schema


The relatives column holds only minimal relative info: ID + name. Another fetch is required for full relative data.

Page 16: MyHeritage Cassandra meetup 2016

PeopleStore: Schema


Started with Size-Tiered Compaction, which generated thousands of SSTables and slowed query time. Moving to Leveled Compaction solved the issue.

Page 17: MyHeritage Cassandra meetup 2016

PeopleStore: microservice

Clients
• Control exposure, read/write per flow
• Discover services by listing DNS SRV records (sketched below)
• Clients do round-robin on these services

Services
• A Spring Boot Java REST server
• Deployed as a Docker container managed by Mesos & Marathon
• Mesos manages DNS entries
• Mesos monitors service health
• Metrics sent to JMX

Failure recovery needed despite redundancy
• In write for consistency; in read for availability

[Diagram: Web Servers (PHP) round-robin across PeopleStore microservice instances (Java), discovered via Mesos + Marathon DNS, with a write-failure recovery path.]
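A minimal sketch of the SRV-based discovery plus client-side round-robin described above, using the JDK's built-in JNDI DNS provider. The service name and record handling are assumptions, not MyHeritage's actual configuration:

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.InitialDirContext;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SrvDiscovery {
    private final List<String> endpoints = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    // Resolve all SRV records for a service name, e.g. "_peoplestore._tcp.marathon.mesos" (hypothetical).
    public SrvDiscovery(String srvName) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial", "com.sun.jndi.dns.DnsContextFactory");
        InitialDirContext ctx = new InitialDirContext(env);
        Attribute srv = ctx.getAttributes(srvName, new String[] {"SRV"}).get("SRV");
        for (NamingEnumeration<?> e = srv.getAll(); e.hasMore(); ) {
            // Each SRV record reads: "priority weight port target"
            String[] parts = e.next().toString().split(" ");
            endpoints.add(parts[3] + ":" + parts[2]);
        }
    }

    // Round-robin over the discovered service instances.
    public String nextEndpoint() {
        return endpoints.get(Math.floorMod(next.getAndIncrement(), endpoints.size()));
    }
}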

Page 18: MyHeritage Cassandra meetup 2016

PeopleStore: Mass Loading

To bootstrap the system, and in case of major schema/logic changes, we had to load 2.2 billion person profiles at once.

Evaluated:
• Cassandra's sstableloader tool
• hdfs2cass from Spotify

Cons:
• Uses SSTableSimpleWriter and Cassandra streaming
• Very sensitive to the C* version

Selected: Hadoop + online Cassandra updates

Page 19: MyHeritage Cassandra meetup 2016

PeopleStore: Mass Loading with Hadoop

[Pipeline: MySQL → Extract and Aggregate (MySQL extractor + Pig flow, Avro output) → Load (Crunch + Cassandra driver) → Cassandra]

• Tested logged and unlogged BATCH writes; they do NOT help performance

• Had to implement write retries to reach 0 failures (sketched below)

• Collect stats into Hadoop counters

• Load time: 2.2 billion items, 6 Hadoop nodes, 6 C* nodes, ~30k writes per second, ~17 hours of loading plus hours of compaction time. The impact on read latency was very reasonable.
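The deck does not show the retry code; here is a minimal sketch of the idea, bounding the attempts with a simple backoff around the driver's unchecked exceptions. The attempt count and delay policy are assumptions:

import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.DriverException;

public class RetryingWriter {
    // Retry a write until it succeeds or maxAttempts is exhausted.
    public static void writeWithRetry(Session session, Statement stmt, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                session.execute(stmt);
                return; // success
            } catch (DriverException e) { // e.g. write timeout, no host available
                if (attempt >= maxAttempts) {
                    throw e; // give up; the caller can record the failure in a Hadoop counter
                }
                Thread.sleep(100L * attempt); // linear backoff (assumed policy)
            }
        }
    }
}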

Page 20: MyHeritage Cassandra meetup 2016

PeopleStore: Mass loading with online updates

Mass loading takes time, and online updates keep arriving in the meantime. The batch load must not overwrite newer online updates.

Tested: lightweight transactions: INSERT ... IF NOT EXISTS / UPDATE ... IF update_time < <value>

Result: major slowdown, due to massive read-before-write

Solution: an updated_people table: a small table indicating only the people that changed online while batch loading is running. Read-before-write is viable because the table is small, and >99% of queries return an empty set. Insignificant slowdown. (A sketch of the guard follows.)

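The deck names the updated_people table but not its shape; a hedged sketch of the guard, assuming the table mirrors the people primary key (the column names are assumptions):

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class GuardedBatchLoader {
    private final Session session;
    private final PreparedStatement check;

    public GuardedBatchLoader(Session session) {
        this.session = session;
        // Assumed schema: updated_people keyed like people (site_id, tree_id, individual_id).
        this.check = session.prepare(
            "SELECT individual_id FROM peoplestore.updated_people "
            + "WHERE site_id = ? AND tree_id = ? AND individual_id = ?");
    }

    // Write the batch row only if no newer online update was recorded.
    // The table is small and >99% of these reads return an empty set, so the check is cheap.
    public void loadPerson(BoundStatement insertPerson, int siteId, int treeId, int individualId) {
        boolean changedOnline =
            session.execute(check.bind(siteId, treeId, individualId)).one() != null;
        if (!changedOnline) {
            session.execute(insertPerson);
        }
    }
}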

Page 21: MyHeritage Cassandra meetup 2016

PeopleStore: JVM Tuning

Experienced long GC pauses in the Cassandra nodes

• Upgraded from Java 1.7 to 1.8.0_65
• Switched from the CMS to the G1 garbage collector: a major improvement. G1 is the default in Cassandra 3.0.

Tune JVM params (/etc/cassandra/conf/cassandra-env.sh)

See https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

# highlights:
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16"

Page 22: MyHeritage Cassandra meetup 2016

PeopleStore: Other issues

Experienced unexplained missing rows on read (CASSANDRA-10801). We upgraded the Cassandra nodes from 2.1.11 to 2.1.12 and the Java driver from 2.1.5 to 2.1.9, which solved the issue.

Cassandra Driver: Spring @Query annotations cannot handle “IN” queries. Instead, we used CassandraTemplate to build a native query.
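A hedged sketch of that workaround. The Person entity and the surrounding wiring are illustrative; select(String, Class) is from Spring Data Cassandra's CassandraOperations:

import java.util.List;
import java.util.stream.Collectors;
import org.springframework.data.cassandra.core.CassandraOperations;

public class PeopleRepository {
    private final CassandraOperations template;

    public PeopleRepository(CassandraOperations template) {
        this.template = template;
    }

    // Build the IN clause natively, since @Query could not express it for us.
    // Interpolation is safe here because the values are plain integers.
    public List<Person> findPeople(int siteId, int treeId, List<Integer> individualIds) {
        String ids = individualIds.stream().map(String::valueOf).collect(Collectors.joining(", "));
        String cql = String.format(
            "SELECT * FROM peoplestore.people WHERE site_id = %d AND tree_id = %d AND individual_id IN (%s)",
            siteId, treeId, ids);
        return template.select(cql, Person.class); // Person is a hypothetical mapped entity
    }
}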

Page 23: MyHeritage Cassandra meetup 2016

PeopleStore: Results

Reduced latency:
• Matches page: over 50% reduction in load time
• Search results page: 40% reduction in load time
• 90% of microservice calls < 100ms

Reduced load on the MySQL databases:
• From hundreds of queries per page to just a few

Page 24: MyHeritage Cassandra meetup 2016

AccountStore

Page 25: MyHeritage Cassandra meetup 2016

AccountStore needs: fast user properties and counters

EVERY page on myheritage.com needs access to:

• Summarized user (account) information from multiple sources, for marketing tracking, affiliate programs, and retargeting. Includes properties and counters coming from various sources.

• A/B test data: participation and variant selection, for guests and registered members.

Latency: less than a 10ms slowdown for any page. Data must be fresh. We store data for guests too: lots of data. The data must also be available to BI systems.

Aggregating the data at runtime is too slow, so we must maintain live aggregated data, under a high update rate.

Example:

var gtmDataLayer = [{
  "site_plan": "premium-plus",
  "data_subscription": "no-data-subscription",
  "active_paying": "not-actively-paying",
  "site_visits": 3509,
  "last_mobile_sighting": "2016-02-07 11:10:25"
  ...
}];

Page 26: MyHeritage Cassandra meetup 2016

AccountStore: Overview

Use Cassandra to "store it as you read it": keep updated aggregate information and counters on users and guests.

Event subscribers update the aggregate data online as it changes, in two tables: data and counters (counter columns must live in their own table, a C* limitation). For example, num_individuals_in_trees changes online as a family tree is modified, and subscription_expiration_date changes when a user becomes a paying subscriber.

A separate Cassandra table maps guests to users as they convert and register.
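For instance, bumping one of the counters from the schema on page 28 is a single CQL update; the subscriber wiring around it is assumed (minimal sketch, DataStax Java driver):

import com.datastax.driver.core.Session;
import java.util.UUID;

public class AccountCounterSubscriber {
    private final Session session;

    public AccountCounterSubscriber(Session session) {
        this.session = session;
    }

    // Called when a visit event arrives (the event source is assumed).
    public void onVisit(UUID accountUid) {
        // Counter columns live in a dedicated table; increments are the only way to write them.
        session.execute(
            "UPDATE accounts.account_store_counters "
            + "SET num_visits = num_visits + 1 WHERE account_uid = ?",
            accountUid);
    }
}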

Page 27: MyHeritage Cassandra meetup 2016

AccountStore and A/B test cluster topology

Requirement: allow BI systems to collect data, without putting BI load on the production cluster.

Solution: create a fictitious data center in the cluster. Both logical data centers live in the same physical datacenter.

[Diagram: App Cassandra data center serving application clients; BI Cassandra data center serving the BI system.]
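In CQL terms, the fictitious data center is just a second entry in the keyspace's NetworkTopologyStrategy, and BI clients pin themselves to it. The DC names, replication counts, and contact point below are assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class BiTopologyExample {
    public static void main(String[] args) {
        // BI clients route their queries only to the "BI" data center (name assumed).
        Cluster biCluster = Cluster.builder()
                .addContactPoint("bi-node-1") // hypothetical host
                .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("BI"))
                .build();
        try (Session session = biCluster.connect()) {
            // Replicate the keyspace into both logical data centers (counts assumed).
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS accounts WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'APP': 3, 'BI': 1}");
        }
    }
}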

Page 28: MyHeritage Cassandra meetup 2016

AccountStore: Schema

Using secondary indexes for non-typical flows.

Converted guests keep their UUID, plus a mapping to/from account_id.

CREATE TABLE accounts.account_store_data (
    account_uid uuid PRIMARY KEY,
    creation_time timestamp,
    device_types set,
    highest_site_plan int,
    last_visit timestamp,
    . . .
) WITH ...;

CREATE TABLE accounts.account_id_guest_id (
    account_id int,
    guest_id ascii,
    guest_creation_time timestamp,
    updated_at timestamp,
    uuid uuid,
    PRIMARY KEY ((account_id, guest_id))
) WITH ...;
CREATE INDEX account_id_guest_id_updated_at_idx ON accounts.account_id_guest_id (updated_at);
CREATE INDEX account_id_guest_id_uuid_idx ON accounts.account_id_guest_id (uuid);

CREATE TABLE accounts.account_store_counters (
    account_uid uuid PRIMARY KEY,
    num_individuals_in_all_trees counter,
    num_visits counter,
    . . .
) WITH ...;

Page 29: MyHeritage Cassandra meetup 2016

A/B tests

Scale: millions of active users, hundreds of active experiments: billions of rows.

Latency: must not slow down the application; many pages have multiple experiments active on them.

Must allow time-based collection into BI systems.

Classic implementation: sharded MySQL. We already have a cluster sharded by Family Site ID; we do not want another MySQL cluster sharded by User ID.

Decision: a natural addition to the AccountStore Cassandra cluster.

Page 30: MyHeritage Cassandra meetup 2016

AccountStore: A/B tests schema

CREATE TABLE ab_test.member_to_experiment_ts (
    uuid_bucket int,
    day int,
    hour int,
    experiment_id int,
    uuid uuid,
    created_at timestamp,
    variant_id int,
    PRIMARY KEY ((uuid_bucket, day, hour), experiment_id, uuid)
) WITH ...;

CREATE TABLE ab_test.member_to_experiment (
    account_uid uuid,
    experiment_id int,
    created_at timestamp,
    created_at_ts bigint,
    variant_id int,
    PRIMARY KEY (account_uid, experiment_id)
) WITH ...;
CREATE INDEX member_to_experiment_experiment_id_idx ON ab_test.member_to_experiment (experiment_id);

Simple lookup of a user's experiment variant (member_to_experiment); secondary lookup by experiment via the index.

Preventing hotspots in the time-based table: uuid bucketing spreads the partitioning, and reading requires going over all buckets (see the sketch below).

Full dump: using sstable2json
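A minimal sketch of that trade-off: writes hash the uuid into one of N buckets so a single (day, hour) does not become a hot partition, and time-based readers must then sweep every bucket. N and the hash choice are assumptions:

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class AbTestBuckets {
    static final int NUM_BUCKETS = 16; // hypothetical bucket count

    // Writers derive the bucket from the uuid so load spreads across the ring.
    static int bucketOf(UUID uuid) {
        return Math.floorMod(uuid.hashCode(), NUM_BUCKETS);
    }

    // Time-based collection (e.g. into BI) has to visit every bucket for the hour.
    static List<Row> readHour(Session session, int day, int hour) {
        PreparedStatement ps = session.prepare(
            "SELECT experiment_id, uuid, variant_id FROM ab_test.member_to_experiment_ts "
            + "WHERE uuid_bucket = ? AND day = ? AND hour = ?");
        List<Row> rows = new ArrayList<>();
        for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
            rows.addAll(session.execute(ps.bind(bucket, day, hour)).all());
        }
        return rows;
    }
}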

Page 31: MyHeritage Cassandra meetup 2016

Performance

Cassandra:
• Write: 99% < 1ms
• Read: 99% < 2ms

App:
• 99% < 6ms

Page 32: MyHeritage Cassandra meetup 2016

Other Cassandra projects

• Activity feed

• Metrics using OpenTSDB

• Titan Graph Database

Page 33: MyHeritage Cassandra meetup 2016

[email protected]

(and… we are hiring!)

Questions? Comments?