Cassandra Summit 2014: Launching PlayStation 4 with Apache Cassandra

Posted on 27-Nov-2014

DESCRIPTION

Presenters: Alexander Filipchik and Dustin Pham, Staff Software Engineers at Sony Network Entertainment

Since the launch of the PlayStation 4, many PSN features have been delivered using Cassandra. We will talk about our experience launching one of the most popular gaming consoles in the world on well over 300 nodes:
- Why we picked Cassandra
- Exactly which PSN features for PS4 are powered by Cassandra
- The infrastructure used to deploy our clusters
- How we monitor system health
- How we design, test and deploy
- Issues we faced and lessons learned along the way

TRANSCRIPT

Launching PS4 with Cassandra

Introduction •  Alexander Filipchik – Staff Software Engineer at SNEI

•  Dustin Pham – Staff Software Engineer at SNEI

Agenda •  Journey towards Cassandra •  Cassandra-backed PS4 Features •  Ops-y Stuff •  Lessons learned

Journey towards Cassandra

Challenges •  Small Team •  Legacy Support •  Hardware Deadline •  Scaling @ Peak Time

Why Cassandra •  Strong community •  Horizontally scalable architecture •  Good performance •  Cost effective •  New adventure :)

PS4 Features backed by Cassandra

Cassandra-backed PS4 features

•  What’s New •  Video Library •  My Library •  PS Now •  Notifications •  LiveArea •  Store catalog •  Pre-order •  PS Plus •  Recommendations •  Remote Download •  Share •  Authentication •  +more

Ops-y Stuff

Infrastructure •  Hosted in cloud and physical DCs •  Several hundred nodes and growing •  Cluster by feature •  Vnodes and Assigned token clusters •  Astyanax Client

Stats for PS4 cloud nodes •  Data throughput: Gigabytes / sec •  Cassandra read/writes: > 200,000 / sec •  Data size: tens of terabytes •  10M PS4 and 80M PS3 sold

Clusters •  Cluster per Read/Write pattern initially •  Now use cluster per feature •  Seeds referenced by DNS names •  Size Tiered compaction •  Manual compactions for some CFs

A typical node •  m2.4xl + i2.2xl •  2 ephemeral disks (~ 2 x 800 GB) •  Commit log on root partition •  Topology managed in the topology file, managed by Chef
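The slide does not name the file, but a "topology file" in Cassandra usually means `cassandra-topology.properties`, which `PropertyFileSnitch` reads as one `IP=DC:RACK` line per node plus an optional `default=` fallback. A minimal sketch of how a provisioning step (Chef, in the talk) might render it from a node inventory follows; the `Node` type, the `render_topology` helper, and the addresses are hypothetical.

```python
from typing import Iterable, NamedTuple


class Node(NamedTuple):
    """One Cassandra host in a hypothetical provisioning inventory."""
    ip: str
    dc: str
    rack: str


def render_topology(nodes: Iterable[Node], default_dc: str, default_rack: str) -> str:
    """Hypothetical helper: render PropertyFileSnitch-style lines, one
    'IP=DC:RACK' entry per node (sorted for a stable file), plus a
    'default=' fallback for hosts that are not listed explicitly."""
    lines = [f"{n.ip}={n.dc}:{n.rack}" for n in sorted(nodes, key=lambda n: n.ip)]
    lines.append(f"default={default_dc}:{default_rack}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    inventory = [
        Node("10.0.2.10", "us-east", "1c"),  # placeholder addresses
        Node("10.0.1.10", "us-east", "1a"),
    ]
    print(render_topology(inventory, "us-east", "1a"), end="")
```

Sorting the entries keeps the rendered file identical on every node regardless of inventory order, which matters because every node should see the same topology.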

AWS •  Nodes are interleaved between AZs –  Replication factor spreads data across AZs –  Minimizes downtime due to an AZ outage

(Diagram: nodes interleaved across Availability Zone A and Availability Zone C)
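Why interleaving matters can be seen with a simplified ring model. Assuming SimpleStrategy-style placement (the owning node plus the next RF - 1 nodes clockwise, ignoring racks), interleaved tokens make consecutive nodes alternate AZs, so RF ≥ 2 always puts copies in both AZs; if one AZ's nodes sit together on the ring, some keys get every copy in a single AZ. Tokens, node names, and AZ labels below are illustrative. Rack-aware `NetworkTopologyStrategy` with an EC2 snitch (which maps each AZ to a rack) spreads replicas across AZs by itself, so this sketch models only the simpler ring walk.

```python
from bisect import bisect_left
from typing import List, Set, Tuple

# Ring entries are (token, node_name, availability_zone), sorted by token.
Ring = List[Tuple[int, str, str]]


def replicas_for(ring: Ring, key_token: int, rf: int) -> List[Tuple[str, str]]:
    """SimpleStrategy-style placement: the first node whose token is >= the
    key's token owns the key; the next rf - 1 nodes clockwise hold the copies."""
    tokens = [token for token, _, _ in ring]
    start = bisect_left(tokens, key_token) % len(ring)  # wrap past the last token
    owners = [ring[(start + i) % len(ring)] for i in range(rf)]
    return [(node, az) for _, node, az in owners]


def azs_covered(ring: Ring, key_token: int, rf: int) -> Set[str]:
    """Availability zones that hold at least one copy of the key."""
    return {az for _, az in replicas_for(ring, key_token, rf)}


# The same four nodes with two different token orders (illustrative tokens).
interleaved: Ring = [(0, "n1", "A"), (25, "n2", "C"), (50, "n3", "A"), (75, "n4", "C")]
grouped: Ring = [(0, "n1", "A"), (25, "n3", "A"), (50, "n2", "C"), (75, "n4", "C")]

if __name__ == "__main__":
    for name, ring in (("interleaved", interleaved), ("grouped", grouped)):
        single_az = [t for t in range(100) if len(azs_covered(ring, t, rf=2)) == 1]
        print(name, "keys with both copies in one AZ:", len(single_az))
```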

Disk Layout

Pre-Launch •  2 ephemerals in RAID 0 –  Higher throughput (I/O spreads across 2 devices for reading and writing) –  If you lose 1 device, you lose the array!

Launch •  2 ephemerals in RAID 1 –  Higher read throughput (I/O spreads across 2 devices), but not higher write throughput –  If you lose 1 device, the array keeps running in degraded mode –  ½ the available space

Current •  2 individual ephemerals –  Higher throughput (I/O spreads across 2 devices for reading and writing) –  If you lose 1 device, Cassandra stops (configurable) –  No RAID overhead

(Diagram: Eph0 and Eph1 on an AWS m2.4xl, shown in RAID 0, in RAID 1, and as individual disks)
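The trade-offs above come down to simple arithmetic on two ~800 GB ephemerals. The sketch tabulates usable space and what a single disk failure does for each layout. The function name is hypothetical, and reading "(configurable)" as Cassandra's `disk_failure_policy` setting (whose options include `stop` and `best_effort`), with each ephemeral listed separately under `data_file_directories`, is our assumption about the current setup.

```python
from typing import Dict, Tuple


def disk_layouts(disk_gb: int) -> Dict[str, Tuple[int, str]]:
    """Hypothetical helper: usable space (GB) and the effect of one failed disk
    for each two-ephemeral layout on the slide."""
    return {
        # Striping: all space is usable and I/O spreads over both devices,
        # but losing either device loses the whole array.
        "RAID 0": (2 * disk_gb, "array lost"),
        # Mirroring: every write goes to both devices, so only one device's
        # worth of space is usable; the array survives one failure, degraded.
        "RAID 1": (disk_gb, "array keeps running, degraded"),
        # Individual disks (assumed: one data_file_directories entry each):
        # all space usable, no RAID layer; the reaction to a failed disk is
        # assumed to come from Cassandra's disk_failure_policy (e.g. stop).
        "individual disks": (2 * disk_gb, "Cassandra reacts per its disk failure policy"),
    }


if __name__ == "__main__":
    for name, (usable, on_failure) in disk_layouts(800).items():
        print(f"{name:17} {usable:5d} GB usable; one disk lost -> {on_failure}")
```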

Cluster Resizing

Thrift Payload Size •  thrift_framed_transport_size_in_mb •  thrift_max_message_length_in_mb
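Both settings bound how much a single Thrift request may carry, so large batches of writes have to be split on the client. A hedged sketch of greedy batch splitting follows; `split_into_batches`, the 15 MB frame limit, and the 10% headroom for framing overhead are assumptions, so check the values in your own `cassandra.yaml`.

```python
from typing import List

MB = 1024 * 1024


def split_into_batches(mutation_sizes: List[int], frame_limit_mb: int = 15,
                       headroom: float = 0.9) -> List[List[int]]:
    """Hypothetical helper: greedily group serialized mutation sizes (bytes)
    into batches whose total stays under headroom * frame limit, leaving room
    for Thrift framing overhead. Raises if one mutation can never fit."""
    budget = int(frame_limit_mb * MB * headroom)
    batches: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in mutation_sizes:
        if size > budget:
            raise ValueError(f"mutation of {size} bytes exceeds the {budget}-byte budget")
        if current and current_size + size > budget:
            batches.append(current)  # close the full batch, start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = [4 * MB, 6 * MB, 5 * MB, 1 * MB]
    print([sum(b) // MB for b in split_into_batches(sizes)])
```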

Bouncing Nodes phi_convict_threshold  
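`phi_convict_threshold` is the cut-off for Cassandra's phi accrual failure detector: the longer a node's gossip heartbeats go missing relative to their usual spacing, the higher its phi, and a node whose phi exceeds the threshold (8 by default) is marked down. Raising it is a common remedy for nodes that bounce up and down on noisy networks such as EC2. The sketch below models phi assuming exponentially distributed heartbeat intervals, phi = (elapsed / mean) / ln(10), which to our understanding is the approximation Cassandra's detector uses; `ArrivalWindow` and `convicted` here are simplified stand-ins, not Cassandra's own classes.

```python
import math
from collections import deque
from typing import Deque, Optional

PHI_FACTOR = 1.0 / math.log(10.0)  # converts natural-log units to log10 units


class ArrivalWindow:
    """Simplified stand-in: phi suspicion level from heartbeat inter-arrival
    times, assuming exponential arrivals: phi = PHI_FACTOR * elapsed / mean."""

    def __init__(self, size: int = 1000) -> None:
        self.intervals: Deque[float] = deque(maxlen=size)  # recent gaps, seconds
        self.last: Optional[float] = None                   # last heartbeat time

    def heartbeat(self, now: float) -> None:
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now: float) -> float:
        if self.last is None or not self.intervals:
            return 0.0  # no history yet, so no suspicion
        mean = sum(self.intervals) / len(self.intervals)
        return PHI_FACTOR * (now - self.last) / mean


def convicted(window: ArrivalWindow, now: float, phi_convict_threshold: float = 8.0) -> bool:
    """A node is marked down once its phi exceeds the threshold."""
    return window.phi(now) > phi_convict_threshold


if __name__ == "__main__":
    window = ArrivalWindow()
    for t in range(11):  # heartbeats once per second, t = 0..10
        window.heartbeat(float(t))
    for silence in (10, 18, 19, 28):
        now = 10.0 + silence
        print(silence, round(window.phi(now), 2), convicted(window, now))
```

Under this model, with heartbeats about once a second, the default threshold convicts a node after roughly 8 × ln(10) ≈ 18.4 s of silence, and a threshold of 12 stretches that to about 27.6 s, which can be enough to ride out GC pauses or brief network hiccups.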

Inter-DC Latency

Monitor system health •  Nagios •  Kibana/Elasticsearch •  Graphite •  AWS CloudWatch •  App-level monitoring •  OpsCenter

App level metrics

Lessons Learned

Fun with Astyanax Client •  Cross-DC latencies –  Several-second latencies in the JP and EE data centers –  Astyanax configs to ensure local data centers are used •  Imbalanced node traffic –  Hashing algorithm (MD5 vs Murmur3) •  DNS caching in the JVM –  Stale seed nodes
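The slide does not say which Astyanax settings fixed the cross-DC latencies, so the sketch below shows only the general idea in plain Python: order candidate coordinators so hosts in the client's own data center always come first, and reach remote hosts only as an explicit fallback. `order_hosts`, the addresses, and the DC names are hypothetical; pairing this with a DC-local consistency level such as `LOCAL_QUORUM` is what keeps replica traffic local as well.

```python
from typing import Dict, List


def order_hosts(hosts: Dict[str, str], local_dc: str, allow_remote: bool = False) -> List[str]:
    """Hypothetical helper: return candidate coordinators, local-DC hosts first.

    `hosts` maps host address -> data center name. Remote hosts are returned
    only when allow_remote is True, and always after every local host, so
    requests are coordinated locally whenever a local host is available."""
    local = sorted(h for h, dc in hosts.items() if dc == local_dc)
    remote = sorted(h for h, dc in hosts.items() if dc != local_dc)
    return local + remote if allow_remote else local


if __name__ == "__main__":
    ring = {"10.1.0.1": "jp", "10.2.0.1": "us-west", "10.1.0.2": "jp"}
    print(order_hosts(ring, "jp", allow_remote=True))
```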

A tale of 2 Nodes

Cluster lessons •  A single bad node can raise app latencies significantly •  Taking out an entire Cassandra cluster is easy (not so fun) –  Compressing data before sending it to Cassandra helps a lot •  A corrupted SSTable resulted in cascading failure
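Compressing values before they leave the client shrinks what crosses the network, lands in the commit log, and sits in memtables, independent of any on-disk SSTable compression. The talk does not say which format or codec was used; JSON plus `zlib` below, the `encode_value`/`decode_value` helpers, and the sample data are assumptions for illustration.

```python
import json
import zlib


def encode_value(doc: dict, level: int = 6) -> bytes:
    """Hypothetical helper: serialize a value compactly and zlib-compress it
    on the client before it is written to Cassandra as a blob."""
    raw = json.dumps(doc, separators=(",", ":"), sort_keys=True).encode("utf-8")
    return zlib.compress(raw, level)


def decode_value(blob: bytes) -> dict:
    """Inverse of encode_value, applied after reading the blob back."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical, highly repetitive per-user document.
    library = {"titles": [{"id": i, "platform": "PS4", "state": "owned"} for i in range(200)]}
    blob = encode_value(library)
    raw = json.dumps(library, separators=(",", ":"), sort_keys=True).encode("utf-8")
    print(f"{len(raw)} bytes raw -> {len(blob)} bytes compressed")
```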

•  Monitoring – Memtable flush frequency – Hinted handoffs – Garbage collection – Compactions – Histograms
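Since one slow node can drag up application latencies, a per-node view of tail latency is a useful companion to the histograms above. A minimal sketch, assuming request latencies are already collected per Cassandra node on the client side: compute each node's p99 with the nearest-rank method and flag nodes far above the cluster median. `slow_nodes`, the 99th percentile, and the 3× factor are arbitrary illustrative choices.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slow_nodes(latencies_ms: Dict[str, List[float]], pct: float = 99.0,
               factor: float = 3.0) -> List[str]:
    """Hypothetical check: flag nodes whose tail latency is more than `factor`
    times the cluster median of per-node tail latencies."""
    tails = {node: percentile(samples, pct) for node, samples in latencies_ms.items()}
    ranked = sorted(tails.values())
    median = ranked[len(ranked) // 2]  # upper median for an even node count
    return sorted(node for node, tail in tails.items() if tail > factor * median)


if __name__ == "__main__":
    observed = {
        "10.0.1.10": [2.0] * 100,                 # placeholder node addresses
        "10.0.2.10": [2.5] * 100,
        "10.0.1.11": [2.0] * 90 + [80.0] * 10,    # a node with a bad tail
    }
    print(slow_nodes(observed))
```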

•  VPNs are a dangerous bottleneck

•  Easier to rebuild a node than to fix it

•  Back up data –  Replication factor helps but does not account for data corruption

•  Denormalization costs •  Disk is cheap but EC2 instances are not •  TTL on almost everything •  Adjust gc_grace_seconds based on TTL times •  Transactions? Be creative •  Load test with real data
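TTL and tombstone grace interact: an expired cell is treated like a tombstone, and compaction may only drop it once the table's `gc_grace_seconds` (default 864000, i.e. 10 days) has passed since it expired. The sketch below does that arithmetic; the helper names are hypothetical, and the result is a lower bound, since a compaction must still touch the SSTable afterwards.

```python
from datetime import datetime, timedelta

DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra's default: 10 days


def earliest_purge(written_at: datetime, ttl_seconds: int,
                   gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> datetime:
    """Hypothetical helper: earliest time compaction may drop a TTL'd cell.
    The cell expires at written_at + ttl, then must age past gc_grace_seconds
    like a tombstone; a compaction must still run on its SSTable after that."""
    return written_at + timedelta(seconds=ttl_seconds + gc_grace_seconds)


def disk_retention_days(ttl_seconds: int,
                        gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> float:
    """Minimum number of days a TTL'd cell occupies disk after it is written."""
    return (ttl_seconds + gc_grace_seconds) / 86_400


if __name__ == "__main__":
    one_day = 86_400
    print(disk_retention_days(one_day), disk_retention_days(one_day, 3 * one_day))
```

For example, 1-day TTL data stays on disk at least 11 days with the default grace period but only 4 days with a 3-day grace period. Lowering the grace period is only safe if repairs (or the absence of explicit deletes) guarantee tombstones reach every replica within that window; otherwise deleted data can reappear.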

•  Replication strategy: –  Read/write pattern –  Whether the data is the source of truth –  Data locality –  User-level data vs app-level data •  Cluster-wide commands should be staggered –  Global repair :(
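Staggering can be as simple as running a cluster-wide operation one node at a time with a cool-down in between. A hedged sketch that only computes such a plan follows; `stagger` and `repair_command` are hypothetical helpers, the durations and addresses are placeholders, and a real runner should wait for each repair to finish rather than trust an estimate. `nodetool -h <host> repair -pr` repairs only the node's primary range, so running it on every node in turn covers the ring once.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def stagger(nodes: List[str], start: datetime, per_node: timedelta,
            gap: timedelta) -> List[Tuple[str, datetime]]:
    """Hypothetical planner: give each node a start time one estimated run
    (per_node) plus a cool-down (gap) after the previous node, so only one
    node is doing the heavy work at any time."""
    plan: List[Tuple[str, datetime]] = []
    at = start
    for node in nodes:
        plan.append((node, at))
        at += per_node + gap
    return plan


def repair_command(host: str) -> str:
    """Primary-range repair for one node, run through nodetool."""
    return f"nodetool -h {host} repair -pr"


if __name__ == "__main__":
    nodes = ["10.0.1.10", "10.0.2.10", "10.0.1.11"]  # placeholder addresses
    for host, when in stagger(nodes, datetime(2014, 11, 27, 1, 0),
                              per_node=timedelta(hours=2), gap=timedelta(minutes=30)):
        print(when.isoformat(), repair_command(host))
```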

Tokens •  Vnodes vs assigned tokens –  Increased chattiness of the gossip protocol with vnodes –  Perceived slowness of repair and cleanup operations on a vnode-enabled cluster –  The Astyanax client does not like vnodes…
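With assigned tokens, each node's single token (the `initial_token` setting in `cassandra.yaml`, used instead of `num_tokens` for vnodes) has to be chosen so the ring is split evenly. The formulas below are the commonly used evenly spaced tokens for `Murmur3Partitioner` (range -2^63 to 2^63 - 1) and `RandomPartitioner` (MD5-based, range 0 to 2^127); the function names are our own.

```python
from typing import List


def murmur3_tokens(node_count: int) -> List[int]:
    """Evenly spaced initial_token values for Murmur3Partitioner,
    whose tokens span [-2**63, 2**63 - 1]."""
    step = 2**64 // node_count
    return [step * i - 2**63 for i in range(node_count)]


def random_partitioner_tokens(node_count: int) -> List[int]:
    """Evenly spaced initial_token values for RandomPartitioner (MD5-based,
    tokens span 0 .. 2**127)."""
    step = 2**127 // node_count
    return [step * i for i in range(node_count)]


if __name__ == "__main__":
    for i, token in enumerate(murmur3_tokens(6)):
        print(f"node {i}: initial_token: {token}")
```

Doubling the node count keeps every existing token (the new nodes land exactly halfway between old ones), which is one reason assigned-token clusters are often resized by doubling. The partitioner matters on the client side too: token-aware routing must hash keys with the same algorithm the cluster uses.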

Compactions •  Compactions are your worst enemy –  Larger disk usage = higher CPU and longer compactions •  Leveled compaction vs size-tiered compaction –  Startup time –  CPU tradeoff –  I/O tradeoff •  Updates + removals eat up disk
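Size-tiered compaction groups SSTables of similar size and merges a group once it holds enough of them, so the biggest merges happen late and involve the most data. The sketch below is a simplified version of that bucketing using the documented default subproperties (`bucket_low` 0.5, `bucket_high` 1.5, `min_threshold` 4); it omits details such as `min_sstable_size`, and the function names are hypothetical.

```python
from typing import List


def stcs_buckets(sizes_mb: List[float], bucket_low: float = 0.5,
                 bucket_high: float = 1.5) -> List[List[float]]:
    """Simplified size-tiered bucketing: walk SSTables from smallest to largest
    and put each into the first bucket whose average size avg satisfies
    avg * bucket_low <= size <= avg * bucket_high; otherwise open a new bucket."""
    buckets: List[List[float]] = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb: List[float], min_threshold: int = 4) -> List[List[float]]:
    """Buckets holding at least min_threshold similarly sized SSTables."""
    return [b for b in stcs_buckets(sizes_mb) if len(b) >= min_threshold]


if __name__ == "__main__":
    sstables = [100, 110, 95, 105, 1000, 1100, 4000]  # illustrative sizes, MB
    print(stcs_buckets(sstables))
    print(compaction_candidates(sstables))
```

Merging a bucket writes a new SSTable roughly as large as its inputs until overwrites and tombstones collapse, so the top tiers need free disk comparable to their own size. That is why updates and removals eat disk, and why larger data sets per node mean longer, heavier compactions.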

We are hiring… sonyentertainmentnetwork.com/careers
