real world repairs

Real World Repairs @ Netflix Vinay Chella Cassandra MVP, Cloud Data Architect

Upload: vinay-kumar-chella

Post on 16-Apr-2017

TRANSCRIPT

Page 1: Real world repairs

Real World Repairs @ Netflix

Vinay Chella

Cassandra MVP, Cloud Data Architect

Page 7: Real world repairs

Who am I?

Page 8: Real world repairs

Vinay Chella

Cloud Data Architect

Cassandra MVP

Cloud Database Engineering @ Netflix

Page 9: Real world repairs

Agenda

• What is Repair and Why?

• How do we Repair @ Netflix?

• Use cases

• Pain points

• Alternatives to Repair

• When to / not to repair

• What is missing in C*

Page 10: Real world repairs

Why is Repair Important?

• Entropy is inevitable

• Inconsistent data might impact your business

• Data Resurrection

Page 11: Real world repairs

What is Repair

Page 12: Real world repairs

Repair in Netflix world

Case 1:

• ~300 TB cluster size

• ~1 TB per node

• SSDs

• 288 Nodes - Multi Region Cluster

• Critical data, No TTLs

Case 2:

• ~180 TB cluster size

• ~3.6 TB per node

• HDDs

• 48 Nodes – Island cluster

• No TTLs, long lived

Page 13: Real world repairs

How do we Repair @ Netflix?

• Primary Range (-pr)

• Subrange (-st and -et)

• On every node in every dc sequentially

• Parallel (-par) & SEQUENTIAL
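
The options above map onto `nodetool repair` flags. The following is a minimal sketch, not the tooling used at Netflix, of how one invocation could be assembled. The function name `build_repair_command` and its parameters are hypothetical. It assumes a Cassandra version where sequential repair is the default when `-par` is omitted, and it treats `-pr` and an explicit `-st`/`-et` subrange as mutually exclusive to keep the intent of each call unambiguous.

```python
from typing import List, Optional


def build_repair_command(
    keyspace: str,
    primary_range: bool = False,
    start_token: Optional[int] = None,
    end_token: Optional[int] = None,
    parallel: bool = False,
) -> List[str]:
    """Return an argv list for a single `nodetool repair` invocation."""
    has_subrange = start_token is not None or end_token is not None
    if has_subrange and (start_token is None or end_token is None):
        raise ValueError("start_token and end_token must be given together")
    if has_subrange and primary_range:
        raise ValueError("choose either primary range or an explicit subrange")

    cmd = ["nodetool", "repair"]
    if primary_range:
        cmd.append("-pr")  # repair only the node's primary range
    if has_subrange:
        cmd += ["-st", str(start_token), "-et", str(end_token)]
    if parallel:
        cmd.append("-par")  # assumption: omitting this means sequential
    cmd.append(keyspace)
    return cmd


# Example: a primary-range repair of a hypothetical keyspace.
print(build_repair_command("my_keyspace", primary_range=True))
```

Running the resulting command on every node in every DC, one node at a time, corresponds to the sequential rollout described above.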

Page 14: Real world repairs

How do we Repair @ Netflix? Cont..

• More granular repair: selected tables only

• On a scheduled basis, via Jenkins

• Repair progress tracking

Page 15: Real world repairs

Impacts of Unrepaired C* / Use Cases

• Incorrect bookmark index

• Losing the ratings

• Recommendations impact

• Broken recently watched

Page 16: Real world repairs

Pain Points

• Over streaming

• Stream timeouts

• Stuck repairs

• Tracking/ resuming repair

• Compaction bottlenecks

• Disk I/O

• CPU usage

• Latencies/ SLA

• Wide partitions

Page 17: Real world repairs

How do we live with it?

Page 18: Real world repairs

Well, we don’t, we fix it.

Page 19: Real world repairs

Over streaming

• Subrange repair

- Techniques

- Algorithm

Page 20: Real world repairs

Subrange Repair

- Get subranges: https://github.com/pauloricardomg/cassandra-list-subranges

▪ Thrift: describe_splits_ex

- Repair those ranges using https://github.com/BrianGallew/cassandra_range_repair
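
To illustrate the idea of splitting a range before repairing it, here is a minimal sketch. Note that it swaps in a simpler technique than the tools linked above: those obtain splits by key count (via `describe_splits_ex`), whereas this sketch divides a token range into equal-width pieces, which only approximates even work when data is uniformly distributed. The function name `split_range` is hypothetical, and the token bounds assume Murmur3Partitioner.

```python
from typing import List, Tuple

MIN_TOKEN = -(2**63)   # Murmur3Partitioner lower bound (assumed partitioner)
MAX_TOKEN = 2**63 - 1  # Murmur3Partitioner upper bound


def split_range(start: int, end: int, parts: int) -> List[Tuple[int, int]]:
    """Split the token range (start, end] into up to `parts` contiguous subranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    if end <= start:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    bounds = [start + (width * i) // parts for i in range(parts)] + [end]
    # Drop empty pieces, which appear when parts exceeds the range width.
    return [
        (bounds[i], bounds[i + 1])
        for i in range(parts)
        if bounds[i] < bounds[i + 1]
    ]


# Example: split the full ring into 8 subranges.
for st, et in split_range(MIN_TOKEN, MAX_TOKEN, 8):
    print(st, et)
```

Each `(st, et)` pair can then be passed to a repair of that subrange only, so a failure costs one small range rather than the whole primary range.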

Page 21: Real world repairs

Subrange Repair

Regular repair:

• Size per Node – 950 GB

- Time to repair – ~11 Hours

- Primary Range

- SERIAL

• Size per Node – 550 GB

- Time to repair – ~24 Hours

- Primary Range

- SERIAL

- Wide partitions

Subrange repair:

• Size per Node – 950 GB

- Time to repair – ~2.5 Hours

- Primary Range

- Threads #: 5

- Split size: 64K

• Size per Node – 550 GB

- Time to repair – ~5.5 Hours

- Primary Range

- Threads #: 10

- Split size: 32K

- Wide partitions
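
As a quick worked check of the figures above, the snippet below computes the ratio of regular to subrange repair time for the two node sizes. The times are the approximate values from the slide. The dictionary labels and the `speedup` helper are illustrative only.

```python
# Approximate repair times in hours, taken from the comparison above:
# (regular repair, subrange repair)
runs = {
    "950 GB per node": (11.0, 2.5),
    "550 GB per node, wide partitions": (24.0, 5.5),
}


def speedup(regular_hours: float, subrange_hours: float) -> float:
    """Ratio of regular repair time to subrange repair time."""
    return regular_hours / subrange_hours


for label, (regular, subrange) in runs.items():
    print(f"{label}: {speedup(regular, subrange):.1f}x faster")
```

Both cases come out at roughly 4.4x, even though the second cluster has wide partitions and uses a different thread count and split size.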

Page 22: Real world repairs

To overcome stream timeouts

• YAML Tunings

- streaming_socket_timeout_in_ms

- stream_throughput_outbound_megabits_per_sec

- inter_dc_stream_throughput_outbound_megabits_per_sec

Page 23: Real world repairs

To overcome resume issue

• Track it

- Record the progress

- Record the status

• Resumability

- Resume from failed or paused range
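
One way to picture tracking and resuming is a small status store keyed by subrange. The sketch below is an assumption-laden illustration, not the tracker described in the talk: the class `RepairTracker`, the JSON file format, and the state names `running`, `done`, and `failed` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Range = Tuple[int, int]


class RepairTracker:
    """Persist per-subrange repair status to a JSON file (hypothetical format)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.status: Dict[str, str] = {}
        if self.path.exists():
            # Reload earlier progress so a restarted job can resume.
            self.status = json.loads(self.path.read_text())

    @staticmethod
    def _key(rng: Range) -> str:
        return f"{rng[0]}:{rng[1]}"

    def record(self, rng: Range, state: str) -> None:
        """Record 'running', 'done' or 'failed' and flush to disk."""
        self.status[self._key(rng)] = state
        self.path.write_text(json.dumps(self.status))

    def pending(self, ranges: Iterable[Range]) -> List[Range]:
        """Ranges still needing repair: never started, failed, or interrupted."""
        return [r for r in ranges if self.status.get(self._key(r)) != "done"]
```

A resumed job would construct the tracker from the same file and iterate over `pending(all_ranges)`, so ranges already marked `done` are skipped and a range left in `running` by a crash is retried.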

Page 24: Real world repairs

To fix stuck repairs

• Use Repair Tracking data

• Timeouts for long-running repairs

- JMX notifications

• Self-healing repair (automate it!)

- Grep for repair logs

- nodetool compactionstats

- nodetool netstats

- JMX: forceTerminateAllRepairSessions()
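
A timeout around each repair invocation is one simple building block for this kind of automation. The sketch below is a hedged illustration only: `run_with_timeout` and its `on_timeout` hook are hypothetical names, and the command it runs is whatever the caller supplies. Killing a client process does not necessarily stop repair sessions on the server, so the hook is where an operator's own cleanup (for example the JMX call listed above) would be wired in.

```python
import subprocess
from typing import Callable, List, Optional


def run_with_timeout(
    cmd: List[str],
    timeout_s: float,
    on_timeout: Optional[Callable[[], None]] = None,
) -> str:
    """Run cmd and return 'ok', 'failed' or 'timed_out'."""
    try:
        # subprocess.run kills the child process if the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        if on_timeout is not None:
            on_timeout()  # caller-supplied cleanup for the stuck repair
        return "timed_out"
    return "ok" if result.returncode == 0 else "failed"
```

The returned status can be written to the repair tracking data, so a timed-out range is retried later instead of blocking the whole schedule.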

Page 25: Real world repairs

To avoid compaction bottlenecks

• Why compactions?

- Validation phase

• Tune compaction settings

- compaction_throughput_mb_per_sec

Page 26: Real world repairs

Disk I/O issues

• SSD

Page 27: Real world repairs

High CPU Usage?

• Throttled streaming

- nodetool setstreamthroughput

• Throttling subrange repair

- Key split

- # threads
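
The thread-count knob can be pictured as a bounded worker pool over the list of subranges. This is a minimal sketch under stated assumptions: `repair_subranges` is a hypothetical name, and `repair_fn` stands in for whatever actually repairs one subrange and reports success.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple

Range = Tuple[int, int]


def repair_subranges(
    ranges: Iterable[Range],
    repair_fn: Callable[[Range], bool],
    threads: int,
) -> Dict[Range, bool]:
    """Repair subranges with at most `threads` repairs in flight at once."""
    ranges = list(ranges)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with ranges.
        results = list(pool.map(repair_fn, ranges))
    return dict(zip(ranges, results))
```

Lowering `threads` reduces concurrent validation and streaming work on the node at the cost of a longer total repair time, and a smaller key split shrinks the work each call does.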

Page 28: Real world repairs

To minimize impacts on latencies

• Running Repairs during off-peak hours

• Switch to SERIAL repair instead of PARALLEL

• If using subrange repair

- Reduce # threads

- Use smaller subranges

Page 29: Real world repairs

Wide partition issues

• Reduce the partition size

Page 30: Real world repairs

Other ways to Repair

Page 31: Real world repairs

Other ways to Repair: Tickler

• Row Tickler

• All Row Tickler

Page 32: Real world repairs

When to Repair

• Multi-Region clusters

• Low-consistency reads and writes

• Frequent node outages

• Low read-repair chance

• Flaky networks

Page 33: Real world repairs

When not to Repair

• Short TTLs

• High-consistency reads and writes

• High read repair chance settings in island clusters

Page 34: Real world repairs

What is missing in C*

• Metrics and insights

• Incremental Repair bug fixes

• CASSANDRA-10070 - Automatic repair scheduling

• Mutation-based Repairs?

Page 35: Real world repairs

Q & A