Growing into a proactive Data Platform


TRANSCRIPT

Page 1: Growing into a proactive Data Platform
Page 2: Growing into a proactive Data Platform

Yaar Reuveni & Nir Hedvat

Becoming a Proactive Data Platform

Page 3: Growing into a proactive Data Platform

Yaar Reuveni

• 6 years at LivePerson
• 1 year in Reporting & BI
• 3 years in Data Platform
• 2 years as Data Platform team lead
• I love to travel
• And

Page 4: Growing into a proactive Data Platform

Nir Hedvat

• Software Engineer, B.Sc.
• 3 years as a C++ Developer at IBM Rational Rhapsody™
• 1.5 years at LivePerson
• Cloud and Parallel Computing Enthusiast
• Love Math and Powerlifting

Page 5: Growing into a proactive Data Platform

Agenda

• Our Scale & Operation
• Evolution in becoming proactive
  i. Hope & Low awareness
  ii. Storming & Troubleshooting
  iii. Fortifying
  iv. Internalization & Comprehension
  v. Being Proactive
• Showcases
• Implementation

Page 6: Growing into a proactive Data Platform

Our Scale

• 2M daily chats
• 100M daily monitored visitor sessions
• 20B events per day
• 2TB raw data per day
• 2PB total in Hadoop clusters
• Hundreds of producers × event types × consumers

Page 7: Growing into a proactive Data Platform

LivePerson technology stack

Page 8: Growing into a proactive Data Platform

Stage 1: Hope & Low awareness
We built it and it's awesome

[Diagram: online and offline producers write local files, which DSPT jobs load into the raw data store. DSPT = Data Single Point of Truth.]

Page 9: Growing into a proactive Data Platform

Stage 1: Hope & Low awareness
We've got customers

[Diagram: consumers of the platform - dashboards, data science, apps, reporting, data access, ad-hoc queries.]

Page 10: Growing into a proactive Data Platform

Stage 2: Storming & Troubleshooting
You've got NOC & SCS on speed dial

Issues arise:
• Data loss
• Data delays
• Partial data out of frame
• Missing/faulty calculations for consumers
• One producer does not send for over a week

Page 11: Growing into a proactive Data Platform

Stage 2: Storming & Troubleshooting
You've got NOC & SCS on speed dial

Common issue types and generators:
• Hadoop ops
• Production ops
• Events schema
• New data producers
• High new-features rate (LE2.0)
• Data stuck in pipeline
• Bugs

Page 12: Growing into a proactive Data Platform

Stage 3: Fortifying
Every interruption drives a new protection

Page 13: Growing into a proactive Data Platform

Stage 3: Fortifying
Every interruption drives a new protection

Page 14: Growing into a proactive Data Platform

Stage 3: Fortifying
Every interruption drives a new protection

• Monitors on jobs, failures, success rate
• Monitors on service status
• Simple data freshness checks, e.g. measure the newest event
• Measure latency of specific parts of the pipeline

Page 15: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
Auditing requirements

• Measure principles:
  – Loss
    • How much?
    • Which customer?
    • What type?
    • Where in the pipeline?
  – Freshness
    • Percentiles
    • Trends
  – Statistics
    • Event type count
    • Events per LP customer
    • Trends

Page 16: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
Auditing architecture

[Architecture diagram: multiple producers emit events plus audit control events; the Audit Loader and the Audit Aggregator write audit data into the Audit DB, alongside a Freshness job.]

Page 17: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
Mechanism

1. Enrich events with audit metadata: each event carries its data, a common header, and an audit header.
2. Send control events every X minutes: each control event carries the audit aggregation (common header + audit header) for the closed bulk.

Page 18: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
Mechanism

[Diagram contrasting the Old Data Flow (events carry only data + common header) with the Audited Data Flow (events carry data + common header + audit header, and a control event with the audit aggregation is emitted per bulk).]

Page 19: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
How to measure loss?

• Tag all events going through our API with an auditing header:

  <host_name>:<bulk_id>:<sequence_id>

  Where:
  • host_name - the logical identification of the producer server
  • bulk_id - an arbitrary unique number that identifies a bulk (changes every X minutes)
  • sequence_id - an auto-incremented, persistent number used to identify missing bulks

• Every X minutes, send an audit control event (see the sketch below):

  {eventType: AuditControlEvent, Bulks: [
    {bulk_id: "srv-xyz:111:97", data_tier: "shark producer", total_count: 785},
    {bulk_id: "srv-xyz:112:98", data_tier: "shark producer", total_count: 1715}]}
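A minimal sketch of this tagging scheme, assuming one producer process, a single data tier, and in-memory counters; class, method, and tier names are hypothetical, not LivePerson's actual producer code:

public class AuditingProducer {
    private final String hostName; // logical identity of the producer server
    private long bulkId;           // arbitrary unique id, rotated every X minutes
    private long sequenceId;       // persistent auto-increment, detects missing bulks
    private long bulkCount;        // events tagged in the current bulk

    public AuditingProducer(String hostName, long lastPersistedSequenceId) {
        this.hostName = hostName;
        this.sequenceId = lastPersistedSequenceId + 1;
        this.bulkId = System.currentTimeMillis();
    }

    // 1. Enrich every outgoing event with <host_name>:<bulk_id>:<sequence_id>.
    public synchronized String nextAuditHeader() {
        bulkCount++;
        return hostName + ":" + bulkId + ":" + sequenceId;
    }

    // 2. Every X minutes, close the bulk and emit a control event carrying
    //    the total count, then start a new bulk.
    public synchronized String closeBulkAndBuildControlEvent(String dataTier) {
        String controlEvent = String.format(
            "{eventType: AuditControlEvent, Bulks: [{bulk_id:\"%s:%d:%d\","
            + " data_tier:\"%s\", total_count:%d}]}",
            hostName, bulkId, sequenceId, dataTier, bulkCount);
        bulkId = System.currentTimeMillis(); // new arbitrary bulk id
        sequenceId++;                        // should also be persisted
        bulkCount = 0;
        return controlEvent;
    }
}

Persisting sequence_id across restarts is what lets the pipeline notice bulks that never arrived at all, not just bulks that arrived incomplete.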

Page 20: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
What's next?

• Immediate gain: enables researching loss directly on the raw data

Next:
• Count events per auditing bulk
• Load into a DB for dashboarding:

In this example, assume you look at the table after 11:34 and treat anything that takes more than 3 hours to arrive as loss. Server srv-xyz reported 750 events created for bulk srv-xyz:1a2b3c:25, but only 405 + 250 = 655 of them arrived in HDFS within 3 hours, so we detect a loss of 95 events from this server.

Audit metadata    | Data Tier | Insertion time | Events count
srv-xyz:1a2b3c:25 | Producer  | 08:34          | 750
srv-xyz:1a2b3c:25 | HDFS      | 09:05          | 405
srv-xyz:1a2b3c:25 | HDFS      | 10:13          | 250
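Given such a table, a periodic check can flag every bulk whose producer-side count is not covered by downstream counts within the 3-hour window. A sketch of that check, assuming a MySQL table named audit with the columns shown above (names and schema are illustrative, not the real ones):

import java.sql.*;

public class LossDetector {
    public static void main(String[] args) throws SQLException {
        // For each producer bulk, subtract the HDFS counts that arrived
        // within 3 hours; anything left over is detected loss.
        String sql =
            "SELECT p.audit_metadata, "
          + "       p.events_count - COALESCE(SUM(h.events_count), 0) AS lost "
          + "FROM audit p "
          + "LEFT JOIN audit h ON h.audit_metadata = p.audit_metadata "
          + "   AND h.data_tier = 'HDFS' "
          + "   AND h.insertion_time <= p.insertion_time + INTERVAL 3 HOUR "
          + "WHERE p.data_tier = 'Producer' "
          + "GROUP BY p.audit_metadata, p.events_count "
          + "HAVING lost > 0";
        try (Connection c = DriverManager.getConnection("jdbc:mysql://localhost/audit_db");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("bulk %s lost %d events%n",
                    rs.getString("audit_metadata"), rs.getLong("lost"));
            }
        }
    }
}

On the sample rows above this reports exactly the 95 lost events for bulk srv-xyz:1a2b3c:25.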

Page 21: Growing into a proactive Data Platform

Stage 4: Internalization & Comprehension
How to measure freshness?

• Run incrementally on the raw data
• Group events by:
  – Total
  – Event type
  – LP customer
• Per event, calculate insertion time - creation time
• Per group, compute (see the sketch below):
  – Total count
  – Min, max & average
  – Count into time buckets (0-30; 30-60; 60-120; 120-∞ minutes)
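A condensed sketch of the per-group computation, with Event as a hypothetical holder of epoch-millisecond timestamps (the real job reads Avro events from HDFS):

import java.util.List;

public class FreshnessStats {
    record Event(long creationTime, long insertionTime) {}

    // Bucket boundaries in minutes: 0-30, 30-60, 60-120, 120-inf.
    private static final long[] BUCKETS_MIN = {30, 60, 120};

    // Assumes a non-empty group.
    public static void compute(List<Event> group) {
        long count = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
        long[] bucketCounts = new long[BUCKETS_MIN.length + 1];
        for (Event e : group) {
            // Freshness latency = insertion time - creation time, in minutes.
            long latencyMin = (e.insertionTime() - e.creationTime()) / 60_000;
            count++;
            min = Math.min(min, latencyMin);
            max = Math.max(max, latencyMin);
            sum += latencyMin;
            int b = 0;
            while (b < BUCKETS_MIN.length && latencyMin >= BUCKETS_MIN[b]) b++;
            bucketCounts[b]++;
        }
        System.out.printf("count=%d min=%d max=%d avg=%.1f buckets=%s%n",
            count, min, max, (double) sum / count,
            java.util.Arrays.toString(bucketCounts));
    }
}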

Page 22: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - loss dashboard

Page 23: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - loss detailed dashboard

Page 24: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - loss trends

Page 25: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - freshness

Page 26: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - freshness

Page 27: Growing into a proactive Data Platform

Stage 5: Being Proactive
Tools - data statistics

Page 28: Growing into a proactive Data Platform

Showcase I
Bug in a new producer

Page 29: Growing into a proactive Data Platform

Showcase II
Deployment issue

• Constant loss
• Only in one farm
• Depends on traffic
• Only a specific producer type
• From all of its nodes

Page 30: Growing into a proactive Data Platform

Showcase III
Consumer jobs issues

• Our auditing detected a loss in Alpha
• Data was stuck in a job failure dir
• Functional monitoring missed it
• We streamed the stuck data

Page 31: Growing into a proactive Data Platform

Showcase IV
Producer issues

• Offline producer gets stuck
• Functional monitoring misses it

Page 32: Growing into a proactive Data Platform

Implementation
Auditing architecture

[Architecture diagram: multiple producers emit events plus audit control events; the Audit Loader and the Audit Aggregator write audit data into the Audit DB, alongside a Freshness job.]

Page 33: Growing into a proactive Data Platform

Implementation
Auditing architecture

[Auditing architecture diagram, repeated as a section divider.]

Page 34: Growing into a proactive Data Platform

Implementation
Audit Loader

• Storm topology
• Loads audit events from Kafka into MySQL (sketched below)

Bulk    | Tier | TS    | Count
xyz:123 | WRPA | 08:34 | 750
xyz:123 | DSPT | 09:05 | 405
xyz:123 | DSPT | 10:13 | 250

[Diagram: Audit Events → Audit Loader → Audit DB]
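An illustrative bolt in the spirit of the Audit Loader: it receives already-parsed audit tuples and writes them to MySQL, acking only after a successful insert so Storm replays failed tuples. Field, table, and connection details are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class AuditLoaderBolt extends BaseRichBolt {
    private transient Connection db;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // Open the connection per worker, never in the constructor
            // (bolts are serialized when the topology is submitted).
            db = DriverManager.getConnection("jdbc:mysql://localhost/audit_db");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        String sql = "INSERT INTO audit (bulk, tier, ts, count) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, tuple.getStringByField("bulk"));
            ps.setString(2, tuple.getStringByField("tier"));
            ps.setTimestamp(3, new java.sql.Timestamp(tuple.getLongByField("ts")));
            ps.setLong(4, tuple.getLongByField("count"));
            ps.executeUpdate();
            collector.ack(tuple);  // ack only after a successful write
        } catch (Exception e) {
            collector.fail(tuple); // let Storm replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: no output stream.
    }
}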

Page 35: Growing into a proactive Data Platform

Implementation
Auditing architecture

[Auditing architecture diagram, repeated as a section divider.]

Page 36: Growing into a proactive Data Platform

Implementation
Audit Aggregator

• Spark implementation (see the sketch below)
• Loads data from HDFS
• Aggregates events according to audit metadata
• Saves aggregated audit data to MySQL
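A sketch of the aggregation step in Spark's Java API; extractAuditHeader and saveToAuditDb stand in for the real Avro parsing and MySQL writes, and the path is illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AuditAggregator {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("audit-aggregator");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> events = sc.textFile("hdfs:///data/raw/2017-04-13/*");

            // Count events per <host>:<bulk>:<sequence> audit header.
            JavaPairRDD<String, Long> countsPerBulk = events
                .mapToPair(e -> new Tuple2<>(extractAuditHeader(e), 1L))
                .reduceByKey(Long::sum);

            // collect() is safe here: one row per bulk, not per event.
            countsPerBulk.collect()
                .forEach(t -> saveToAuditDb(t._1(), t._2()));
        }
    }

    private static String extractAuditHeader(String rawEvent) {
        return rawEvent.split("\t", 2)[0]; // hypothetical record layout
    }

    private static void saveToAuditDb(String bulk, long count) {
        // JDBC insert into the audit table, as in the Audit Loader sketch.
    }
}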

Page 37: Growing into a proactive Data Platform

Audit Aggregator job
First Generation

[Diagram: data is read incrementally from HDFS and aggregated per audit bulk (∑#1 = N1, ∑#2 = N2, ∑#3 = N3), then collected & saved to the DB, while the processed offset is saved separately to ZooKeeper.]

Page 38: Growing into a proactive Data Platform

Audit Aggregator job
Overcoming Pitfalls

• Our jobs run incrementally or manually
• Offset management by ZooKeeper
• Failing during the saving stage leads to a lost offset
• The fix: save data and offset on the same stream (see the sketch below)
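The essence of the fix, sketched as a single JDBC transaction that persists the aggregates and the new offset together, so a crash can never persist one without the other; schema and names are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransactionalSaver {
    // Each aggregate is a (bulkHash, count) pair.
    public static void save(Iterable<long[]> aggregates, long newOffset) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/audit_db")) {
            db.setAutoCommit(false); // one transaction covers data + offset
            try (PreparedStatement data = db.prepareStatement(
                     "INSERT INTO audit_agg (bulk_hash, count) VALUES (?, ?)");
                 PreparedStatement offset = db.prepareStatement(
                     "UPDATE job_offset SET offset = ? WHERE job = 'audit-aggregator'")) {
                for (long[] agg : aggregates) {
                    data.setLong(1, agg[0]);
                    data.setLong(2, agg[1]);
                    data.addBatch();
                }
                data.executeBatch();
                offset.setLong(1, newOffset);
                offset.executeUpdate();
                db.commit();   // both become visible atomically
            } catch (SQLException e) {
                db.rollback(); // neither data nor offset is saved
                throw e;
            }
        }
    }
}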

Page 39: Growing into a proactive Data Platform

Audit Aggregator job
Revised Design

[Diagram: data is read from HDFS and aggregated per audit bulk (∑#1 = N1, ∑#2 = N2, ∑#3 = N3); the aggregates and the offset are collected & saved to the DB together.]

Bulk    | Tier | TS    | Count
xyz:123 | WRPA | 08:34 | 750
xyz:123 | DSPT | 09:05 | 405
xyz:123 | DSPT | 10:13 | 250

Page 40: Growing into a proactive Data Platform

Audit Aggregator job
Why Spark

• Precedent - Spark Streaming for online auditing
• We see our future with Spark
• Cluster utilization
• Performance:
  – In-memory computation
  – Supports multiple shuffles
  – Unified data processing: batch/streaming

Page 41: Growing into a proactive Data Platform

Implementation
Auditing architecture

[Auditing architecture diagram, repeated as a section divider.]

Page 42: Growing into a proactive Data Platform

Implementation
Data Freshness

• End-to-end latency assessment
• Freshness per criteria
• Output - various stats

Page 43: Growing into a proactive Data Platform

Freshness job
Design

[Diagram: events are read from HDFS into a Map phase, then a Reduce phase; grouping criteria are Total / LP Customer / Event Type, and each group's output is Min, Max, Avg, Buckets, Count.]

Page 44: Growing into a proactive Data Platform

Freshness job
Mechanism

• Driver
  – Collects LP events from HDFS
• Map
  – Computes freshness latencies
  – Segments events per criteria by generating a composite key (see the sketch below)
• Reduce
  – Computes count, min, max, avg and buckets
  – Writes stats to HDFS
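The same mechanism sketched as plain Hadoop MapReduce, with tab-separated input standing in for the real Avro events; the composite key encodes which grouping criterion a record belongs to:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FreshnessJob {

    public static class FreshnessMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical layout: creation, insertion, customer, eventType.
            String[] f = line.toString().split("\t");
            long latencyMin = (Long.parseLong(f[1]) - Long.parseLong(f[0])) / 60_000;
            LongWritable latency = new LongWritable(latencyMin);
            // One composite key per grouping criterion.
            ctx.write(new Text("total"), latency);
            ctx.write(new Text("customer|" + f[2]), latency);
            ctx.write(new Text("eventType|" + f[3]), latency);
        }
    }

    public static class FreshnessReducer
            extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> latencies, Context ctx)
                throws IOException, InterruptedException {
            long count = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
            long[] buckets = new long[4]; // 0-30, 30-60, 60-120, 120-inf minutes
            for (LongWritable l : latencies) {
                long m = l.get();
                count++; sum += m;
                min = Math.min(min, m); max = Math.max(max, m);
                buckets[m < 30 ? 0 : m < 60 ? 1 : m < 120 ? 2 : 3]++;
            }
            ctx.write(key, new Text(String.format(
                "count=%d min=%d max=%d avg=%.1f buckets=%d/%d/%d/%d",
                count, min, max, (double) sum / count,
                buckets[0], buckets[1], buckets[2], buckets[3])));
        }
    }
}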

Page 45: Growing into a proactive Data Platform

Freshness job
Output usage

Page 46: Growing into a proactive Data Platform

Hadoop Platform
Overcoming Pitfalls

• Our data model is built over Avro
• Avro comes with schema evolution
• Avro data is stored along with its schema
• High model-modification rate
• LOBs' schema changes are synchronized Producer → Consumer

Page 47: Growing into a proactive Data Platform

Hadoop Platform
Overcoming Pitfalls

• An MR/Spark job must be recompiled per schema revision when using SpecificRecord
• Using GenericRecord removes the burden of recompiling each time the schema changes (see the sketch below)
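For example, reading with GenericRecord binds fields by name against the schema embedded in each Avro file, so a schema revision does not force a recompile; the field names below are illustrative:

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class GenericAvroReader {
    public static void main(String[] args) throws IOException {
        File file = new File(args[0]);
        // No generated classes: the reader uses the schema stored in the file.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(file, datumReader)) {
            for (GenericRecord event : reader) {
                // Works for any schema revision that still carries these fields.
                Object auditHeader = event.get("auditHeader");
                Object creationTime = event.get("creationTime");
                System.out.println(auditHeader + " @ " + creationTime);
            }
        }
    }
}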

Page 48: Growing into a proactive Data Platform

Implementation
Auditing architecture

[Auditing architecture diagram, repeated as a section divider.]

Page 49: Growing into a proactive Data Platform

THANK YOU!

We are hiring

Page 50: Growing into a proactive Data Platform

YouTube.com/LivePersonDev

Twitter.com/LivePersonDev

Facebook.com/LivePersonDev

Slideshare.net/LivePersonDev