DLD. Tel-Aviv. 2015
Making Scale a Non-Issue for Real-Time Data Apps
Vladi Feigin, LivePerson
Kobi Salant, LivePerson
Agenda
Intro
About LivePerson
Digital Engagements
Call Center Use Case
Architecture Zoom-In
Bio
Vladi Feigin
System Architect at LivePerson
18 years in software development
Interests: distributed computing, data, analytics, and martial arts
Bio
Kobi Salant
Data Platform Tech Lead at LivePerson
25 years in software development
Interests: application performance, traveling, and coffee
LivePerson
We do Digital Engagements
Agile and very technology-driven
A real Big Data and Analytics company
A really cool place to work
One of the SaaS pioneers
6 Data Centers across the world
Founded in 1995, a public company since 2000 (NASDAQ: LPSN)
More than 18,000 customers worldwide
More than 1000 employees
LivePerson technology stack
We are Big Data
1.4 Million concurrent visits
1 Million events per second
2 billion site visits per month
27 million live engagements per month
Data freshness SLA (RT flow): up to 5 seconds
(Diagram: visitors engaging with agents.)
Call Center Operating
Digital engagement requires operating a call center in the most efficient way
How to operate a call center most efficiently? Provide operational metrics… in real time
What are the challenges? Huge scale, load peaks, real-time calculations, high data freshness SLA
Call Center Operating
Architecture. Real-Time data flow
(Diagram: producers — agent, session, chat, conversation, and others — publish to two Kafka topics. Storm topologies consume the fast topic and write to Cassandra, ElasticSearch, and CouchBase, which are exposed through an API to custom apps. The consistent topic feeds the batch layer (Hadoop).)
Chat History. Example
(Diagram: the same producers publish to Kafka. A Storm topology consumes the fast topic and writes to ElasticSearch with very low latency, covering 99.5% of the data; an MR job processes the consistent topic with high latency, covering 99.999% of the data. Both are served through the API.)
Data Producers. Requirements
Real time
“Five nines” persistence
Small footprint
No interference with the service
Multiple producers & platforms
Monolithic to service oriented: many more services
Data Producers. Lessons learned
Hundreds of services
Complex rollouts
Minimal logic, to avoid painful fixes
Audit streaming? Split to buckets
Real time and “five nines” persistence are incompatible
Data Producers. Flow
(Diagram: for the fast topic, the producer sends the message to Kafka directly. For the consistent topic, the producer first persists the message to a local file, and the in-house Kafka Bridge then sends it to Kafka in buckets. This provides Kafka resilience for both real-time and offline consumers.)
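The consistent-topic flow above (persist the message to local disk first, then let a bridge send it to Kafka) can be sketched as follows. This is a minimal illustration, not the actual in-house implementation: the class names are hypothetical, and the Kafka send is simulated by a plain callback.

```python
import json
import os


class DiskBackedProducer:
    """Sketch of the consistent-topic flow: every message is persisted
    to a local spool file before any network I/O, so it survives a
    Kafka outage or a producer crash ("five nines" persistence)."""

    def __init__(self, spool_path):
        self.spool_path = spool_path

    def send(self, topic, event):
        line = json.dumps({"topic": topic, "event": event})
        with open(self.spool_path, "a") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable before we report success


class KafkaBridge:
    """Replays spooled messages; stands in for the in-house Kafka Bridge."""

    def __init__(self, spool_path, send_fn):
        self.spool_path = spool_path
        self.send_fn = send_fn  # e.g. a real Kafka producer's send()

    def drain(self):
        sent = 0
        with open(self.spool_path) as f:
            for line in f:
                msg = json.loads(line)
                self.send_fn(msg["topic"], msg["event"])
                sent += 1
        open(self.spool_path, "w").close()  # truncate after replay
        return sent
```

The design choice here is the trade-off named on the previous slide: the fast topic skips the disk hop for latency, while the consistent topic pays the disk write to get durability.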
Data Model Framework
Why Avro:
Schema-based evolution
Performance: untagged bytes
HDFS ecosystem support
Lessons learned:
Schema evolution breaks
Big schemas (ours is over 65k) are not recommended
Avoid deep nesting and multiple unions
You need a framework
Chaos: non-schema, space-delimited
Order: Avro schema
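The "schema evolution breaks" lesson comes down to default values. The sketch below is a simplified Python stand-in for Avro's schema-resolution rules (the field names are illustrative): a reader with a newer schema can fill missing fields from defaults, but a new field without a default cannot be resolved against old data.

```python
# Writer (old) and reader (new) schemas; field names are made up.
WRITER_V1 = {"fields": ["visitor_id", "agent_id"]}
READER_V2 = {
    "fields": ["visitor_id", "agent_id", "channel"],
    "defaults": {"channel": "chat"},
}


def resolve(record, reader_schema):
    """Simplified stand-in for Avro schema resolution."""
    out = dict(record)
    for field in reader_schema["fields"]:
        if field not in out:
            if field in reader_schema.get("defaults", {}):
                out[field] = reader_schema["defaults"][field]
            else:
                # This is where evolution "breaks": a new field with
                # no default cannot be read against old data.
                raise ValueError("no default for new field: " + field)
    return out
```

In real Avro the same rule applies per branch of a union, which is one reason the talk warns against deep nesting and multiple unions.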
Framework Flow
1. Event is created according to Avro schema version 3.5
2. Schema is registered in the repository (once)
3. The value 3.5 is written to the header
4. Event is encoded with schema version 3.5 and added to the message
5. Message is sent to Kafka
6. Message is read by a consumer
7. Header is read from the message
8. Schema is retrieved from the repository according to the schema version
9. Event is decoded using the proper Avro schema
10. Decoded event is processed
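The framework flow can be sketched end to end. This is an assumption-laden toy, not the production framework: the repository is an in-memory dict, the version header is packed as two bytes, and JSON stands in for the Avro encoding.

```python
import json
import struct

# Hypothetical in-memory schema repository: version -> schema.
SCHEMA_REPOSITORY = {}


def register(version, schema):
    """Step 2: register the schema in the repository (once)."""
    SCHEMA_REPOSITORY[version] = schema


def encode(version, event):
    """Steps 3-4: write the version (e.g. 3.5 as major/minor bytes)
    into the header, then append the encoded event."""
    major, minor = (int(p) for p in version.split("."))
    return struct.pack("BB", major, minor) + json.dumps(event).encode()


def decode(message):
    """Steps 7-9: read the header, fetch the schema from the
    repository, and decode the event with it."""
    major, minor = struct.unpack("BB", message[:2])
    version = f"{major}.{minor}"
    schema = SCHEMA_REPOSITORY[version]
    event = json.loads(message[2:])
    return version, schema, event
```

The key point the flow makes is that the message carries only the schema *version*, not the schema itself, which keeps the per-message footprint small even with a large schema.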
Apache Kafka
More than 15 billion events a day
More than 1 million events per second
Hundreds of producers & consumers
Why Kafka?
Scales where traditional MQs fail
Industry standard for big data log messaging
Reliable, flexible, and easy to use
Deployment:
15 clusters across the world
Our biggest cluster has 8 nodes holding more than 6 TB (Avro + Kafka compression)
Maximum retention of 72 hours
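The 72-hour retention above maps to broker settings along these lines (a sketch of a `server.properties` fragment; exact values are deployment-specific and not taken from the talk):

```properties
# Keep log segments for at most 72 hours
log.retention.hours=72
# Keep whatever compression codec the producers used (Avro payloads + Kafka compression)
compression.type=producer
```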
Apache Kafka. Lessons Learned
Scale horizontally for hardware resources and vertically for throughput
Watch trends in network, IO, and Kafka's JMX statistics (partitions, servers, bytes in)
Apache Kafka. Lessons Learned cont.
Know your data and message sizes:
Large messages can break you
Data growth can overfill your capacity
Set the right configuration
Adding or removing a broker is not trivial
Decide on single or multiple topics
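"Data growth can overfill your capacity" is easy to check up front. A back-of-the-envelope capacity formula, with illustrative default values (the replication factor and compression ratio here are assumptions, not LivePerson's figures):

```python
def required_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_hours,
                        replication_factor=2, compression_ratio=0.25):
    """Estimate the disk a Kafka cluster needs for a given retention.

    compression_ratio is compressed/raw size (0.25 means 4:1).
    """
    raw = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    return int(raw * compression_ratio * replication_factor)
```

For example, 1,000 msgs/s at 1 KB each with 72-hour retention, replication factor 2, and 4:1 compression works out to roughly 130 GB, so growth in either rate or message size multiplies straight through.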
Apache Storm
Why Storm?
Growing community with good integration with Kafka
At the time, it was the leading product
Easy development and customization
The POC was successful
Deployment:
6 clusters across the world
Our biggest cluster has more than 30 nodes
20 topologies on a single cluster
Uptime of months for a single topology
Apache Storm. Typical topology
(Diagram: a Kafka spout fetches from the fast topic and emits tuples to a filter bolt, which emits to a writer bolt; acks flow back to the spout, the writer bolt writes to the store, and offsets are committed to Zookeeper.)
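The ack path in the topology above is what gives Storm its at-least-once guarantee: the spout keeps each emitted tuple pending until every downstream bolt acks it, and replays it on failure. A minimal sketch of that spout-side bookkeeping (the class and method names mirror Storm's spout API but are simplified):

```python
from collections import deque


class AckingSpout:
    """Toy model of Storm's ack mechanism: tuples stay pending until
    acked; failed tuples are replayed (at-least-once delivery)."""

    def __init__(self, source):
        self.source = deque(source)  # stands in for the Kafka topic
        self.pending = {}            # tuple id -> tuple awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.source:
            return None
        self.next_id += 1
        tup = self.source.popleft()
        self.pending[self.next_id] = tup
        return self.next_id, tup

    def ack(self, tup_id):
        # Fully processed: safe to commit the offset past this tuple.
        del self.pending[tup_id]

    def fail(self, tup_id):
        # Replay: put the tuple back at the front of the queue.
        self.source.appendleft(self.pending.pop(tup_id))
```

In real Storm the acker bolt tracks the whole tuple tree, not just the root tuple, but the contract is the same: nothing is dropped without an ack, which is why the "Use Ack" lesson pairs with "know your bolts must return a timely answer".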
Apache Storm. Lessons learned
Develop an SDK and educate R&D
Where did my topology run last week? What is my capacity over time?
Know your bolts: they must return a timely answer
Coding is easy, performance is hard
Use isolation
Monitor bolt capacity
Apache Storm. Lessons learned cont.
Use local shuffling
Use acking
(Diagram: two workers, A and B, each running a Kafka spout, filter bolt, writer bolt, acker bolt, and comm bolt. With local shuffling, tuples are emitted locally within a worker instead of crossing between workers.)
Summary
No one-size-fits-all solution
Ask product for a clearly defined SLA
Separate fast and consistent data flows - they don't merge!
Use a schema for the data model - keep it flat and small
Kafka rules! It's reliable and fast - use it
Storm has its toll. For some use cases we would use Spark Streaming today
THANK YOU!
We are hiring
http://www.liveperson.com/company/careers
Q/A
YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev
Slideshare.net/LivePersonDev