DLD. Tel-Aviv. 2015
Making Scale a Non-Issue for Real-Time Data Apps
Vladi Feigin, LivePerson
Kobi Salant, LivePerson
Agenda
Intro
About LivePerson
Digital Engagements
Call Center Use Case
Architecture Zoom-In
Bio
Vladi Feigin
System Architect at LivePerson
18 years in software development
Interests: distributed computing, data, analytics, and martial arts
Bio
Kobi Salant
Data Platform Tech Lead at LivePerson
25 years in software development
Interests: application performance, traveling, and coffee
LivePerson
We do Digital Engagements
Agile and very technology-driven
A real Big Data and Analytics company
A really cool place to work
One of the SaaS pioneers
6 Data Centers across the world
Founded in 1995, a public company since 2000 (NASDAQ: LPSN)
More than 18,000 customers worldwide
More than 1000 employees
LivePerson technology stack
We are Big Data
1.4 Million concurrent visits
1 Million events per second
2 billion site visits per month
27 million live engagements per month
Data freshness SLA (RT flow): up to 5 seconds
(Diagram: visitors engaging with agents.)
Call Center Operating
Digital engagement requires operating a call center in the most efficient way
How to operate a call center most efficiently? Provide operational metrics… in real time
What are the challenges? Huge scale, load peaks, real-time calculations, high data freshness SLA
Call Center Operating
Architecture. Real-Time data flow
(Diagram: producers — agent, session, chat, conversation, and others — publish to two Kafka topics. Storm topologies consume the fast topic and write to Cassandra, ElasticSearch, and CouchBase, which are exposed through an API to custom apps. The consistent topic feeds the batch layer (Hadoop).)
Chat History. Example
(Diagram: the same producers publish to Kafka. A Storm topology consumes the fast topic and writes to ElasticSearch with very low latency, covering 99.5% of the data; an MR job processes the consistent topic with high latency, covering 99.999% of the data. Both are served through the API.)
Data Producers. Requirements
Real time
“Five nines” persistence
Small footprint
No interference with the service
Multiple producers & platforms
Monolithic to service oriented: many more services
Data Producers. Lessons learned
Hundreds of services
Complex rollouts
Minimal logic, to avoid painful fixes
Audit streaming? Split to buckets
Real time and “five nines” persistence are incompatible
Data Producers. Flow
(Diagram: for the fast topic, the producer sends the message to Kafka directly. For the consistent topic, the producer first persists the message to a local file, and the in-house Kafka Bridge then sends it to Kafka in buckets. This provides Kafka resilience for both real-time and offline consumers.)
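The consistent-topic flow above (persist the message to local disk first, then let a bridge send it to Kafka) can be sketched as follows. This is a minimal illustration, not the actual in-house implementation: the class names are hypothetical, and the Kafka send is simulated by a plain callback.

```python
import json
import os


class DiskBackedProducer:
    """Sketch of the consistent-topic flow: every message is persisted
    to a local spool file before any network I/O, so it survives a
    Kafka outage or a producer crash ("five nines" persistence)."""

    def __init__(self, spool_path):
        self.spool_path = spool_path

    def send(self, topic, event):
        line = json.dumps({"topic": topic, "event": event})
        with open(self.spool_path, "a") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # durable before we report success


class KafkaBridge:
    """Replays spooled messages; stands in for the in-house Kafka Bridge."""

    def __init__(self, spool_path, send_fn):
        self.spool_path = spool_path
        self.send_fn = send_fn  # e.g. a real Kafka producer's send()

    def drain(self):
        sent = 0
        with open(self.spool_path) as f:
            for line in f:
                msg = json.loads(line)
                self.send_fn(msg["topic"], msg["event"])
                sent += 1
        open(self.spool_path, "w").close()  # truncate after replay
        return sent
```

The design choice here is the trade-off named on the previous slide: the fast topic skips the disk hop for latency, while the consistent topic pays the disk write to get durability.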
Data Model Framework
Why Avro:
Schema-based evolution
Performance: untagged bytes
HDFS ecosystem support
Lessons learned:
Schema evolution breaks
Big schemas (ours is over 65k) are not recommended
Avoid deep nesting and multiple unions
You need a framework
Chaos: non-schema, space-delimited
Order: Avro schema
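The "schema evolution breaks" lesson comes down to default values. The sketch below is a simplified Python stand-in for Avro's schema-resolution rules (the field names are illustrative): a reader with a newer schema can fill missing fields from defaults, but a new field without a default cannot be resolved against old data.

```python
# Writer (old) and reader (new) schemas; field names are made up.
WRITER_V1 = {"fields": ["visitor_id", "agent_id"]}
READER_V2 = {
    "fields": ["visitor_id", "agent_id", "channel"],
    "defaults": {"channel": "chat"},
}


def resolve(record, reader_schema):
    """Simplified stand-in for Avro schema resolution."""
    out = dict(record)
    for field in reader_schema["fields"]:
        if field not in out:
            if field in reader_schema.get("defaults", {}):
                out[field] = reader_schema["defaults"][field]
            else:
                # This is where evolution "breaks": a new field with
                # no default cannot be read against old data.
                raise ValueError("no default for new field: " + field)
    return out
```

In real Avro the same rule applies per branch of a union, which is one reason the talk warns against deep nesting and multiple unions.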
Framework Flow
1. Event is created according to Avro schema version 3.5
2. Schema is registered in the repository (once)
3. The value 3.5 is written to the header
4. Event is encoded with schema version 3.5 and added to the message
5. Message is sent to Kafka
6. Message is read by a consumer
7. Header is read from the message
8. Schema is retrieved from the repository according to the schema version
9. Event is decoded using the proper Avro schema
10. Decoded event is processed
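The framework flow can be sketched end to end. This is an assumption-laden toy, not the production framework: the repository is an in-memory dict, the version header is packed as two bytes, and JSON stands in for the Avro encoding.

```python
import json
import struct

# Hypothetical in-memory schema repository: version -> schema.
SCHEMA_REPOSITORY = {}


def register(version, schema):
    """Step 2: register the schema in the repository (once)."""
    SCHEMA_REPOSITORY[version] = schema


def encode(version, event):
    """Steps 3-4: write the version (e.g. 3.5 as major/minor bytes)
    into the header, then append the encoded event."""
    major, minor = (int(p) for p in version.split("."))
    return struct.pack("BB", major, minor) + json.dumps(event).encode()


def decode(message):
    """Steps 7-9: read the header, fetch the schema from the
    repository, and decode the event with it."""
    major, minor = struct.unpack("BB", message[:2])
    version = f"{major}.{minor}"
    schema = SCHEMA_REPOSITORY[version]
    event = json.loads(message[2:])
    return version, schema, event
```

The key point the flow makes is that the message carries only the schema *version*, not the schema itself, which keeps the per-message footprint small even with a large schema.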
Apache Kafka
More than 15 billion events a day
More than 1 million events per second
Hundreds of producers & consumers
Why Kafka?
Scales where traditional MQs fail
Industry standard for big data log messaging
Reliable, flexible, and easy to use
Deployment:
15 clusters across the world
Our biggest cluster has 8 nodes holding more than 6 TB (Avro + Kafka compression)
Maximum retention of 72 hours
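The 72-hour retention above maps to broker settings along these lines (a sketch of a `server.properties` fragment; exact values are deployment-specific and not taken from the talk):

```properties
# Keep log segments for at most 72 hours
log.retention.hours=72
# Keep whatever compression codec the producers used (Avro payloads + Kafka compression)
compression.type=producer
```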
Apache Kafka. Lessons Learned
Scale horizontally for hardware resources and vertically for throughput
Watch trends in network, IO, and Kafka's JMX statistics (partitions, servers, bytes in)
Apache Kafka. Lessons Learned cont.
Know your data and message sizes:
Large messages can break you
Data growth can overfill your capacity
Set the right configuration
Adding or removing a broker is not trivial
Decide on single or multiple topics
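"Data growth can overfill your capacity" is easy to check up front. A back-of-the-envelope capacity formula, with illustrative default values (the replication factor and compression ratio here are assumptions, not LivePerson's figures):

```python
def required_disk_bytes(msgs_per_sec, avg_msg_bytes, retention_hours,
                        replication_factor=2, compression_ratio=0.25):
    """Estimate the disk a Kafka cluster needs for a given retention.

    compression_ratio is compressed/raw size (0.25 means 4:1).
    """
    raw = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    return int(raw * compression_ratio * replication_factor)
```

For example, 1,000 msgs/s at 1 KB each with 72-hour retention, replication factor 2, and 4:1 compression works out to roughly 130 GB, so growth in either rate or message size multiplies straight through.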
Apache Storm
Why Storm?
Growing community with good integration with Kafka
At the time, it was the leading product
Easy development and customization
The POC was successful
Deployment:
6 clusters across the world
Our biggest cluster has more than 30 nodes
20 topologies on a single cluster
Uptime of months for a single topology
Apache Storm. Typical topology
(Diagram: a Kafka spout fetches from the fast topic and emits tuples to a filter bolt, which emits to a writer bolt; acks flow back to the spout, the writer bolt writes to the store, and offsets are committed to Zookeeper.)
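The ack path in the topology above is what gives Storm its at-least-once guarantee: the spout keeps each emitted tuple pending until every downstream bolt acks it, and replays it on failure. A minimal sketch of that spout-side bookkeeping (the class and method names mirror Storm's spout API but are simplified):

```python
from collections import deque


class AckingSpout:
    """Toy model of Storm's ack mechanism: tuples stay pending until
    acked; failed tuples are replayed (at-least-once delivery)."""

    def __init__(self, source):
        self.source = deque(source)  # stands in for the Kafka topic
        self.pending = {}            # tuple id -> tuple awaiting ack
        self.next_id = 0

    def next_tuple(self):
        if not self.source:
            return None
        self.next_id += 1
        tup = self.source.popleft()
        self.pending[self.next_id] = tup
        return self.next_id, tup

    def ack(self, tup_id):
        # Fully processed: safe to commit the offset past this tuple.
        del self.pending[tup_id]

    def fail(self, tup_id):
        # Replay: put the tuple back at the front of the queue.
        self.source.appendleft(self.pending.pop(tup_id))
```

In real Storm the acker bolt tracks the whole tuple tree, not just the root tuple, but the contract is the same: nothing is dropped without an ack, which is why the "Use Ack" lesson pairs with "know your bolts must return a timely answer".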
Apache Storm. Lessons learned
Develop an SDK and educate R&D
Where did my topology run last week? What is my capacity over time?
Know your bolts: they must return a timely answer
Coding is easy, performance is hard
Use isolation
Monitor bolt capacity
Apache Storm. Lessons learned cont.
Use local shuffling
Use acking
(Diagram: two workers, A and B, each running a Kafka spout, filter bolt, writer bolt, acker bolt, and comm bolt. With local shuffling, tuples are emitted locally within a worker instead of crossing between workers.)
Summary
No one-size-fits-all solution
Ask product for a clearly defined SLA
Separate fast and consistent data flows - they don't merge!
Use a schema for the data model - keep it flat and small
Kafka rules! It's reliable and fast - use it
Storm has its toll. For some use cases we would use Spark Streaming today
THANK YOU!
We are hiring
http://www.liveperson.com/company/careers
Q/A
YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev
Slideshare.net/LivePersonDev