Real-time Analytics with Spark

Maciej Dabrowski, Chief Data Scientist, Altocloud

Galway Data Meetup, 2015-02-03

MEETS A SMALL STARTUP

source: https://media.licdn.com/mpr/mpr/p/1/005/0a0/167/2f98d60.jpg

Altocloud

‣ We built predictive communications software that uses analytics to make customer interactions and experience better

Monitoring live users

ANALYTICS

source: http://olap.com/

Use Case 1: Real-time analytics

‣ Real-time for us means an answer within 1-5 s

‣ Q: How many customers are currently online?

‣ Q: How many chats/calls are taking place at the moment?

‣ Q: What is the utilisation of my customer support agents?

Use Case 2: Reporting

‣ Q: How many calls were offered in the last week?

‣ Q: What is the acceptance rate of my chat offers?

Use Case 3: Predictive Analytics

‣ Q: Which customers currently on my site should I engage?

Technical challenges

‣ Scalability

‣ Limited resources

‣ Various analytics use cases

Real-time analytics with Hadoop

source: http://barbarashdwallpapers.com/funny-elephant-wallpapers/

Altocloud Platform

[Architecture diagram. Layers: APIs, message queues, processing layer, storage layer, querying layer. Components shown: front-end APIs, back-end APIs, apps, Kafka, RabbitMQ, Spark, Spark Streaming, Cassandra, HDFS, MongoDB.]

Altocloud Data Platform

[Architecture diagram. Layers: data sources, message queues, processing layer, storage layer, querying layer. Components shown: front-end APIs, apps, MongoDB oplog, Kafka, RabbitMQ, Spark, Spark Streaming, Cassandra, HDFS, MongoDB.]

Why Spark

‣ One code base for streaming and batch processing

‣ Rich API in Scala/Python/Java

‣ Fast for iterative algorithms (important for ML)

‣ Growing community

‣ The concept of a micro-batch

‣ Nicely integrates with Kafka and Cassandra

‣ Fairly easy setup
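
The micro-batch idea from the list above can be sketched in plain Python, without Spark itself: a stream is cut into short fixed-length batches, and the same function then runs on each batch or on the full data set. The names `micro_batches` and `count_events` and the sample events are assumptions made for illustration only.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) pairs into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Events whose timestamps fall in the same window share a batch index.
        batches[int(timestamp // batch_seconds)].append(payload)
    return [batches[index] for index in sorted(batches)]


def count_events(payloads):
    """The 'business logic': identical for a micro-batch and for a full batch."""
    return len(payloads)


# Hypothetical events: (seconds since start, payload).
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.6, "e")]

batches = micro_batches(events, batch_seconds=1)
per_batch = [count_events(batch) for batch in batches]        # streaming view
total = count_events([payload for _, payload in events])      # batch view
```

Because `count_events` knows nothing about how its input was produced, it serves both the streaming and the batch path, which is the "one code base" point in miniature.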

Spark components

Word count in Spark

‣ Hadoop

‣ Spark
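
The code shown on this slide did not survive in the transcript. As a rough stand-in, the sketch below mirrors the usual flatMap / map / reduceByKey shape of a Spark word count in plain Python, so it runs without a cluster. It is not the author's code; the PySpark chain in the comment is the commonly used form, and `input.txt` is a hypothetical path.

```python
from collections import Counter
from itertools import chain

# Commonly used PySpark form, for comparison (needs a SparkContext `sc`):
#   counts = (sc.textFile("input.txt")               # hypothetical path
#               .flatMap(lambda line: line.split())
#               .map(lambda word: (word, 1))
#               .reduceByKey(lambda a, b: a + b))


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # flatMap step: every line becomes zero or more words.
    words = chain.from_iterable(line.split() for line in lines)
    # map + reduceByKey step: add 1 per occurrence, summed per word.
    counts = Counter()
    for word in words:
        counts[word] += 1
    return dict(counts)


lines = ["to be or not to be", "to see or not to see"]
result = word_count(lines)
```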

What about something more useful?

‣ Example: user event aggregation stored in Cassandra

‣ Still much better than Hadoop!

Counting users currently online

‣ Input: user activity events (e.g. a page view)

‣ Users of multiple businesses are online at the same time

‣ Scale: from hundreds to hundreds of thousands of activities per second

‣ Response time under 5 s

‣ A perfect use case for Spark Streaming

Data source: Kafka

‣ Pub-sub message broker

‣ Fast: hundreds of MB/s on a single broker

‣ Scalable: partitioned data streams

‣ Durable: messages persisted and replicated

‣ Distributed: strong durability and fault tolerance

‣ Downside: requires ZooKeeper

see https://kafka.apache.org
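
To make "partitioned data streams" concrete, here is a small sketch of key-based partitioning: messages with the same key always land in the same partition, so partitions can be consumed in parallel while per-key order is kept. This is an illustration only. `zlib.crc32` is a stand-in hash chosen for this sketch, not the function Kafka clients use, and `partition_for` and `route` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a message key to a partition index (stand-in hash, not Kafka's)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Distribute (key, value) messages over partitions, keeping arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append(value)
    return dict(partitions)


# Hypothetical messages keyed by user id.
messages = [("u1", "a"), ("u2", "b"), ("u1", "c")]
routed = route(messages, num_partitions=4)
```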

Spark and Kafka

‣ Kafka with Spark: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

Count users online

‣ Simple count of unique events

‣ Count visit events for unique users
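
The code from this slide is missing in the transcript. One way to picture the second bullet is the plain-Python sketch below, which counts distinct users with a visit event per business within one batch of events. The event fields (`type`, `business_id`, `user_id`) and the sample data are assumptions, not the author's schema.

```python
from collections import defaultdict


def users_online(events):
    """Return {business_id: number of distinct visiting users} for one batch."""
    users = defaultdict(set)
    for event in events:
        if event["type"] == "visit":
            # A set keeps each user once, however many visit events they send.
            users[event["business_id"]].add(event["user_id"])
    return {business: len(ids) for business, ids in users.items()}


# Hypothetical batch of events.
batch = [
    {"type": "visit", "business_id": "acme", "user_id": "u1"},
    {"type": "visit", "business_id": "acme", "user_id": "u1"},
    {"type": "visit", "business_id": "acme", "user_id": "u2"},
    {"type": "click", "business_id": "acme", "user_id": "u3"},
    {"type": "visit", "business_id": "globex", "user_id": "u1"},
]
online = users_online(batch)
```

The per-business sets grow with the number of distinct users, which is the memory cost the next slide addresses.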

Sets can take a lot of memory!

‣ Twitter Algebird to the rescue!

‣ HyperLogLog - a probabilistic data structure saving a lot of memory!

‣ https://github.com/twitter/algebird
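
To show why HyperLogLog helps, here is a minimal pure-Python sketch of the idea. It is not Algebird's API: the class, its parameters, and the choice of SHA-1 as the hash are assumptions made for this illustration. Instead of storing every user id, it keeps a fixed number of small registers and estimates the distinct count from them, and two sketches can be merged by taking the register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p fixed-size registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Combine two sketches; the result counts the union of both inputs."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical user ids.
full = HyperLogLog()
for i in range(10000):
    full.add(f"user-{i}")
estimate = full.estimate()
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, and the register count stays the same whether 10 thousand or 10 million users are added. The merge property is what makes the structure convenient for combining partial results computed on different workers or in different micro-batches.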

Storing your results

‣ Easy to set up

‣ High availability - no master

‣ Great performance

‣ CQL - SQL-like querying

‣ Great support and bug-free drivers from DataStax

‣ Key: design your schema around your queries

see https://cassandra.apache.org
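
As a hedged illustration of "design your schema around your queries", the sketch below starts from one question from the real-time use case (how many users are online right now for a given business?) and shapes a table so that the question is a single-partition read. The keyspace, table, and column names are hypothetical and are not taken from the talk; the CQL is held in Python strings so the example stays self-contained.

```python
KEYSPACE = "analytics"  # hypothetical keyspace name

# Partition by business, cluster by time bucket (newest first), so the latest
# count for one business is the first row of a single partition.
CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS {KEYSPACE}.users_online (
    business_id  text,
    bucket       timestamp,
    online_users bigint,
    PRIMARY KEY ((business_id), bucket)
) WITH CLUSTERING ORDER BY (bucket DESC)
"""

# The query the table was designed for ('?' is a prepared-statement placeholder).
LATEST_COUNT = (
    f"SELECT online_users FROM {KEYSPACE}.users_online "
    "WHERE business_id = ? LIMIT 1"
)

# The write issued once per micro-batch and business.
INSERT_COUNT = (
    f"INSERT INTO {KEYSPACE}.users_online (business_id, bucket, online_users) "
    "VALUES (?, ?, ?)"
)
```

A different question, such as a weekly report across all businesses, would usually get its own table rather than a more elaborate query against this one.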

Store data in Cassandra

‣ DataStax driver is very easy to use

‣ Save our results to Cassandra

source: http://top1walls.com

Spark Streaming

‣ A Spark Streaming job performs two major tasks:

• data receiving

• data processing

‣ A receiver always takes one core

‣ Each job therefore needs one core for its receiver plus at least one for processing, so technically you need 2N cores to run N streaming jobs

‣ Not a big deal in production, but what about testing?
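
The 2N rule is simple arithmetic, sketched below. The function name and its parameters are assumptions for illustration; the defaults encode the slide's case of one receiver and at least one processing core per job.

```python
def cores_needed(streaming_jobs, receivers_per_job=1, processing_cores_per_job=1):
    """Minimum cores for N streaming jobs: each receiver pins one core,
    and every job needs at least one more core to process the data."""
    return streaming_jobs * (receivers_per_job + processing_cores_per_job)


one_job = cores_needed(1)      # 1 receiver + 1 processing core
three_jobs = cores_needed(3)   # 2N with N = 3
```

On a developer laptop or a small test instance, three streaming jobs already ask for six cores under this rule, which is the testing problem the slide raises.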

Docker

‣ Containerise your app including all its dependencies

‣ Distribute your app in this standard container

‣ Run it on any machine with Docker

‣ Very lightweight

Spark

‣ AWS example

[Diagram showing a c3.xlarge instance (4 cores: core 1 to core 4) and a c3.large instance (2 cores), with a Spark driver and two Spark executors.]

Spark on Docker

‣ AWS example

[Diagram showing a c3.xlarge instance (4 cores: core 1 to core 4) and a c3.large instance (2 cores), with a Spark driver and Spark executors running in two Docker containers, docker-1 and docker-2, each presenting 4 “cores” (C1 to C4).]

Spark Streaming

‣ Spark Streaming is fast to deploy but tuning is VERY important

‣ The lower the number of tasks, the better (in general)

‣ When reading from Kafka, make sure that you configure the block interval (spark.streaming.blockInterval)

‣ Optimize your jobs when possible - similar jobs can sometimes be merged

‣ Persist your data from the workers, NOT the driver
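
The link between the block interval and the number of tasks can be sketched as arithmetic: with receiver-based input, each receiver cuts incoming data into one block per block interval, and each block becomes roughly one task per batch. The function name and the interval values below are illustrative assumptions, not settings from the talk.

```python
def tasks_per_batch(batch_interval_ms, block_interval_ms, receivers=1):
    """Approximate number of tasks per micro-batch for receiver-based input:
    one block (and so one task) per block interval, per receiver."""
    return receivers * (batch_interval_ms // block_interval_ms)


# Hypothetical settings: a 2 s batch with two different block intervals.
fine_blocks = tasks_per_batch(2000, 200)     # many small tasks
coarse_blocks = tasks_per_batch(2000, 500)   # fewer, larger tasks
```

A larger block interval gives fewer tasks per batch, in line with the "fewer tasks is better in general" advice, as long as there are still enough tasks to keep the available processing cores busy.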

Where do we go from here?

‣ OLAP-type queries using Spark SQL

‣ More advanced performance testing

‣ Detailed unit testing

‣ More batch jobs