Spark Summit EU talk by Bas Geerdink

Post on 16-Apr-2017

Page 1: Spark Summit EU talk by Bas Geerdink

Lambda Architecture in the IoT

Fast Data Analytics with Spark Streaming and MLlib

Bas Geerdink

ING

Page 2

Who am I?

Bas Geerdink

• Chapter Lead in Analytics area at ING

• Master's degree in Artificial Intelligence and Informatics

• Spark Certified Developer

• @bgeerdink

• https://www.linkedin.com/in/geerdink

Page 3

The Internet of …

Page 4

What’s new in the IoT?

• Data

– Streaming data from more sources

• Use cases

– Combining data streams

• Technology

– Fast processing and scalability

• Challenges

– Security: encrypt the sensors/network/server

Page 5

What do we want in the IoT?

• FAST data: process stream of events (sensory data)

• BIG data: process files/tables in batches (static data)

• ONE analytics engine

• ONE querying/visualization tool

• Scalable environment

• Reactive software

Page 6

Lambda Architecture

Source: Nathan Marz (2013)

Page 7

Gamma/Kappa/Omega/… Architecture

• Lambda Criticism:

– Too complex

– Two code bases

– Two data stores

• Alternatives:

– Treat a stream as a mini-batch

– Treat a batch as a stream of records

– Combine batch and stream within one system
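
The first two alternatives can be illustrated with a minimal sketch in plain Python (not Spark code, and not taken from the talk). The helper names `mini_batches`, `as_stream`, and `count_valid`, as well as the `valid` field, are hypothetical; the point is only that one piece of logic can serve both paths.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(stream: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Treat a stream as mini-batches: group records into chunks of `size`."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def as_stream(batch: List[dict]) -> Iterator[dict]:
    """Treat a batch as a stream: replay stored records one at a time."""
    yield from batch


def count_valid(records: Iterable[dict]) -> int:
    """Shared logic (hypothetical): count records flagged as valid."""
    return sum(1 for r in records if r.get("valid", True))


events = [{"id": i, "valid": i % 2 == 0} for i in range(5)]

# Streaming path: the same function applied per mini-batch.
per_batch = [count_valid(b) for b in mini_batches(events, 2)]

# Batch path: the stored data set replayed as a stream of records.
total = count_valid(as_stream(events))
```

Because `count_valid` is written once and used on both paths, this avoids the "two code bases" criticism in this toy setting.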

Page 8

Big Data Reference Architecture

Source: Geerdink (2013)

Page 9

Use case: Smart Parking

Recommend the best car park when driving to Amsterdam

Show the car park with the highest score, determined by

• Batch (every x minutes), data from car parks

• Stream (every x seconds), GPS data of cars

Page 12

Stream Process

• Get car events (GPS data)

• Filter events (business rules)

• Store events

• For each car park in the neighborhood:

– Get feature set (location, capacity, usage, …)

– Combine event with previous events (running sum)

– Update feature set

– Predict score (machine learning) and update database
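
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code from the talk: the `CarEvent` and `CarPark` fields, the `is_valid` rule, the distance `radius`, and the `predict_score` formula (a stand-in for the machine learning model) are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List


@dataclass
class CarEvent:
    car_id: str
    x: float  # position in arbitrary planar units (assumption)
    y: float


@dataclass
class CarPark:
    name: str
    x: float
    y: float
    capacity: int
    usage: int = 0
    nearby_cars: int = 0  # running sum of cars seen in the neighborhood


def is_valid(event: CarEvent) -> bool:
    """Hypothetical business rule: drop events without a car id."""
    return bool(event.car_id)


def predict_score(park: CarPark) -> float:
    """Stand-in for the ML model: free places per approaching car."""
    free = max(park.capacity - park.usage, 0)
    return free / (1 + park.nearby_cars)


def process_event(event: CarEvent, parks: List[CarPark],
                  event_store: List[CarEvent], scores: Dict[str, float],
                  radius: float = 1.0) -> None:
    if not is_valid(event):          # filter events (business rules)
        return
    event_store.append(event)        # store events
    for park in parks:               # each car park in the neighborhood
        if hypot(park.x - event.x, park.y - event.y) <= radius:
            park.nearby_cars += 1    # combine with previous events
            scores[park.name] = predict_score(park)  # predict and update


parks = [CarPark("A", 0.0, 0.0, capacity=100, usage=50),
         CarPark("B", 10.0, 10.0, capacity=100)]
store: List[CarEvent] = []
scores: Dict[str, float] = {}
process_event(CarEvent("car-1", 0.5, 0.0), parks, store, scores)
process_event(CarEvent("", 0.0, 0.0), parks, store, scores)  # filtered out
```

Here the `scores` dictionary plays the role of the database; in a real deployment the loop body would run inside the streaming engine.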

Page 13

Batch Process

• Get car park data

• Clean up (remove old car data)

• For each car park in the data set:

– Get recent set of car data (running sum)

– Update feature set

– Predict score (machine learning) and update database
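
A comparable sketch of the batch steps, again in plain Python and under assumptions: the `(timestamp, car park name)` layout of stored car data, the `max_age` cut-off, and the scoring formula are hypothetical placeholders for the real data model and machine learning model.

```python
from typing import Dict, List, Tuple

# (timestamp in seconds, car park name) pairs stored by the stream layer
# (assumed layout).
CarSighting = Tuple[float, str]


def run_batch(car_parks: Dict[str, Dict[str, int]],
              sightings: List[CarSighting],
              now: float,
              max_age: float = 600.0
              ) -> Tuple[List[CarSighting], Dict[str, float]]:
    # Clean up: keep only car data younger than max_age seconds.
    recent = [(ts, name) for ts, name in sightings if now - ts <= max_age]
    scores: Dict[str, float] = {}
    for name, park in car_parks.items():
        # Recent set of car data for this car park (running sum).
        nearby = sum(1 for _, n in recent if n == name)
        # Update the feature set.
        features = {"free": max(park["capacity"] - park["usage"], 0),
                    "nearby": nearby}
        # Stand-in for the ML model; the result would go to the database.
        scores[name] = features["free"] / (1 + features["nearby"])
    return recent, scores


car_parks = {"A": {"capacity": 100, "usage": 40},
             "B": {"capacity": 50, "usage": 50}}
sightings = [(0.0, "A"), (900.0, "A"), (950.0, "A"), (950.0, "B")]
recent, batch_scores = run_batch(car_parks, sightings, now=1000.0)
```

The returned `recent` list is what would be written back after the clean-up step, and `batch_scores` is what would be written to the score store.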

Page 14

Lambda Architecture

[Diagram. Labels: GPS updates, Capacity updates, Message Broker, File System, Analytics Engine, Data Science Environment, Business Rules Environment, Machine learning models, Business rules, Features, Selection, Scores, Batch, Stream, 1. Get features, 2. Apply rules, 3. Predict score, API, Front-end, Data Scientist, User]

Page 15

Lambda Architecture

[Diagram: same labels as on page 14]

Page 16

Lambda Architecture

[Diagram: same labels as on page 14]

Page 17

Event Processing with Kafka

Page 18

Lambda Architecture

[Diagram: same labels as on page 14]

Page 19

Stream: Data Preparation

Page 20

Lambda Architecture

[Diagram: same labels as on page 14]

Page 21

Stream: Create Recommendation

Page 22

Lambda Architecture

[Diagram: same labels as on page 14]

Page 23

Batch Processing

Page 24

Lambda Architecture

[Diagram: same labels as on page 14]

Page 25

Data Science with Zeppelin

Page 26

Summary

• The IoT requires new architecture principles: in-memory, distributed, reactive and scalable are the new normal

• Lambda/Kappa/Omega reference architectures are designed for combining batch and streaming data flows

• Open source tools such as Kafka, Cassandra, and Spark adhere to these principles and design patterns

Page 27

THANK YOU!

ING is hiring

#SparkSummit