webinar - big data: let's smack - jorg schad

Big Data: Let's SMACK @joerg_schad

You can come and see me at Codemotion Milan 2017!

Talk: NO ONE PUTS Java IN THE

CONTAINER November 10th, 15:10-15.50

http://milan2017.codemotionworld.com/

Jörg SchadSoftware Engineer @Mesosphere

@joerg_schad

@joerg.mesosphere

MapReduce is crunching Data

Ancient Times...

But then business demanded FAST DATA

We need to turn faster!

Today...

Batch Event ProcessingMicro-Batch

Days Hours Minutes Seconds Microseconds

Solves problems using predictive and prescriptive analyticsReports what has happened using descriptive analytics

Predictive User InterfaceReal-time Pricing and Routing Real-time AdvertisingBilling, Chargeback Product recommendations

(Fast) Data Processing

Fast Data Pipeline

EVENTS

Ubiquitous data streams from connected devices

INGEST STOREANALYZE ACT

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

The SMACK Stack

EVENTS

INGEST

Apache Kafka

Apache Spark

ANALYZE

Apache Cassandra

applications

Apache Mesos

Sensors

Devices

Clients

Datacenter

NAIVE APPROACH

Typical Datacentersiloed, over-provisioned servers,

low utilization

Industry Average 12-15% utilization

microservice

Cassandra

Spark/Hadoop

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS

Typical Datacentersiloed, over-provisioned servers,

low utilization

Mesos/ DC/OSautomated schedulers, workload multiplexing onto the

same machines

microservice

Cassandra

Spark/Hadoop

Apache Mesos• A top-level Apache project• A cluster resource negotiator• Scalable to 10,000s of nodes• Fault-tolerant, battle-tested• An SDK for distributed apps• Native Docker support

MESOS: FUNDAMENTAL ARCHITECTURE

Mesos Master

Mesos AgentMesos Agent ServiceCassandra Executor

Cassandra Task

Cassandra Scheduler

Container Scheduler

Spark Scheduler

Spark Executor

Spark Task

Mesos AgentMesos Agent ServiceDocker

Executor

Docker Task

Spark Executor

Spark Task

Two-level Scheduling1. Agents advertise resources to Master2. Master offers resources to Framework3. Framework rejects / uses resources4. Agent reports task status to Master

Datacenter Operating System (DC/OS)

Distributed Systems Kernel (Mesos)

DC/OS ENABLES MODERN DISTRIBUTED APPS

Big Data + Analytics EnginesMicroservices (in containers)

Streaming

Machine Learning

Analytics

Functions & Logic

Search

Time Series

SQL / NoSQL

Databases

Modern App Components

Any Infrastructure (Physical, Virtual, Cloud)

The SMACK Stack

EVENTS

INGEST

Apache Kafka

Apache Spark

ANALYZE

Apache Cassandra

applications

Apache Mesos

Sensors

Devices

Clients

The SMACK Stack

EVENTS

applications

The SMACK Stack

EVENTS

applications

DATA PROCESSING AT HYPERSCALE

EVENTS

INGEST

Apache Kafka

Apache Spark

ANALYZE

Apache Cassandra

applications

Sensors

Devices

Clients

MESSAGE QUEUES

Apache Kafka ØMQ, RabbitMQ, Disque (Redis-based), etc. fluentd, Logstash, Flume Akka streams cloud-only:

AWS SQS Google Cloud Pub/Sub

APACHE KAFKA

High-throughput, distributed, persistent publish-subscribe messaging system

Originates from LinkedIn Typically used as buffer/de-coupling

layer in online stream processing

fluentd

● Scalability● Message Type● Log vs …

● Delivery Guarantees/Message durability

● Routing Capabilities● Failover● Community● Mesos Support ;-)

HOW TO CHOOSE?

DELIVERY GUARANTEES

At most once—Messages may be lost but are never redelivered.

At least once—Messages are never lost but may be redelivered.

Exactly once—this is what people actually want, each message is delivered once and only once.

Murphy’s Law of Distributed Systems:

Anything that can go wrong, will go wrong … partially!

RoutingSimple Pipes Routing

EVENTS

INGEST

Apache Kafka

Apache Spark

ANALYZE

Apache Cassandra

applications

Sensors

Devices

Clients

STREAM PROCESSING• Apache Storm • Apache Spark • Apache Samza • Apache Flink • Apache Apex • Concord • cloud-only: AWS Kinesis,

Google Cloud Dataflow

APACHE SPARK

APACHE SPARK (STREAMING)

Typical Use: distributed, large-scale data processing; micro-batchingWhy Spark Streaming?

• Micro-batching creates very low latency, which can be faster

• Well defined role means it fits in well with other pieces of the pipeline

● Execution Model● Native Streaming vs Microbatch

● Fault Tolerance Granularity● Per record, per batch

● Delivery Guarantees● API

● SQL● Spark

● Performance….● Realtime ≠ Realtime

● Community● Mesos Support ;-)

HOW TO CHOOSE?

EXECUTION MODELMicro-Batching Native

Streaming

FAULT TOLERANCECheckpoint per “Batch”Ack-Per-Record Checkpoint per Batch

DELIVERY GUARANTEES“Exactly once”At least Once

EVENTS

INGEST

Apache Kafka

Apache Spark

ANALYZE

Apache Cassandra

applications

Sensors

Devices

Clients

Datastores

Data ModelKey-Value GraphRelational Document

● Schema

● SQL

● Foreign Keys/Joins

● OLTP/OLAP

● Simple

● Scalable

● Cache

FilesTime-Series● Complex

relations

● Social Graph

● Recommendation

● Fraud detections

● Schema-Less

● Semi-structured queries

● Product catalogue

● Session data

Demo Time

Generator Display

1. Financial data created by generator

2. Written to Kafka topics

3. Kafka Topics consumed by Flink

4. Flink pipeline operates on Kafka data

5. Results written back into Kafka stream (another topic)

6. Results displayed

Keep it running!

SERVICE OPERATIONS

● Configuration Updates (ex: Scaling, re-configuration) ● Binary Upgrades ● Cluster Maintenance (ex: Backup, Restore, Restart) ● Monitor progress of operations ● Debug any runtime blockages

CODEMOTION MILAN November 10/11th,2017http://milan2017.codemotionworld.com/

See you at

webinar - big data: let's smack - jorg schad

Technology

genz smack

reglament smack ball

martina schad unfallkasse hessen partner für sicherheit 1...

digital trends & quiz jorg verweij

schad leverages iot with toronto airport authority

brian jorg outdoors

biztalk smack down full

jorg sabellicus - Πρακτική Μαγεία

energa libre-smack-el-pontenciador-d

smack brochure

smack 2011

spark summit eu talk by jorg schad

natalia almonte - smack mellon

smack julekatalog 2013 for utsendelse

a smack...

smack talk, big data, and localization

jesień linuksowa 2013, smack

smack down

smack: behind the refactorings

schad v. arizona, 501 u.s. 624 (1991)