data pipelines with spark & datastax enterprise

Simon Ambridge

Data Pipelines With Spark & DSE

An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines

Version 0.8

Certified Apache Cassandra and DataStax enthusiast who enjoys explaining that the traditional approaches to data management just don’t cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world.

Previously 25 years implementing Oracle relational data management solutions. Certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and [email protected]

@stratman1958

Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK

Introduction To Big Data Pipelines

Big, Static Data

Fast, Streaming Data

Big Data Pipelining: Classification

Big Data Pipelines can mean different things to different people

Repeated analysis on a static but massive dataset

• An element of research – e.g. genomics, clinical trial, demographic data

• Typically repetitive, iterative, shared amongst data scientists for analysis

Real-time analytics on streaming data

• Industrialised or commercial processes – sensors, tick data, bioinformatics, transactional data, real-time personalisation

• Happening in real-time, data cannot be dropped or lost

Static Datasets

All You Can Eat? Really.

Static Data Analytics : Traditional Tools

Repeated iterations, at each stage

Run/debug cycle can be slow

[Figure: Typical traditional ‘static’ data analysis model. Data flows through Sampling, Modeling, Interpret, Tuning and Reporting to Results, with a Re-sample loop back to Sampling.]

Static Data Analytics : Scale Up Challenges

Sampling and analysis often run on a single machine

• CPU and memory limitations

Limited sampling of a large dataset because of data size limitations

• Multiple iterations over large datasets is frequently not an ideal approach

Static Data Analytics : Traditional Scaling

[Figure: data growing from MB to GB to TB. Small datasets, small servers; large datasets, large servers.]

Static Data Analytics: Big Data Problems

Data is getting Really Big!

• Data volumes are getting larger!

• The number of data sources is exploding!

• More data is arriving faster!

Scaling up is becoming impractical

• Physical limits

• Data limits

• The validity of the analysis becomes obsolete, faster

Static Data Analytics : Big Data Needs

We need scalable infrastructure + distributed technologies

• Data volumes can be scaled

• Distribute the data across multiple low-cost machines

• Faster processing

• More complex processing

• No single point of failure
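To make "distribute the data across multiple low-cost machines" concrete, here is a minimal sketch of hash-based placement. It is an illustration only: the node names, the keys and the simple modulo scheme are assumptions made for this example, and this is not how Cassandra or DSE actually assigns data to nodes.

```python
import hashlib
from collections import Counter


def node_for_key(key, nodes):
    """Pick a node for a key by hashing it (simple modulo placement).

    Illustrative only: real distributed stores use more elaborate schemes
    so that adding a node does not move most of the keys.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(keys, nodes):
    """Count how many keys land on each node."""
    return Counter(node_for_key(k, nodes) for k in keys)


if __name__ == "__main__":
    # Hypothetical keys and node names, invented for this sketch.
    keys = ["sensor-%d" % i for i in range(10000)]
    for node_count in (3, 6):
        nodes = ["node-%d" % i for i in range(node_count)]
        print(node_count, "nodes:", dict(distribute(keys, nodes)))
```

Running it with 3 and then 6 nodes shows each node holding a roughly equal share, and the share per node falling as nodes are added, which is the scale-out behaviour the bullets above ask for.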

Static Data Analytics : DSE Delivers

Building a distributed data processing framework can be a complex task!

It needs to be:

• Scalable

• Fast in-memory processing

• Replicated for resiliency

• Batch and real-time data feeds

• Ad-hoc queries

DataStax delivers an integrated analytics platform

Cassandra: THE Web, IoT & Cloud Database

What is Apache Cassandra?

• Very fast

• Extremely resilient

• Across multiple data centres

• No single point of failure

• Continuous Availability, Disaster Avoidance

• Linear scale

• Easy to operate

Enterprise Cassandra platform from DataStax

DataStax Enterprise

DataStax Enterprise: Editions

DataStax Enterprise Standard

• DSE Standard is DataStax’s entry level commercial database offering

• Represents the minimum recommended to deploy Cassandra in a production environment

DataStax Enterprise Max

• DSE Max is DataStax’s advanced commercial database offering

• Designed for production Cassandra environments that have mixed workload requirements

Spark: THE Analytics Engine

What is Apache Spark?

• Distributed in-memory analytic processing

• Batch and streaming analytics

• Fast - 10x-100x faster than Hadoop MapReduce

• Rich Scala, Java and Python APIs

Tightly integrated with DSE

Spark: Daytona GraySort Contest

Daytona GraySort benchmark - tests how fast a system can sort 100 TB of data (1 trillion records)

• Previous world record held by Hadoop MapReduce cluster of 2100 nodes, in 72 minutes

• 2014: Spark completed the benchmark in 23 minutes on just 206 EC2 nodes = 3X faster using 10X fewer machines

• Spark sorted 1 PB (10 trillion records) on 190 machines in < 4 hours. Previous Hadoop MapReduce time of 16 hours on 3800 machines = 4X faster using 20X fewer machines
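The "X faster, X fewer machines" figures can be checked with simple arithmetic. The sketch below uses only the numbers quoted on this slide; the function name is made up for the example, and the 1 PB run is taken as exactly 4 hours, although the slide says "< 4 hours", so the real speed-up is slightly higher.

```python
def ratio(old, new):
    """How many times larger the old figure is than the new one."""
    return old / new


# 100 TB sort: Hadoop MapReduce (2100 nodes, 72 min) vs Spark (206 nodes, 23 min)
speedup_100tb = ratio(72, 23)       # about 3.1x faster
machines_100tb = ratio(2100, 206)   # about 10.2x fewer machines

# 1 PB sort: Hadoop MapReduce (3800 machines, 16 h) vs Spark (190 machines, 4 h)
speedup_1pb = ratio(16 * 60, 4 * 60)  # 4.0x faster (a lower bound)
machines_1pb = ratio(3800, 190)       # 20.0x fewer machines

if __name__ == "__main__":
    print("100 TB: %.1fx faster, %.1fx fewer machines" % (speedup_100tb, machines_100tb))
    print("1 PB:   %.1fx faster, %.1fx fewer machines" % (speedup_1pb, machines_1pb))
```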

DataStax Enterprise: Analytics Integration

• Apache Cassandra for Distributed Persistent Storage

• Integrated Apache Spark for Distributed Real-Time Analytics

• Analytics nodes close to data - no ETL required

[Figure: two deployment options compared]

Integrated Cassandra and Spark cluster

• Tight integration

• Data locality

• Microsecond response times

Separate Cassandra Cluster and Spark Cluster linked by ETL (marked X)

• Loose integration

• Data separate from processing

• Millisecond response times

“Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”

Static Data Analytics : Requirements

Valid data pipeline analysis methods must be:

Auditable

• Reproducible

• Documented

Controlled

• Version control

Collaborative

• Accessible

Notebooks: Features

What are Notebooks?

• Drive your data analysis from the browser

• Highly interactive

• Tight integration with Apache Spark

Handy tools for analysts:

• Reproducible visual analysis

• Code in Scala, CQL, SparkSQL, Python

• Charting – pie, bar, line etc

• Extensible with custom libraries

Example: Spark Notebook

[Screenshot: a Spark Notebook with callouts for Cells, Markdown, Output and Controls]

Static Data Analytics : Approach

Example architecture & requirements

1. Optimised source data format

2. Distributed in-memory analytics

3. Interactive and flexible data analysis tool

4. Persistent data store

5. Visualisation tools

Static Data Analytics : Example

Genome research platform - ADST (Agile Data Science Toolkit)

[Figure: pipeline components: ADAM, Notebook, Persistent Storage (OLTP Database), Visualisation]

Static Data Analytics : Pipeline Process Flow

1. Source data

2. Interactive, flexible and reproducible analysis

3. Persistent data storage

4. Visualise and analyse

Static Data Analytics : Pipeline Scalability

• Add more (physical or virtual) nodes as required to add capacity

• Container tools ease configuration management and deployment

• Scale out quickly

Static Data Analytics : Now

• No longer an iterative process constrained by hardware limitations

• Now a more scalable, resilient, dynamic, interactive process, easily shareable

The new model for large-scale static data analytics

[Figure: Load, Analyse, Share cycle; scale and distribute processing]

Real-Time Datasets

If it’s Not “Now”, Then It’s Probably Already Too Late

Big Data Pipelining: Why Real-Time?

• React to customers faster and with more accuracy

• Reduce risk through more accurate understanding of the market

• Optimise return on marketing investment

• Faster time to market

• Improve efficiency

In a highly connected world

In most cases, ‘real-time’ means data changing at intervals of less than 1 second

Big Data Pipelining: Real-Time Analytics

• Capture, prepare, and process fast streaming data

• Different approach from traditional batch processing

• The speed of now – cannot wait

• Immediate insight, instant decisions

What problem are we trying to solve?

Big Data Pipelining: Real-Time Use Cases

Sensor data (IoT)

Transactional data

User Experience

Social media

Use cases for streaming analytics

Big Data Analytics: Streams

Data streams?

Data torrent?

Data tidal waves!

Netflix

• Ingests petabytes of data per day

• Over 1 TRILLION transactions per day (>10 million per second) into DSE
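As a quick sanity check on the Netflix figure, one trillion transactions per day can be converted to a per-second rate. This is plain arithmetic on the number quoted above and assumes the load is spread evenly over the day, which real traffic would not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

transactions_per_day = 1_000_000_000_000  # "over 1 trillion", as quoted on the slide
per_second = transactions_per_day / SECONDS_PER_DAY

if __name__ == "__main__":
    # Roughly 11.6 million per second, consistent with ">10 million per second".
    print("%.1f million transactions per second" % (per_second / 1e6))
```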

Big Data Pipelining: Real-Time architecture

Analytics in real-time, at scale

Fast processing, distributed, in-memory

Increasingly using a technology stack comprising Kafka, Spark and Cassandra

• Scalable

• Distributed

• Resilient

Streaming analytics architecture - what do we need?

Kafka: Architecture

How Does Kafka Work?

Kafka “De-couples” producers and consumers in data pipelines

‘Producers’ send messages to the Kafka cluster, which in turn serves them up to ‘Consumers’

• Kafka maintains feeds of messages in categories called topics

• A Kafka cluster comprises one or more servers, each called a broker

[Figure: three Producers send messages to the Kafka Cluster, which serves them to three Consumers]

Kafka: Streaming With Spark

Kafka writes, Spark reads

• Topics can have multiple partitions

• Each topic partition stored as a log (an ordered set of messages)

• Messages are simply byte arrays, so can store any object in any format

• Each message in a partition is assigned a unique offset

Spark consumes messages as a stream, in micro batches, saved as RDDs

[Figure: a Temperature Topic with Partition 0 and Partition 1 (messages 1-8 each) read by two Temperature Consumers, and a Rainfall Topic with Partition 0 (messages 1-6) read by a Rainfall Consumer]
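The partition, offset and micro-batch ideas above can be sketched with a toy in-memory model. This is not the Kafka or Spark Streaming API: the `Topic` and `MicroBatchConsumer` classes, their methods and the sample readings are all invented for illustration. Offsets start at 0 here, as they do in Kafka, whereas the figure numbers messages from 1.

```python
class Topic:
    """Toy stand-in for a topic: one append-only log per partition."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, payload):
        """Append a byte-array message and return its offset in that partition."""
        log = self.partitions[partition]
        log.append(payload)
        return len(log) - 1

    def read(self, partition, start_offset, max_messages):
        """Return up to max_messages (offset, payload) pairs from start_offset."""
        log = self.partitions[partition]
        end = min(start_offset + max_messages, len(log))
        return [(offset, log[offset]) for offset in range(start_offset, end)]


class MicroBatchConsumer:
    """Reads every partition in small batches and tracks its own offsets."""

    def __init__(self, topic, batch_size):
        self.topic = topic
        self.batch_size = batch_size
        self.next_offset = [0] * len(topic.partitions)

    def poll(self):
        """Return the next micro batch as (partition, offset, payload) tuples."""
        batch = []
        for partition in range(len(self.topic.partitions)):
            messages = self.topic.read(
                partition, self.next_offset[partition], self.batch_size
            )
            if messages:
                # Remember where to resume on the next poll.
                self.next_offset[partition] = messages[-1][0] + 1
            batch.extend((partition, off, data) for off, data in messages)
        return batch


if __name__ == "__main__":
    temperature = Topic("temperature", num_partitions=2)
    for i, reading in enumerate([b"21.5", b"19.0", b"21.7", b"19.2"]):
        temperature.append(i % 2, reading)  # producer side: write only

    consumer = MicroBatchConsumer(temperature, batch_size=1)
    batch = consumer.poll()
    while batch:  # consumer side: read at its own pace
        print(batch)
        batch = consumer.poll()
```

Because the producer only appends and the consumer only tracks its own offsets, neither needs to know about the other, which is the de-coupling described on the architecture slide.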

DataStax Enterprise: Streaming Schematic

[Figure: Sensor Network, Signal Aggregation Services, Messaging Queue (Sensor Data Queue Management, Brokers), Collection Service (Streaming Data Ingest), Data Storage (OLTP Persistence Layer)]

DataStax Enterprise: Streaming Analytics

[Figure: Real-time Analytics and Persistent Storage (OLTP Database) feeding Personalisation, Actionable insight, Monitoring, and Web / Analytics / BI]

DataStax Enterprise: Multi-DC Uses

Real-time active-active geo-replication across physical datacentres

[Figure: node rings in DC: USA and DC: EUROPE replicating to each other]

Workload separation via virtual datacentres

[Figure: an OLTP datacentre (Cassandra) and an Analytics datacentre (Cassandra + Spark) linked by replication]

Real-Time Analytics: DSE Multi-DC

Workload Management and Separation With DSE

Mixed Load OLTP and Analytics Platform

Separation of OLTP from Analytics

100% Uptime, Global Scale

[Figure: Social Media, IoT and App/Web sources feed an OLTP Datacentre (Personalisation & Persistence), linked by replication to an Analytics Datacentre (Real-Time Analytics) that serves Analytics / BI over JDBC/ODBC. Outputs: Personalisation, Actionable insight, Monitoring.]

DSE & Analytics : Summary

Static, Massive Data

Scalable Data Pipelines

1. Optimised data storage formats

2. Scalable, distributed technologies

3. Flexible and interactive analysis tools

4. Resilient, persistent Storage

Real-Time Streaming Data

Scalable Data Pipelines

1. Scalable, distributed technologies

2. De-coupled Producers and Consumers

3. Real-Time analytics

4. Resilient, persistent Storage

Spark

Mesos

Akka

Cassandra

Kafka

Thank you!
