Spark with Couchbase to Electrify Your Data Processing: Couchbase Connect 2015

SPARK WITH COUCHBASE TO ELECTRIFY YOUR DATA PROCESSING

Michael Nitschinger, Couchbase

What is Spark?

Introduction

Apache Spark is a fast and general engine for large-scale data processing.

More Facts

Over 450 contributors, very active Apache Big Data project.

Huge public interest:

Source: http://www.google.com/trends/explore?hl=en-US#q=apache%20spark,%20apache%20hadoop&cmpt=q

Community

Ecosystem growing fast: Hadoop, RDBMS, NoSQL

Package Repository: http://spark-packages.org/ (connectors, utility libraries)

Components: Spark Core

Resilient Distributed Datasets

Clustering

Execution

Components: Spark SQL

Structured Data Frames

Distributed querying with SQL

Components: Spark Streaming

Fault-tolerant streaming applications

Components: Spark MLlib

Built-In Machine Learning Algorithms

Components: Spark GraphX

Graph processing and graph-parallel computations

How does it work?

Resilient Distributed Datasets paper:

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

rdd1.join(rdd2).groupBy(…).filter(…)

RDD Objects: build DAG

DAGScheduler: split graph into stages of tasks, submit each stage as ready (agnostic to operators!)

TaskScheduler: receive each stage as a TaskSet, launch tasks via cluster manager, retry failed or straggling tasks, report "stage failed" back (doesn’t know about stages)

Worker: execute tasks in threads, store and serve blocks (block manager)
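The pipeline above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API: the `Node` class and the `split_into_stages` function are hypothetical names, and the rule "every `join` and `groupBy` needs a shuffle" is a simplifying assumption (real Spark can skip a shuffle when data is already partitioned suitably). It shows two ideas from the slides: transformations only record lineage (build DAG), and the graph is then cut into stages at shuffle boundaries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One operator in a toy lineage graph (hypothetical, not Spark's RDD class)."""
    op: str
    parents: List["Node"] = field(default_factory=list)
    shuffle: bool = False  # assumption: True when the operator shuffles its inputs

    def _derive(self, op: str, shuffle: bool = False,
                other: Optional["Node"] = None) -> "Node":
        # Transformations only record lineage; nothing is computed here.
        parents = [self] if other is None else [self, other]
        return Node(op, parents, shuffle)

    def filter(self) -> "Node":
        return self._derive("filter")

    def group_by(self) -> "Node":
        return self._derive("groupBy", shuffle=True)

    def join(self, other: "Node") -> "Node":
        return self._derive("join", shuffle=True, other=other)


def _build_stage(node: Node, stages: List[List[str]]) -> List[str]:
    """Return the operators pipelined into the stage ending at `node`.

    Stages that must finish earlier are appended to `stages` first.
    A node shared by several children would be visited once per child;
    that is acceptable for this toy model.
    """
    ops: List[str] = []
    for parent in node.parents:
        parent_ops = _build_stage(parent, stages)
        if node.shuffle:
            stages.append(parent_ops)  # shuffle boundary: parent ends its own stage
        else:
            ops.extend(parent_ops)     # no shuffle: pipeline parent into this stage
    ops.append(node.op)
    return ops


def split_into_stages(final: Node) -> List[List[str]]:
    """Cut the lineage graph into stages, listed in a valid execution order."""
    stages: List[List[str]] = []
    stages.append(_build_stage(final, stages))
    return stages


# Mirrors the slide's expression rdd1.join(rdd2).groupBy(…).filter(…)
rdd1, rdd2 = Node("rdd1"), Node("rdd2")
stages = split_into_stages(rdd1.join(rdd2).group_by().filter())
for number, ops in enumerate(stages):
    print(number, " -> ".join(ops))
```

In this model the expression yields four stages: one per input, one for the join, and a final one in which `groupBy` and `filter` are pipelined together. A scheduler in the spirit of the slides would submit each stage once the stages before it have finished.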

Why should you care?

Spark Benefits

Linearly scalable to 1000+ worker nodes

Simpler to use than Hadoop MR

Only partial recompute on failure

For developers and data scientists: machine learning, R integration

Tight but not mandatory Hadoop integration: sources, sinks, scheduler
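The "only partial recompute on failure" benefit follows from lineage: each dataset remembers how it was derived, so a lost partition can be rebuilt without touching the others. The sketch below is a toy model in plain Python under that assumption; `ToyRDD` and its methods are hypothetical names and not part of Spark, and deleting a cache entry stands in for losing a worker.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus the recipe to derive them."""

    def __init__(self, data: Optional[List[list]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable] = None) -> None:
        self.data = data          # only set on a source dataset: list of partitions
        self.parent = parent      # lineage: the dataset this one is derived from
        self.fn = fn              # lineage: the function applied to each element
        self.cache: Dict[int, list] = {}  # partition index -> computed partition
        self.computations = 0     # how many partitions were (re)computed

    def map(self, fn: Callable) -> "ToyRDD":
        # Lazy: only the recipe is stored, nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self) -> int:
        if self.parent is None:
            return len(self.data)
        return self.parent.num_partitions()

    def compute(self, index: int) -> list:
        if self.parent is None:
            return self.data[index]
        if index not in self.cache:
            # Rebuild just this partition from its parent partition.
            self.computations += 1
            self.cache[index] = [self.fn(x) for x in self.parent.compute(index)]
        return self.cache[index]

    def collect(self) -> list:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


source = ToyRDD(data=[[1, 2], [3, 4], [5, 6]])
doubled = source.map(lambda x: x * 2)

first = doubled.collect()      # computes all three partitions
del doubled.cache[1]           # simulate losing the worker that held partition 1
second = doubled.collect()     # recomputes partition 1 only
print(first == second, doubled.computations)
```

After the simulated failure the counter stands at four rather than six: three partitions for the first run plus one rebuilt partition, while the two surviving partitions are served from the cache.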

Spark vs Hadoop

Spark is mainly RAM bound, while Hadoop is mainly HDFS (disk) bound

Fully compatible with Hadoop Input/Output

Easier to develop against thanks to functional composition

Hadoop is certainly more mature, but the Spark ecosystem is growing fast

Ecosystem Flexibility

Streams, Web APIs

DCP, KV, N1QL, Views

Batching, Data Archive

OLTP Data

Infrastructure Consolidation

The Couchbase Spark Connector

Couchbase Connector

Spark Core: Automatic Cluster and Resource Management; Creating and Persisting RDDs; Java APIs in addition to Scala (planned before GA)

Spark SQL: Easy JSON handling and querying; Tight N1QL Integration (partially in dp2, fully planned before

Spark Streaming: Persisting DStreams; DCP source (partially in dp2, fully planned before GA)

Facts

Current Version: 1.0.0-dp2

Beta in July, GA in Q3 (tentative)

Code: https://github.com/couchbaselabs/couchbase-spark-connector

Docs until GA: https://github.com/couchbaselabs/couchbase-spark-connector/wiki

Connection Management

Creating RDDs

Persisting RDDs

Spark SQL Integration

Spark Streaming with DCP

Questions?

Thank you.
