Harnessing the Power of Spark and Cassandra in Your Spring App

Post on 29-Jan-2018

Harnessing the Power of Spark + Cassandra within your Spring App

Steve Pember

CTO, ThirdChannel @svpember

@spring_io #springio17

RELATIONAL DATABASES ARE FANTASTIC

SQL MAKES YOU STRONG

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

• Demo

Apache Spark

• Distributed Execution Engine

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

Hadoop

• Map / Reduce

• Storage via HDFS

• Each calculation step written to disk

Spark

• More than Map/Reduce

• No dependent storage mechanism

• Clustered calculations, each step in memory

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

Your Spring App

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

THE SPARKCONTEXT SUBMITS JOBS TO THE CLUSTER

OPERATIONS ARE PERFORMED AGAINST RDDS

Resilient Distributed Dataset

• Immutable

• Partitioned

• Parallel operations

• Created by performing operations on other RDDs

• Reusable & Composable
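As a rough illustration of the properties listed above, here is a toy sketch in plain Java. MiniRdd is a hypothetical class invented for this example and is not part of Spark. It only mimics the idea that a dataset is split into partitions, is never changed in place, and that every operation produces a new dataset. Real RDDs are also lazy and spread across a cluster, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy stand-in for an RDD. Hypothetical class, NOT Spark's API. */
final class MiniRdd<T> {
    // Partitions are only read after construction, so a MiniRdd is effectively immutable.
    private final List<List<T>> partitions;

    private MiniRdd(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    /** Splits the data into at most numPartitions contiguous chunks (numPartitions must be > 0). */
    static <T> MiniRdd<T> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        for (int start = 0; start < data.size(); start += chunk) {
            int end = Math.min(start + chunk, data.size());
            parts.add(new ArrayList<>(data.subList(start, end)));
        }
        return new MiniRdd<>(parts);
    }

    /** Returns a new dataset; this one is left untouched. */
    <R> MiniRdd<R> map(Function<T, R> f) {
        List<List<R>> out = new ArrayList<>();
        for (List<T> p : partitions) {
            List<R> mapped = new ArrayList<>();
            for (T item : p) mapped.add(f.apply(item));
            out.add(mapped);
        }
        return new MiniRdd<>(out);
    }

    /** Returns a new dataset holding only the items that pass the predicate. */
    MiniRdd<T> filter(Predicate<T> keep) {
        List<List<T>> out = new ArrayList<>();
        for (List<T> p : partitions) {
            List<T> kept = new ArrayList<>();
            for (T item : p) if (keep.test(item)) kept.add(item);
            out.add(kept);
        }
        return new MiniRdd<>(out);
    }

    /** Reduces each partition in parallel, then combines the partial results. op must be associative. */
    Optional<T> reduce(BinaryOperator<T> op) {
        return partitions.parallelStream()
                .map(p -> p.stream().reduce(op))
                .filter(Optional::isPresent)
                .map(Optional::get)
                .reduce(op);
    }

    /** Gathers all partitions back into one list, in partition order. */
    List<T> collect() {
        List<T> all = new ArrayList<>();
        for (List<T> p : partitions) all.addAll(p);
        return all;
    }

    public static void main(String[] args) {
        MiniRdd<Integer> numbers = MiniRdd.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);
        // Composing operations creates new datasets; 'numbers' can be reused afterwards.
        MiniRdd<Integer> evenSquares = numbers.filter(n -> n % 2 == 0).map(n -> n * n);
        System.out.println(evenSquares.collect());
        System.out.println(evenSquares.reduce(Integer::sum).orElse(0));
        System.out.println(numbers.collect());
    }
}
```

Because filter and map return new objects, the original dataset can feed several different pipelines, which is the "reusable and composable" point in miniature.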

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

• APIs

MORE THAN MAP/REDUCE

RDD operations

• map

• reduce

• aggregate

• filter

• flatMap

• join

• … plus many more

Apache Spark

• Distributed Execution Engine

• What about Hadoop?

• Creation was a Happy Accident

• Architecture

• Programmatic structure

• APIs

• Additional Modules

SPARK SQL…!

JDBC?

SPARK STREAMING!

Agenda

• Spark

• Cassandra

Apache Cassandra (C*)

• NoSQL Datastore

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

DETERMINISTIC DISTRIBUTION
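Deterministic distribution means any node can work out where a row lives from its partition key alone. The sketch below is a simplified token ring in plain Java. TokenRing and its methods are hypothetical names made up for this example, CRC32 stands in for Cassandra's Murmur3-based default partitioner, and virtual nodes and replication are left out entirely.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Toy token ring. Hypothetical class, not a Cassandra API. */
final class TokenRing {
    // Each node is placed on the ring at a token; it owns the range ending at that token.
    private final TreeMap<Long, String> nodesByToken = new TreeMap<>();

    /** Hashes a partition key to a token in [0, 2^32). CRC32 is an assumption for the demo. */
    static long token(String partitionKey) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addNode(String name, long token) {
        nodesByToken.put(token, name);
    }

    /** The owner is the first node whose token is >= the given token, wrapping around the ring. */
    String ownerOfToken(long token) {
        if (nodesByToken.isEmpty()) throw new IllegalStateException("ring has no nodes");
        Map.Entry<Long, String> entry = nodesByToken.ceilingEntry(token);
        if (entry == null) entry = nodesByToken.firstEntry(); // past the last token: wrap around
        return entry.getValue();
    }

    /** Same key, same ring: always the same node, with no lookup table needed. */
    String ownerOf(String partitionKey) {
        return ownerOfToken(token(partitionKey));
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode("node-a", 1L << 30);
        ring.addNode("node-b", 1L << 31);
        ring.addNode("node-c", 3L << 30);
        System.out.println("user-42 lives on " + ring.ownerOf("user-42"));
    }
}
```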

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

EACH NEW NODE RESULTS IN INCREASED STORAGE WITH NO LOSS IN PERFORMANCE

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

• Data Model (CQL)

COLUMN ORIENTED DATABASE

BUT IT’S SQL-LIKE!

QUERYING

C* Querying

• select * from …

• all queries must include partition key(s)

• ORDER BY limited to clustering keys
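These query rules follow from how rows are laid out: the partition key decides which partition holds a row, and rows inside a partition are kept sorted by their clustering columns. The toy model below is a sketch in plain Java, not a Cassandra API. ToyTable, the sensor_id / reading_time schema, and the CQL shown in the comments are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy table: partition key -> rows sorted by clustering key. Hypothetical, not a Cassandra API. */
final class ToyTable {
    // Models a hypothetical schema: PRIMARY KEY ((sensor_id), reading_time)
    private final Map<String, TreeMap<Long, Double>> partitions = new HashMap<>();

    void insert(String sensorId, long readingTime, double value) {
        // Rows land in their partition and are kept ordered by the clustering key.
        partitions.computeIfAbsent(sensorId, k -> new TreeMap<>()).put(readingTime, value);
    }

    /**
     * Roughly: SELECT value FROM readings WHERE sensor_id = ? ORDER BY reading_time ASC|DESC
     * The partition key is mandatory because it is the only way to find the partition;
     * ordering is cheap only along the clustering key because that is how rows are stored.
     */
    List<Double> select(String sensorId, boolean descending) {
        TreeMap<Long, Double> rows = partitions.get(sensorId);
        if (rows == null) return Collections.emptyList();
        return new ArrayList<>(descending ? rows.descendingMap().values() : rows.values());
    }

    public static void main(String[] args) {
        ToyTable readings = new ToyTable();
        readings.insert("sensor-1", 30L, 3.0);
        readings.insert("sensor-1", 10L, 1.0);
        readings.insert("sensor-1", 20L, 2.0);
        System.out.println(readings.select("sensor-1", false));
        System.out.println(readings.select("sensor-1", true));
    }
}
```

Sorting by any other column would mean reading and re-sorting whole partitions, which is the kind of work the talk hands to Spark instead.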

Apache Cassandra (C*)

• NoSQL Datastore

• Distributed

• High Replication

• High Durability

• Linear Scalability

• Data Model (CQL)

• Designing your Data Model

Agenda

• Spark

• Cassandra

• Spark + Cassandra

Spark + Cassandra

– Reduce each other’s weaknesses

– Filter on the server side (with C*)

– Join tables, filter results (with Spark)

COMPANIES HAVE BEEN FORMED

CLUSTER DESIGN

DATA LOCALITY!

PIPELINE ARCHITECTURE

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

OPTIONS FOR SPRING?

BUT WE DIDN’T GO THAT ROUTE

Our Excuses

• Wanted to take full advantage of Spark + C* connector

• Our setup / pipeline is relatively minimal

• Programming model is easy

CODING SPARK + C*

• SparkConf

• JavaSparkContext

• JavaFunctions

• Mappers

Spark Conf

• spark.master -> url to the master node

• spark.app.name -> want to see your client show up in the Spark UI?

• spark.executor.memory -> limits memory per executor on workers

• spark.executor.cores -> limits cores on each worker (need to share with C*!)

• spark.submit.deployMode -> ‘client’ or ‘cluster’

• spark.jars.packages -> Maven / Gradle style coordinates (groupId:artifactId:version)

• spark.jars.ivy -> specify custom repos for packages

• more at: http://spark.apache.org/docs/latest/configuration.html#available-properties
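One way to keep these settings together inside an application is to assemble them as plain key/value pairs before handing them to Spark. The sketch below uses only the JDK so it runs without Spark on the classpath. SparkSettings, the host name, and the memory and core values are placeholders chosen for this example, not recommendations from the talk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Collects Spark properties as plain strings. Hypothetical helper, not part of Spark. */
final class SparkSettings {

    /** Builds the property map; every value here is a placeholder. */
    static Map<String, String> forCluster(String masterUrl, String appName) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("spark.master", masterUrl);
        props.put("spark.app.name", appName);           // name shown in the Spark UI
        props.put("spark.executor.memory", "2g");       // assumed value
        props.put("spark.executor.cores", "2");         // assumed value; leave cores free for C*
        props.put("spark.submit.deployMode", "client");
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props =
                forCluster("spark://spark-master.example:7077", "my-spring-app");
        // With Spark available, each pair would be applied to a SparkConf, roughly:
        //   SparkConf conf = new SparkConf();
        //   conf.set(key, value);   // once per entry
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a Spring application the same pairs could just as well come from externalized configuration; the map is only a neutral intermediate form.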

Master URL Overloading

• “local” -> run Spark locally, inside your own JVM, with one thread

• “local[<K>]” -> run Spark locally with K threads

• “local[*]” -> run Spark locally with ALL YOUR THREADS! (one per logical core)

• “spark://<host string>:<port>” -> URL of the master node of a Spark cluster, using Spark’s own (standalone) cluster manager

• also options for Mesos and YARN
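To make the overloading concrete, the following sketch classifies a master string the way the list above describes it. MasterUrls is a hypothetical helper written for this example and is not Spark's own parser; it covers only the simple forms listed here and assumes "mesos://" and "yarn" as the spellings of the other two options.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Describes what a spark.master value means. Hypothetical helper, not Spark's parser. */
final class MasterUrls {
    private static final Pattern LOCAL_N = Pattern.compile("local\\[(\\d+|\\*)\\]");
    private static final Pattern SPARK_CLUSTER = Pattern.compile("spark://([^:/]+):(\\d+)");

    static String describe(String master) {
        if (master.equals("local")) {
            return "local, 1 thread";
        }
        Matcher m = LOCAL_N.matcher(master);
        if (m.matches()) {
            String k = m.group(1);
            // "*" means one thread per logical core on this machine.
            String threads = k.equals("*")
                    ? String.valueOf(Runtime.getRuntime().availableProcessors())
                    : k;
            return "local, " + threads + " threads";
        }
        m = SPARK_CLUSTER.matcher(master);
        if (m.matches()) {
            return "standalone cluster at " + m.group(1) + " port " + m.group(2);
        }
        if (master.startsWith("mesos://") || master.equals("yarn")) {
            return "external cluster manager";
        }
        throw new IllegalArgumentException("unrecognized master URL: " + master);
    }

    public static void main(String[] args) {
        for (String master : new String[] {"local", "local[4]", "local[*]", "spark://10.0.0.5:7077"}) {
            System.out.println(master + " -> " + describe(master));
        }
    }
}
```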

HOWEVER, A WARNING

MOST DIFFICULT PART: WHERE DOES MY CODE LIVE?

CLASS_PATH: org.apache.spark, com.fasterxml.jackson, com.yourco.yourapp.pojos.*

CLASS_PATH: org.apache.spark, com.fasterxml.jackson

CLASS_PATH: org.apache.spark, com.fasterxml.jackson

Agenda

• Spark

• Cassandra

• Spark + Cassandra

• Working with Spark + Cassandra

• Demo

Thank You!

@svpember

Links

• Cassandra on AWS official Whitepaper: https://d0.awsstatic.com/whitepapers/Cassandra_on_AWS.pdf

• Demo Code project link: https://github.com/spember/spark-cass-spring-demo

Images

• Database Sharding: https://dzone.com/articles/ebay-secret-database-scaling

• Indiana Jones Warehouse: http://logisticalfictions.tumblr.com/page/9

• Strong (Spongebob): www.reactiongifs.com/strongbob/?utm_source=rss&utm_medium=rss&utm_campaign=strongbob

• Cheetah: www.livescience.com/21944-usain-bolt-vs-cheetah-animal-olympics.html

• Big Data Cartoon: http://www.kdnuggets.com/2016/08/cartoon-make-data-great-again.html

• Spark Streaming: http://velvia.github.io/presentations/2015-filodb-spark-streaming/#/

• Picard + Riker: http://www.douxreviews.com/2015/09/star-trek-next-generation-matter-of.html

• Software Engineers: http://pyxurz.blogspot.com/2011/10/office-space-page-2-of-6.html

• Throwing Money: https://vimeo.com/132892478
