apache cassandra and spark for simple, distributed, near real-time
TRANSCRIPT
![Page 1: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/1.jpg)
Data at the Speed of your UsersApache Cassandra and Spark for simple, distributed, near real-time stream processing.
GOTO Copenhagen 2014
![Page 2: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/2.jpg)
Rustam Aliyev
Solution Architect at . !
! @rstml
![Page 3: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/3.jpg)
Big Data?
Photo: Flickr / Watches En Masse
![Page 4: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/4.jpg)
" Volume
# Variety
$ Velocity
![Page 5: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/5.jpg)
Velocity = Near Real Time
![Page 6: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/6.jpg)
Near Real Time?
![Page 7: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/7.jpg)
0.5 sec ≤ ≤ 60 secNear Real Time
![Page 8: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/8.jpg)
Use Cases
Photo: Flickr / Swiss Army / Jim Pennucci
![Page 9: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/9.jpg)
Web Analytics Dynamic Pricing
Recommendation Fraud Detection
![Page 10: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/10.jpg)
Architecture
Photo: Ilkin Kangarli / Baku Haydar Aliyev Center
![Page 11: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/11.jpg)
Architecture Goals
Low Latency
High Availability
Horizontal Scalability
Simplicity
![Page 12: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/12.jpg)
Stream Processing
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%Collection Processing Storing Delivery
![Page 13: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/13.jpg)
Stream Processing
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%Collection!
!
Spark
!
CassandraDelivery
![Page 14: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/14.jpg)
Cassandra Distributed Database
Photo: Flickr / Hypostyle Hall / Jorge Láscar
![Page 15: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/15.jpg)
Data Model
![Page 16: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/16.jpg)
Partition
Cell 1 Cell 2 …Cell 3Partition Key
![Page 17: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/17.jpg)
Partition
os: Android
storage: 32GB
version: 4.4
weight: 130g
sort order on disk
Nexus 5
![Page 18: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/18.jpg)
Table
os: Android
storage: 32GB
version: 4.4
weight: 130gNexus 5
os: iOS
storage: 64GB
version: 8.0
weight: 129giPhone 6
os: memory: version: weight: Other
![Page 19: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/19.jpg)
Distribution
![Page 20: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/20.jpg)
0000
8000
4000C000
2000
6000
E000
A000
3D97Nexus 5
![Page 21: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/21.jpg)
0000
8000
4000C000
2000
6000
E000
A000
9C4FiPhone 6
3D97
![Page 22: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/22.jpg)
Replication
![Page 23: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/23.jpg)
0000
8000
4000C000
2000
6000
E000
A000
3D97
9C4F
1 replica
![Page 24: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/24.jpg)
0000
8000
4000C000
2000
6000
E000
A000
3D97
9C4F
9C4F
3D97
2 replicas
![Page 25: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/25.jpg)
Spark Distributed Data Processing Engine
Photo: Flickr / Sparklers / Alexandra Compo / CreativeCommons
![Page 26: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/26.jpg)
Fast
In-memory
![Page 27: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/27.jpg)
Logistic Regression
Run
ning
Tim
e (s
)
1000
2000
3000
4000
Number of Iterations
1 5 10 20 30
SparkHadoop
![Page 28: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/28.jpg)
Easy
![Page 29: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/29.jpg)
map !
reduce
![Page 30: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/30.jpg)
map filter groupBy sort union join leftOuterJoin rightOuterJoin
reduce count fold reduceByKey groupByKey cogroup cross zip
sample take first partitionBy mapWith pipe save ...
![Page 31: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/31.jpg)
RDD Resilient Distributed Datasets
Node 1
Node 2
Node 3Node 1
Node 2
Node 3
![Page 32: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/32.jpg)
Operator DAG
groupBy
join
filtermap
Disk RDD Memory RDD
![Page 33: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/33.jpg)
Spark Streaming
Micro-batching
![Page 34: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/34.jpg)
RDD
DStreamData Stream
![Page 35: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/35.jpg)
Spark + Cassandra
DataStax Spark Cassandra Connector
![Page 36: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/36.jpg)
https://github.com/datastax/spark-cassandra-connector
![Page 37: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/37.jpg)
M
M
M
Cassandra
Spark Worker
Spark Master & Worker
![Page 38: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/38.jpg)
Demo!
! Twitter Analytics
![Page 39: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/39.jpg)
Cassandra Data Model
![Page 40: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/40.jpg)
ALL: 7139
2014-09-21: 220
2014-09-20: 309
2014-09-19: 129
sort order
#hashtag
![Page 41: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/41.jpg)
CREATE TABLE hashtags ( hashtag text, interval text, mentions counter, PRIMARY KEY((hashtag), interval) ) WITH CLUSTERING ORDER BY (interval DESC);
![Page 42: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/42.jpg)
Processing Data Stream
![Page 43: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/43.jpg)
import com.datastax.spark.connector.streaming._ !val sc = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") !val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !tagCountsAll.saveToCassandra(
![Page 44: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/44.jpg)
import com.datastax.spark.connector.streaming._ !val sc = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") !val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !tagCountsAll.saveToCassandra(
![Page 45: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/45.jpg)
import com.datastax.spark.connector.streaming._ !val sc = new SparkConf() .setMaster("spark://127.0.0.1:7077") .setAppName("Twitter-‐Demo") .setJars("demo-‐assembly-‐1.0.jar")) .set("spark.cassandra.connection.host", "127.0.0.1") !val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !tagCountsAll.saveToCassandra(
![Page 46: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/46.jpg)
!val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) !ssc.start() ssc.awaitTermination()
![Page 47: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/47.jpg)
!val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsByDay = tagCounts.map{ case (tag, mentions) => (tag, mentions, DateTime.now.toString("yyyyMMdd")) } !tagCountsByDay.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) !ssc.start() ssc.awaitTermination()
![Page 48: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/48.jpg)
!val ssc = new StreamingContext(sc, Seconds(2)) !val stream = TwitterUtils. createStream(ssc, None, Nil, storageLevel = StorageLevel.MEMORY_ONLY_SER_2) !val hashTags = stream.flatMap(tweet => tweet.getText.toLowerCase.split(" "). filter(tags.contains(Seq("#iphone", "#android")))) !val tagCounts = hashTags.map((_, 1)).reduceByKey(_ + _) !val tagCountsAll = tagCounts.map{ case (tag, mentions) => (tag, mentions, "ALL") } !tagCountsAll.saveToCassandra( "demo_ks", "hashtags", Seq("hashtag", "mentions", "interval")) !ssc.start() ssc.awaitTermination()
![Page 49: Apache Cassandra and Spark for simple, distributed, near real-time](https://reader033.vdocuments.site/reader033/viewer/2022052707/58a1b38e1a28ab244d8c4a48/html5/thumbnails/49.jpg)
Questions ?