Spark/Cassandra Connector: API, Best Practices & Use-Cases
DuyHai DOAN (@doanduyhai), Technical Advocate
Who am I?
Duy Hai DOAN, Cassandra technical advocate
• talks, meetups, confs
• open-source dev (Achilles, …)
• OSS Cassandra point of contact
☞ [email protected] ☞ @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
Agenda
• Connector API by example
• Best practices
• Use-cases
• The “Big Data Platform”
Spark/Cassandra connector API
Spark Core · Spark SQL · Spark Streaming
Connector architecture
• All Cassandra types supported and converted to Scala types
• Server-side data filtering (SELECT … WHERE …)
• Uses the Java driver underneath
• Scala and Java support
Connector architecture – Core API
• Cassandra tables exposed as Spark RDDs
• Read from and write to Cassandra
• Mapping of Cassandra tables and rows to Scala objects:
  • CassandraRow
  • Scala case classes (object mapper)
  • Scala tuples
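For illustration, the three mapping flavours above might look like this (a sketch assuming a hypothetical music.performers table and a reachable Cassandra node; connector 1.x API, requires a running cluster):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical schema: music.performers(name text PRIMARY KEY, country text)
case class Performer(name: String, country: String)

val conf = new SparkConf(true)
  .setAppName("core-api-demo")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// 1) Generic CassandraRow
val names = sc.cassandraTable("music", "performers")
  .map(_.getString("name"))

// 2) Object mapper onto a Scala case class
val performers = sc.cassandraTable[Performer]("music", "performers")

// 3) Scala tuples, selecting only the needed columns
val pairs = sc.cassandraTable[(String, String)]("music", "performers")
  .select("name", "country")

// Writing works symmetrically on any RDD of mappable objects
performers.saveToCassandra("music", "performers_backup")
```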
Spark Core
https://github.com/doanduyhai/Cassandra-Spark-Demo
Connector architecture – Spark SQL
Mapping of a Cassandra table to a SchemaRDD:
• CassandraSQLRow → SparkRow
• custom query plan
• push predicates down to CQL for early filtering

SELECT * FROM user_emails WHERE login = 'jdoe';
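In the connector versions contemporary with this talk, the Spark SQL entry point was CassandraSQLContext; a sketch of the query above (the keyspace name is an assumption):

```scala
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cc = new CassandraSQLContext(sc)  // sc: an existing SparkContext
cc.setKeyspace("mail_ks")             // hypothetical keyspace holding user_emails

// The WHERE predicate is pushed down to CQL for early, server-side filtering
val emails = cc.sql("SELECT * FROM user_emails WHERE login = 'jdoe'")
emails.collect().foreach(println)
```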
Spark SQL
https://github.com/doanduyhai/Cassandra-Spark-Demo
Connector architecture – Spark Streaming
Streaming data INTO a Cassandra table:
• trivial setup
• be careful about your Cassandra data model when you have an infinite stream!

Streaming data OUT of Cassandra tables (CDC)?
• work in progress …
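A minimal sketch of the INTO direction (hypothetical socket source and demo.events table; requires a running cluster):

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))  // sc: an existing SparkContext

// Each incoming line is stored as one row. With an infinite stream, pick a
// partition key that spreads writes across the cluster instead of growing
// one wide partition forever.
ssc.socketTextStream("localhost", 9999)
  .map(line => (java.util.UUID.randomUUID().toString, line))
  .saveToCassandra("demo", "events", SomeColumns("id", "payload"))

ssc.start()
ssc.awaitTermination()
```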
Spark Streaming
https://github.com/doanduyhai/Cassandra-Spark-Demo
Q & A
Spark/Cassandra best practices
Data locality · Failure handling · Cross-region operations
Cluster deployment: stand-alone cluster
(Diagram: five nodes, each co-locating a Cassandra instance with a Spark worker; one node also hosts the Spark master.)
Remember token ranges?
A: ]0, X/8]      B: ]X/8, 2X/8]    C: ]2X/8, 3X/8]   D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]  F: ]5X/8, 6X/8]   G: ]6X/8, 7X/8]   H: ]7X/8, X]
(Ring diagram: eight nodes n1–n8, each owning one token range A–H.)
Data locality
(Diagram: Spark RDD partitions are built to match the Cassandra token ranges held by each node.)
Data locality
Use Murmur3Partitioner
Read data locality
Read from Cassandra
Read data locality
Spark shuffle operations
Write to Cassandra without data locality
Because of the shuffle, the original data locality is lost.
Async batches fan out the writes to Cassandra.
Or repartition before the write
Write to Cassandra:

    rdd.repartitionByCassandraReplica("keyspace", "table")
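Put together, a sketch (assuming rdd is an RDD whose elements carry the partition key of the target keyspace.table):

```scala
import com.datastax.spark.connector._

// After a shuffle, the RDD partitions no longer line up with token ranges.
// Repartitioning by replica restores co-location just before the write:
val relocated = rdd.repartitionByCassandraReplica("keyspace", "table")
relocated.saveToCassandra("keyspace", "table")
```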
Write data locality
• either stream the data in the Spark layer using repartitionByCassandraReplica()
• or flush the data to Cassandra with async batches
• in any case there will be data movement over the network (sorry, no magic)
Joins with data locality

    CREATE TABLE artists(name text, style text, …, PRIMARY KEY(name));
    CREATE TABLE albums(title text, artist text, year int, …, PRIMARY KEY(title));

    val join: CassandraJoinRDD[(String,Int), (String,String)] =
      sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
        // Select only the columns useful for the join and the processing
        .select("artist", "year")
        .as((_: String, _: Int))
        // Repartition the RDD by the "artists" partition key, which is "name"
        .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
        // Join with the "artists" table, selecting only "name" and "country"
        .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
        .on(SomeColumns("name"))
Joins pipeline with data locality
① LOCAL READ FROM CASSANDRA
② SHUFFLE DATA WITH SPARK
③ REPARTITION TO MAP CASSANDRA REPLICAS
④ JOIN WITH DATA LOCALITY
⑤ ANOTHER ROUND OF SHUFFLING
⑥ REPARTITION AGAIN FOR CASSANDRA
⑦ SAVE TO CASSANDRA WITH LOCALITY
Perfect data locality scenario
• read locally from Cassandra
• use operations that do not require a shuffle in Spark (map, filter, …)
• repartitionByCassandraReplica() → to a table having the same partition key as the original table
• save back into this Cassandra table
USE CASE: sanitize, validate, normalize, transform data
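That scenario might be sketched as follows (hypothetical ks.users table; every step except the final write stays node-local):

```scala
import com.datastax.spark.connector._

case class User(login: String, email: String)   // hypothetical schema

sc.cassandraTable[User]("ks", "users")           // local reads, aligned with token ranges
  .filter(_.email != null)                       // map/filter: no shuffle, locality preserved
  .map(u => u.copy(email = u.email.trim.toLowerCase))
  .repartitionByCassandraReplica("ks", "users")  // same partition key as the source table
  .saveToCassandra("ks", "users")                // each worker writes to its co-located replica
```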
Failure handling (stand-alone cluster)
What if one node goes down? What if one node is overloaded?
☞ The Spark master will re-assign the job to another worker
Failure handling
Oh no, my data locality!!!
Data locality impl
Remember the RDD interface?

    abstract class RDD[T](…) {
      @DeveloperApi
      def compute(split: Partition, context: TaskContext): Iterator[T]

      protected def getPartitions: Array[Partition]

      protected def getPreferredLocations(split: Partition): Seq[String] = Nil
    }
Data locality impl
getPreferredLocations(split: Partition) returns the IP addresses of the Cassandra replicas that own the token range backing this Spark partition.
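A simplified sketch of the idea; the real CassandraTableScanRDD resolves replicas through the Java driver's cluster metadata, and the helper names below are purely illustrative:

```scala
// Pseudocode-level sketch of a Cassandra-aware RDD
class CassandraScanRDD(/* connector config … */) extends RDD[CassandraRow](sc, Nil) {

  override def compute(split: Partition, ctx: TaskContext): Iterator[CassandraRow] =
    scanTokenRange(split)        // hypothetical: CQL scan over this split's token range

  override protected def getPartitions: Array[Partition] =
    partitionsFromTokenRanges()  // hypothetical: one partition per token-range group

  // The locality hook: advertise the replicas owning this split's token range,
  // so the scheduler prefers placing the task on one of those nodes.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    replicasFor(split)           // hypothetical: replica IPs from driver metadata
}
```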
Failure handling
If RF > 1, the Spark master chooses the next preferred location, which is a replica 😎
Tune these parameters:
① spark.locality.wait
② spark.locality.wait.process
③ spark.locality.wait.node
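These can be set on the SparkConf; a config sketch (the values are illustrative, they control how long the scheduler waits for a slot at each locality level before downgrading):

```scala
val conf = new SparkConf(true)
  .set("spark.locality.wait", "3000")          // global fallback delay (ms)
  .set("spark.locality.wait.process", "3000")  // process-local level
  .set("spark.locality.wait.node", "3000")     // node-local level
```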
Cross-DC operations

    val confDC1 = new SparkConf(true)
      .setAppName("data_migration")
      .setMaster("master_ip")
      .set("spark.cassandra.connection.host", "DC_1_hostnames")
      .set("spark.cassandra.connection.local_dc", "DC_1")

    val confDC2 = new SparkConf(true)
      .setAppName("data_migration")
      .setMaster("master_ip")
      .set("spark.cassandra.connection.host", "DC_2_hostnames")
      .set("spark.cassandra.connection.local_dc", "DC_2")

    val sc = new SparkContext(confDC1)

    sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
      .map[Performer](???)
      .saveToCassandra(KEYSPACE, PERFORMERS)(
        CassandraConnector(confDC2),
        implicitly[RowWriterFactory[Performer]])
Cross-cluster operations

    val confCluster1 = new SparkConf(true)
      .setAppName("data_migration")
      .setMaster("master_ip")
      .set("spark.cassandra.connection.host", "cluster_1_hostnames")

    val confCluster2 = new SparkConf(true)
      .setAppName("data_migration")
      .setMaster("master_ip")
      .set("spark.cassandra.connection.host", "cluster_2_hostnames")

    val sc = new SparkContext(confCluster1)

    sc.cassandraTable[Performer](KEYSPACE, PERFORMERS)
      .map[Performer](???)
      .saveToCassandra(KEYSPACE, PERFORMERS)(
        CassandraConnector(confCluster2),
        implicitly[RowWriterFactory[Performer]])
Q & A
Spark/Cassandra use-cases
Data cleaning · Schema migration · Analytics
Use cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize, transform data
• Schema migration, data conversion
Data cleaning use-cases
A bug in your application? Dirty input data?
☞ Spark job to clean it up! (perfect data locality)
Sanitize, validate, normalize, transform data
Data Cleaning
https://github.com/doanduyhai/Cassandra-Spark-Demo
Schema migration use-cases
Business requirements change with time? Current data model no longer relevant?
☞ Spark job to migrate the data!
Schema migration, data conversion
Data Migration
https://github.com/doanduyhai/Cassandra-Spark-Demo
Analytics use-cases
Given existing tables of performers and albums, I want:
① the top 10 most common music styles (pop, rock, RnB, …)
② performer productivity (album count) by origin country and by decade
☞ Spark job to compute the analytics!
Analytics (join, aggregate, transform, …)
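The aggregation logic of query ① can be sketched with plain Scala collections (the sample data is hypothetical; with the connector, the same pipeline would run on sc.cassandraTable[Performer](…) using map/reduceByKey):

```scala
case class Performer(name: String, style: String)

val performers = Seq(
  Performer("a", "rock"), Performer("b", "rock"),
  Performer("c", "pop"),  Performer("d", "jazz"), Performer("e", "rock"))

// Count performers per style, then keep the 10 most common styles
val top10: Seq[(String, Int)] = performers
  .groupBy(_.style)
  .map { case (style, ps) => (style, ps.size) }
  .toSeq
  .sortBy { case (_, count) => -count }
  .take(10)
```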
Analytics pipeline
① Read from the production transactional tables
② Perform the aggregation with Spark
③ Save the results back into dedicated tables for fast visualization
④ Repeat step ①
Analytics
https://github.com/doanduyhai/Cassandra-Spark-Demo
Q & A
The “Big Data Platform”
Our vision
We had a dream … to provide a Big Data Platform built for the performance & high availability demands of IoT, web & mobile applications.
Until now in Datastax Enterprise
Cassandra + Solr in the same JVM unlocks full text search power for Cassandra, via a CQL syntax extension:

    SELECT * FROM users WHERE solr_query = 'age:[33 TO *] AND gender:male';
    SELECT * FROM users WHERE solr_query = 'lastname:*schwei?er';
Cassandra + Solr
Now with Spark
Cassandra + Spark
Unlock full analytics power for Cassandra
Spark/Cassandra connector
Tomorrow (DSE 4.7)
Cassandra + Spark + Solr
Unlock full text search + analytics power for Cassandra
The idea
① Filter as much as possible with Cassandra and Solr
② Fetch only a small data set into memory
③ Aggregate with Spark
☞ near-real-time interactive analytics queries are possible with restrictive criteria
Datastax Enterprise 4.7
With a third component for full text search, how do we preserve data locality?
Stand-alone search cluster caveat
The bad way v1: perform the search from the Spark “driver program”
The bad way v2: search from the Spark workers with restricted routing
The bad way v3: search from the Cassandra nodes with a connector to Solr
Stand-alone search cluster caveat
The ops team won’t be your friend:
• three clusters to manage: Spark, Cassandra & Solr/whatever
• lots of moving parts
Impacts on Spark jobs:
• increased response time due to latency
• the 99.9th percentile can be very slow
Datastax Enterprise 4.7
The right way: distributed “local search”
Datastax Enterprise 4.7

    val join: CassandraJoinRDD[(String,Int), (String,String)] =
      sc.cassandraTable[(String,Int)](KEYSPACE, ALBUMS)
        // Select only the columns useful for the join and the processing
        .select("artist", "year")
        .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
        .as((_: String, _: Int))
        .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
        .joinWithCassandraTable[(String,String)](KEYSPACE, ARTISTS, SomeColumns("name","country"))
        .on(SomeColumns("name"))
        .where("solr_query = 'age:[20 TO 30]'")

① compute the Spark partitions using Cassandra token ranges
② on each partition, use Solr for local data filtering (no distributed query!)
③ fetch the data back into Spark for the aggregations
Datastax Enterprise 4.7
Token range: ]3X/8, 4X/8]

    SELECT … FROM …
    WHERE token(#partition) > 3X/8
      AND token(#partition) <= 4X/8
      AND solr_query = 'full text search expression';

Advantages of the same-JVM Cassandra + Solr integration:
① single-pass local full text search (no fan-out)
② local data retrieval
Datastax Enterprise 4.7
Scalable solution: ×2 volume → ×2 nodes → constant processing time
Cassandra + Spark + Solr
https://github.com/doanduyhai/Cassandra-Spark-Demo
Q & A