Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
TRANSCRIPT
![Page 1: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/1.jpg)
Spark After Dark
Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
![Page 2: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/2.jpg)
Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer
advancedspark.com
![Page 3: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/3.jpg)
Why After Dark?
Playboy After Dark
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!
![Page 4: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/4.jpg)
What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries
![Page 5: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/5.jpg)
Spark in Production
![Page 6: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/6.jpg)
What is Databricks?
Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring
![Page 7: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/7.jpg)
Goals of After Dark?
① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
Spark Streaming -> Kafka, Approximations
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factorization, Word2Vec
Images courtesy of tinder.com. Not affiliated with Tinder in any way!
![Page 8: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/8.jpg)
Popular Dating Sites
![Page 9: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/9.jpg)
Focus of This Talk
Spark and…
① Parallelism
② Performance
③ Real-time Streaming
④ Approximations
⑤ Similarity Measures
![Page 10: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/10.jpg)
Parallelism
![Page 11: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/11.jpg)
Brady Bunch circa 1974
Season 5, Episode 18: “Two Petes in a Pod”
![Page 12: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/12.jpg)
Parallel Algorithm : O(log n)
![Page 13: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/13.jpg)
Non-parallel Algorithm : O(n)
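To make the O(log n) versus O(n) contrast concrete, here is a minimal pure-Python sketch (not Spark code) that counts the sequential steps needed to sum n values one at a time versus in pairwise rounds. It assumes enough workers exist to perform all additions of a round at once; the function names are illustrative only.

```python
def sequential_sum(values):
    """Add values one at a time; returns (total, steps)."""
    total, steps = 0, 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Combine values pairwise in rounds; returns (total, rounds)."""
    level = list(values)
    if not level:
        return 0, 0
    rounds = 0
    while len(level) > 1:
        # every pair in a round is independent, so the pairs could run in parallel
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


data = list(range(1, 1025))  # 1024 values
seq_total, seq_steps = sequential_sum(data)
par_total, par_rounds = tree_sum(data)
```

Both functions return the same total; the sequential version needs one step per value, while the pairwise version needs about log2(n) rounds.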
![Page 14: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/14.jpg)
Spark is Parallel
![Page 15: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/15.jpg)
Performance
![Page 16: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/16.jpg)
Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
![Page 17: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/17.jpg)
Improved Shuffle and Network Layer
① “Sort-based shuffle”
② Minimize OS resources
③ Switched to async Netty
④ Keep CPUs hot
⑤ Reuse byte buffers to minimize GC
⑥ Use epoll for I/O to stay in kernel space
![Page 18: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/18.jpg)
Project Tungsten: CPU and Memory
① More JVM bytecode generation, JIT optimize
② CPU-cache-aware data structs and algos
③ Custom memory management: Serializers, HashMap
![Page 19: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/19.jpg)
DataFrames and Catalyst
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!!
--> JVM bytecode generation
![Page 20: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/20.jpg)
Columnar Storage Format
*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
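The min-max idea can be sketched in a few lines of plain Python. This is an illustration under assumed names (`build_chunks`, `scan_range`), not how any particular columnar format lays out its data: each chunk of a sorted column stores its smallest and largest value, and a range scan skips any chunk whose stored range cannot overlap the query.

```python
def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks, recording min and max per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        part = sorted_values[i:i + chunk_size]
        chunks.append({"min": part[0], "max": part[-1], "values": part})
    return chunks


def scan_range(chunks, low, high):
    """Return (matching values, chunks actually read) for low <= v <= high."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk["max"] < low or chunk["min"] > high:
            continue  # the stored min/max prove no value here can match
        chunks_read += 1
        hits.extend(v for v in chunk["values"] if low <= v <= high)
    return hits, chunks_read


column = list(range(100))                      # already sorted
chunks = build_chunks(column, chunk_size=10)   # 10 chunks of 10 values
hits, chunks_read = scan_range(chunks, 42, 57)
```

Here only the two chunks covering 40-49 and 50-59 are read; the other eight are skipped on metadata alone.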
![Page 21: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/21.jpg)
Parquet File Format
① Based on Google Dremel Paper
② Implemented by Twitter and Cloudera
③ Columnar storage format
④ Optimized for fast columnar aggregations
⑤ Tight compression
⑥ Supports pushdowns
⑦ Nested, self-describing, evolving schema
![Page 22: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/22.jpg)
Types of Compression
① Run Length Encoding: Repeated data
② Dictionary Encoding: Fixed set of values
③ Delta, Prefix Encoding: Sorted dataset
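A rough pure-Python sketch of the three encodings, one small function each. These are textbook versions written for illustration, with hypothetical sample data; real columnar formats implement them at the byte level with more care.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code; returns (dictionary, codes)."""
    dictionary = {}
    codes = []
    for v in values:
        # a new value gets the next free code; a known value reuses its code
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Keep the first value, then only the difference from the previous value."""
    if not sorted_values:
        return []
    deltas = [sorted_values[0]]
    for prev, cur in zip(sorted_values, sorted_values[1:]):
        deltas.append(cur - prev)
    return deltas


rle = run_length_encode(["a", "a", "a", "b", "b", "a"])
dictionary, codes = dictionary_encode(["US", "FR", "US", "US", "DE"])
deltas = delta_encode([1000, 1003, 1004, 1010])
```

Each encoding pays off on the kind of data named on the slide: long runs, a small set of distinct values, and sorted values with small gaps.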
![Page 23: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/23.jpg)
Types of Pushdowns
① Column, Partition Pruning
② Row, Predicate Filtering
![Page 24: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/24.jpg)
Real-time Streaming
![Page 25: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/25.jpg)
Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
At-least-once delivery guarantee
![Page 26: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/26.jpg)
Approximations
![Page 27: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/27.jpg)
Count Min Sketch
① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
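For readers who have not seen one, a Count-Min Sketch fits in about twenty lines. The sketch below is a plain-Python illustration, not Algebird's implementation; the class name, the salted MD5 hashing, and the sample like stream are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters: depth rows, width columns."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # one column per row, derived from a row-salted hash
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode("utf-8")).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        # collisions only ever inflate a counter, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._columns(item))


sketch = CountMinSketch(width=1000, depth=5)
# hypothetical stream of liked user ids
likes = ["user1"] * 50 + ["user2"] * 20 + [f"user{i}" for i in range(3, 500)]
for liked_user in likes:
    sketch.add(liked_user)
```

Memory stays at width x depth counters no matter how many distinct users appear, and an estimate can overcount but never undercount.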
![Page 28: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/28.jpg)
HyperLogLog
① Measures set cardinality: Approx count distinct
② Low memory: 1.5KB @ 2% error, 10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
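A compact HyperLogLog can be sketched as follows. This is a simplified illustration of the original algorithm, not Algebird's or Spark's code; the class name, the SHA-1 based 64-bit hash, and the choice of 1024 registers are assumptions. The standard error of the estimator is roughly 1.04 / sqrt(m) for m registers, so about 3% here; a couple of thousand registers of a few bits each is in line with the slide's 1.5KB at 2% figure.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # bias-correction constant from the HyperLogLog paper, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(20000):
    hll.add(f"user{i % 5000}")   # 20,000 events, 5,000 distinct users
estimate = hll.count()
```

The estimate depends only on the set of distinct items, so replaying duplicates leaves it unchanged.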
![Page 29: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/29.jpg)
10 Recommendations
![Page 30: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/30.jpg)
Types of Recommendations
① Non-personalized (2 out of 10)
Cold Start: No preference or behavior data for user, yet
② Personalized (8 out of 10)
User-Item Similarity: Items that others with similar prefs have liked
Item-Item Similarity: Items similar to your previously-liked items
![Page 31: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/31.jpg)
Interactive Demo!
![Page 32: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/32.jpg)
Audience Participation Needed!
① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses
-> You are here
![Page 33: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/33.jpg)
Non-personalized Recommendations
![Page 34: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/34.jpg)
Summary Statistics and Aggregations
① Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
Spark SQL + DataFrame: Aggregations
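The aggregation itself is a group-by-and-count over like events. The snippet below shows the logic in plain Python on hypothetical `(from_user, to_user)` pairs; in the talk's setting the same shape of query would run as a Spark SQL or DataFrame aggregation over the full dataset.

```python
from collections import Counter

# hypothetical (from_user, to_user) like events
likes = [
    ("kimberly", "matei"), ("leslie", "matei"), ("lisa", "matei"),
    ("kimberly", "ali"), ("holden", "ali"),
    ("meredith", "andy"),
]

# count how many likes each user received, then keep the two most-liked
like_counts = Counter(to_user for _, to_user in likes)
top_users = like_counts.most_common(2)
```

Because it needs no information about the person asking, this recommendation works for brand-new users (the cold-start case).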
![Page 35: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/35.jpg)
Like Graph Analysis
② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
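The random-walk intuition in the quote maps onto PageRank's power iteration. Below is a small self-contained sketch in plain Python; it is not GraphX's implementation (which normalizes ranks differently), and the damping factor, iteration count, and the four-node like graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # a node that likes nobody spreads its rank evenly over all nodes
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# hypothetical like graph: (liker, liked)
like_graph = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(like_graph)
top_influencer = max(ranks, key=ranks.get)
```

User "c" is liked by three others and ends up with the highest rank; "a" ranks second because it is liked by the influential "c".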
![Page 36: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/36.jpg)
Demo! Spark SQL + DataFrames + GraphX
![Page 37: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/37.jpg)
Similarity Measures
![Page 38: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/38.jpg)
Types of Similarity
① Euclidean: linear measure
Magnitude bias
② Cosine: angle measure
Adjust for magnitude bias
③ Jaccard: Set intersection divided by union
Popularity bias
④ Log Likelihood
Adjust for pop. bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
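The first three measures are short enough to write out. The sketch below uses plain Python and hypothetical like vectors (the column positions of the slide's table did not survive extraction, so its data is not reused); log likelihood is omitted. Zero vectors and empty sets are not handled.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Treat each 0/1 vector as the set of positions holding a 1."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    return len(set_a & set_b) / len(set_a | set_b)


# hypothetical like vectors over five users (nonzero = liked)
user_x = [1, 1, 0, 1, 0]
user_y = [1, 1, 1, 1, 1]
user_z = [2, 2, 0, 2, 0]   # same direction as user_x, twice the magnitude
```

`user_x` and `user_z` point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not 0; that is the magnitude bias cosine adjusts for. `user_y` likes everyone, which is the kind of popularity effect that pulls Jaccard scores around.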
![Page 39: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/39.jpg)
All-pairs Similarity Measure
① Compare everything to everything
② a.k.a. “pair-wise similarity” or “similarity join”
③ Naïve shuffle: O(m*n^2); m=rows, n=cols
④ Minimize shuffle: reduce data size & approx
Reduce m (rows): Sampling and bucketing
Reduce n (cols): Remove most frequent value (0?)
![Page 40: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/40.jpg)
Sampling Algo: DIMSUM
① “Dimension Independent Matrix Square Using MapReduce”
② Remove rows with low similarity probability
③ MLlib: RowMatrix.columnSimilarities(…)
④ Twitter: 40% efficiency gain over Cosine
![Page 41: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/41.jpg)
Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix: O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
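One common way to do the bucketing is MinHash with banding, which targets Jaccard similarity. The slide does not say which hash scheme is meant, so treat the following plain-Python sketch as an assumption-laden illustration: function names, the salted MD5 hashes, and the profile data are hypothetical, and every input set must be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """For each salted hash function keep the smallest hash over the set."""
    return [
        min(int(hashlib.md5(f"{seed}:{item}".encode("utf-8")).hexdigest(), 16)
            for item in items)
        for seed in range(num_hashes)
    ]


def candidate_pairs(sets_by_name, num_hashes=20, rows_per_band=2):
    """Bucket signatures band by band; only names sharing a bucket become candidates."""
    buckets = defaultdict(set)
    for name, items in sets_by_name.items():
        sig = minhash_signature(items, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(sig[start:start + rows_per_band])
            buckets[(start, band)].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# hypothetical profiles: user -> set of interests
profiles = {
    "a": {"hiking", "jazz", "sushi"},
    "b": {"hiking", "jazz", "sushi"},
    "c": {"chess", "opera", "tacos"},
}
pairs = candidate_pairs(profiles)
```

Only pairs that land in a shared bucket are compared afterwards, and each bucket can be processed independently, which is where the parallelism comes from.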
![Page 42: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/42.jpg)
MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
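The saving comes from touching only stored entries. The sketch below mimics the idea with plain Python dicts rather than MLlib's `SparseVector`; the helper names and vectors are hypothetical, and the dot product shown is only valid when the dropped default value is 0.

```python
def to_sparse(dense, default=0):
    """Keep only the entries that differ from the most frequent (default) value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a, b):
    """Dot product of two sparse vectors whose dropped value is 0."""
    if len(a) > len(b):
        a, b = b, a   # iterate over the vector with fewer stored entries
    return sum(v * b[i] for i, v in a.items() if i in b)


dense_a = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
dense_b = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sparse_a, sparse_b = to_sparse(dense_a), to_sparse(dense_b)
dot = sparse_dot(sparse_a, sparse_b)
```

Each vector stores 3 entries instead of 10, and the dot product does work proportional to the number of nonzeros rather than the number of columns.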
![Page 43: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/43.jpg)
Personalized Recommendations
![Page 44: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/44.jpg)
Personalized Recommendation Terms
① User: User seeking likeable recommendations
② Item: User who has been liked
*Also a user seeking likeable recommendations!
③ Types of Feedback
Explicit: rating, like
Implicit: search, click, hover, view, scroll
![Page 45: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/45.jpg)
Collaborative Filtering Personalized Recs
③ Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
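The quoted intuition can be shown with a neighborhood-based recommender. Note that this is a simpler stand-in for the matrix factorization named on the slide, not that algorithm: it weights each other user by the Jaccard overlap of likes and scores the people they liked that I have not. All names and data are hypothetical.

```python
def recommend(likes_by_user, me, top_n=3):
    """Score unseen people by the like-overlap of the users who liked them."""
    my_likes = likes_by_user[me]
    scores = {}
    for other, their_likes in likes_by_user.items():
        if other == me:
            continue
        union = my_likes | their_likes
        similarity = len(my_likes & their_likes) / len(union) if union else 0.0
        # candidates: people the other user liked that I have not liked yet
        for candidate in their_likes - my_likes - {me}:
            scores[candidate] = scores.get(candidate, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, score in ranked[:top_n] if score > 0]


# hypothetical like data: user -> set of people they liked
likes_by_user = {
    "me":    {"ali", "matei"},
    "userA": {"ali", "matei", "reynold"},
    "userB": {"ali", "patrick"},
    "userC": {"andy"},
}
recs = recommend(likes_by_user, "me")
```

`userA` shares both of my likes, so their extra like ranks first; `userC` shares none, so their like is not recommended at all.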
![Page 46: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/46.jpg)
Text-based Personalized Recs
④ Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
![Page 47: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/47.jpg)
More Text-based Personalized Recs
⑤ Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
![Page 48: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/48.jpg)
More Text-based Personalized Recs
⑥ Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
^ Her Email
< My Profile
![Page 49: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/49.jpg)
Personalized Recommendations: The Future
![Page 50: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/50.jpg)
Facial Recognition
⑦ Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
![Page 51: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/51.jpg)
Conversation Starter Bot
⑧ NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
Positive response ->
Negative response <-
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
![Page 52: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/52.jpg)
Maintaining the Spark
![Page 53: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/53.jpg)
Compromise Recommendations (Couples)
⑨ Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> … … <- similar actors
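Once item-item similarity has been turned into a graph, the "something in between" is any node on a shortest path between the two titles. Here is a breadth-first-search sketch in plain Python rather than GraphX; the intermediate movies and the edges are placeholders invented for the example.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# hypothetical similarity graph: an edge links two movies judged similar
similar = {
    "Mad Max": ["Movie A", "Movie B"],
    "Movie A": ["Mad Max"],
    "Movie B": ["Mad Max", "Movie C"],
    "Movie C": ["Movie B", "Message In a Bottle"],
    "Message In a Bottle": ["Movie C"],
}
path = shortest_path(similar, "Mad Max", "Message In a Bottle")
```

The movies in the middle of the returned path ("Movie B", "Movie C") are the compromise candidates.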
![Page 54: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/54.jpg)
⑩ The Final Recommendation
![Page 55: Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark](https://reader038.vdocuments.site/reader038/viewer/2022103019/55beca2abb61eb3a248b4689/html5/thumbnails/55.jpg)
⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/[email protected]
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try !!