Albert-Ludwigs-Universität Freiburg
Practical / Praktikum WS17/18
October 18th, 2017
Distributed Computing Using Spark
Prof. Dr. Georg Lausen
Anas Alzogbi
Victor Anthony Arrascue Ayala
Agenda
Introduction to Spark
Case-study: Recommender system for scientific papers
Organization
Hands-on session
18.10.2017 Distributed Computing Using Spark WS17/18 2
Introduction to Spark
Distributed programming
MapReduce
Spark
Distributed programming - problem
Data grows faster than processing capabilities
- Web 2.0: users generate content
- Social networks, online communities, etc.
Source: https://www.flickr.com/photos/will-lion/2595497078
Big Data
Source: https://www.domo.com/blog/2015/08/data-never-sleeps-3-0/
Source: http://www.bigdata-startups.com/open-source-tools/
Big Data
Buzzword
Often less-structured
Requires different techniques, tools, approaches
- To solve new problems or old ones in a better way
Network Programming Models
Requires a communication protocol for programming parallel computers (slow)
- e.g. MPI (Message Passing Interface)
Placement of data and code across the network has to be managed manually
No failure management
Network problems not solved (e.g. stragglers)
Data Flow Models
Higher level of abstraction: algorithms are parallelized on large clusters
Fault-recovery by means of data replication
Job divided into a set of independent tasks
- Code is shipped to where the data is located
Good scalability
MapReduce – Key ideas
1. The problem is split into smaller problems (map step)
2. The smaller problems are solved in parallel
3. Finally, the solutions to the smaller problems are combined into a solution to the original problem (reduce step)
MapReduce – Overview
[Figure: the input data is divided into splits (split 0, split 1, split 2, …); Map tasks process the splits and emit <k,v> data; Reduce tasks turn the <k,v> data into output 0 and output 1]
A target problem has to be parallelizable!!!
MapReduce – Wordcount example
Google Maps charts new territory into businesses
Google selling new tools for businesses to build their own maps
Google promises consumer experience for businesses with Maps Engine Pro
Google is trying to get its Maps service used by more businesses
Google 4
Maps 4
Businesses 4
Engine 1
Charts 1
Territory 1
Tools 1
…
MapReduce – Wordcount’s map
Google Maps charts new territory into businesses
Google selling new tools for businesses to build their own maps
Google promises consumer experience for businesses with Maps Engine Pro
Google is trying to get its Maps service used by more businesses
Map 1 (first two headlines): Google 2, Charts 1, Maps 2, Territory 1, …
Map 2 (last two headlines): Google 2, Businesses 2, Maps 2, Service 1, …
MapReduce – Wordcount’s reduce
Reduce 1: Google 2, Google 2, Maps 2, Maps 2, … → Google 4, Maps 4, …
Reduce 2: Businesses 2, Businesses 2, Charts 1, Territory 1, … → Businesses 4, Charts 1, Territory 1, …
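The map and reduce steps of the wordcount example can be sketched in plain Python, without Hadoop or Spark. This is only an illustration under stated assumptions: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, words are lower-cased so that "Maps" and "maps" count as the same word (as the totals on the slide suggest), and each map task pre-aggregates its own split, mirroring the per-map counts shown above.

```python
from collections import Counter, defaultdict

# The four headlines from the slides, divided into two input splits.
SPLITS = [
    [
        "Google Maps charts new territory into businesses",
        "Google selling new tools for businesses to build their own maps",
    ],
    [
        "Google promises consumer experience for businesses with Maps Engine Pro",
        "Google is trying to get its Maps service used by more businesses",
    ],
]


def map_phase(lines):
    """Map task: emit (word, local_count) pairs for one input split."""
    local_counts = Counter()
    for line in lines:
        for word in line.lower().split():
            local_counts[word] += 1
    return list(local_counts.items())


def shuffle(mapped_outputs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce task: sum the partial counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


mapped = [map_phase(split) for split in SPLITS]  # independent tasks, could run in parallel
word_counts = reduce_phase(shuffle(mapped))
print(word_counts["google"], word_counts["maps"], word_counts["businesses"])  # prints: 4 4 4
```

The map tasks never look at each other's splits, which is what makes the job parallelizable; only the shuffle moves data between machines.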
MapReduce
Automatic
- Partition and distribution of data
- Parallelization and assignment of tasks
- Scalability, fault-tolerance, scheduling
Apache Hadoop
Open-source implementation of MapReduce
Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Ecosystem.php
MapReduce – Parallelizable algorithms
Matrix-vector multiplication
Power iteration (e.g. PageRank)
Gradient descent methods
Stochastic SVD
Matrix Factorization (Tall skinny QR)
etc…
MapReduce – Limitations
Inefficient for multi-pass algorithms
No efficient primitives for data sharing
State between steps is materialized and distributed
Slow due to replication and storage
Source: http://stanford.edu/~rezab/sparkclass
Limitations – PageRank
Requires iterations of multiplications of sparse matrix and vector
Source: http://stanford.edu/~rezab/sparkclass
Limitations – PageRank
MapReduce sometimes requires asymptotically more communication or I/O
Iterations are handled very poorly
Reading and writing to disk is a bottleneck
- In some cases 90% of time is spent on I/O
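To make the iteration problem concrete, the following plain-Python sketch runs PageRank as a power iteration. Each call to `pagerank_iteration` is one sparse matrix-vector multiplication, which in MapReduce would be a separate job whose result is written to and re-read from storage. The toy graph, the damping factor of 0.85 and all names are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical toy web graph: page -> pages it links to (no dangling pages).
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
DAMPING = 0.85  # commonly used damping factor, an assumption of this sketch


def pagerank_iteration(ranks):
    """One power-iteration step = one sparse matrix-vector multiplication."""
    contributions = defaultdict(float)
    for page, targets in LINKS.items():
        # "Map": every page sends an equal share of its rank to each page it links to.
        share = ranks[page] / len(targets)
        for target in targets:
            # "Reduce": the shares arriving at the same target are summed.
            contributions[target] += share
    n = len(LINKS)
    return {page: (1 - DAMPING) / n + DAMPING * contributions[page] for page in LINKS}


ranks = {page: 1 / len(LINKS) for page in LINKS}
for _ in range(100):
    # With MapReduce, every pass through this loop would go through disk.
    ranks = pagerank_iteration(ranks)

print({page: round(rank, 3) for page, rank in ranks.items()})
```

The loop body is cheap; the cost in a MapReduce setting comes from materializing `ranks` between the 100 passes, which is the bottleneck the slide describes.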
Spark Processing Framework
Developed in 2009 at UC Berkeley’s AMPLab
Open-sourced in 2010, later donated to the Apache Software Foundation
- Most active big data community
- Industrial contributions: over 50 companies
Written in Scala
- Good at serializing closures
Clean APIs in Java, Scala, Python, R
Spark Processing Framework
[Figure: Contributors (2014)]
Spark – High Level Architecture
[Figure: high-level architecture diagram; recoverable label: HDFS]
Source: https://mapr.com/ebooks/spark/
Spark - Running modes
Local mode: for debugging
Cluster mode
- Standalone mode
- Apache Mesos
- Hadoop YARN
Spark – Programming model
SparkContext: the entry point
SparkSession: since Spark 2.0
- New unified entry point. It combines SQLContext, HiveContext and, in the future, StreamingContext
SparkConf: to configure the context
Spark’s interactive shell
- Scala: spark-shell
- Python: pyspark
Spark – RDDs, the game changer
Resilient distributed datasets
A typed data-structure (RDD[T]) that is not language specific
Each element of type T is stored locally on a machine
- It has to fit in memory
An RDD can be cached in memory
Resilient Distributed Datasets
Immutable collections of objects, spread across cluster
User controlled partitioning and storage
Automatically rebuilt on failure
Since Spark 2.0, Dataset, which is strongly typed like an RDD, is the recommended replacement for RDDs (the RDD API is still supported)
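"Automatically rebuilt on failure" rests on lineage: a dataset remembers how it was derived, so a lost partition can be recomputed from its parent instead of being restored from a replica. The toy model below illustrates the idea in plain Python. It is not Spark's implementation: the class `ToyRDD` and its methods are hypothetical, and `map` is evaluated eagerly here to keep the sketch short.

```python
class ToyRDD:
    """Immutable, partitioned collection that remembers how it was derived."""

    def __init__(self, partitions, parent=None, func=None):
        self._partitions = [tuple(p) for p in partitions]  # tuples: read-only data
        self._parent = parent  # lineage: the dataset this one was derived from
        self._func = func      # lineage: the per-element function that was applied

    def map(self, func):
        """Transformation: returns a new ToyRDD, the original stays unchanged."""
        new_parts = [[func(x) for x in part] for part in self._partitions]
        return ToyRDD(new_parts, parent=self, func=func)

    def lose_partition(self, index):
        """Simulate a machine failure that wipes one partition."""
        self._partitions[index] = None

    def get_partition(self, index):
        """Return a partition, recomputing it from the lineage if it was lost."""
        if self._partitions[index] is None:
            if self._parent is None:
                raise RuntimeError("source data lost and no lineage to rebuild it")
            parent_part = self._parent.get_partition(index)
            self._partitions[index] = tuple(self._func(x) for x in parent_part)
        return self._partitions[index]

    def collect(self):
        """Gather the elements of all partitions into one list."""
        return [x for i in range(len(self._partitions)) for x in self.get_partition(i)]


numbers = ToyRDD([[1, 2], [3, 4]])
squares = numbers.map(lambda x: x * x)
squares.lose_partition(1)   # pretend the node holding partition 1 crashed
print(squares.collect())    # partition 1 is rebuilt from `numbers`: [1, 4, 9, 16]
```

Only the lost partition is recomputed; partition 0 of `squares` is reused as it is.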
Spark – Wordcount example
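# sc is the SparkContext (predefined in the pyspark shell)
text_file = sc.textFile("...")
counts = (text_file.flatMap(lambda line: line.split(" "))  # one record per word
          .map(lambda word: (word, 1))                     # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))                # sum the counts per word
counts.saveAsTextFile("...")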
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
Spark – Data manipulation
Transformations: always yield a new RDD instance (RDDs are immutable)
- filter, map, flatMap, etc.
Actions: trigger a computation on the RDD’s elements
- count, foreach, etc.
Lazy evaluation of transformations
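The split between lazy transformations and eager actions can be illustrated with a small plain-Python model. It only mimics the behaviour described above and is not how Spark is implemented; the class `LazyDataset` and the helper `double` are hypothetical names chosen for this sketch.

```python
class LazyDataset:
    """Toy dataset: transformations are only recorded, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet executed

    def map(self, func):
        """Transformation: returns a new dataset with one more recorded step."""
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        """Transformation: returns a new dataset with one more recorded step."""
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    def count(self):
        """Action: triggers the computation and returns the number of elements."""
        return sum(1 for _ in self._run())

    def collect(self):
        """Action: triggers the computation and returns all elements."""
        return list(self._run())


calls = []  # records every time the mapped function really runs


def double(x):
    calls.append(x)
    return 2 * x


data = LazyDataset(range(5))
doubled = data.map(double)                 # transformation: nothing runs yet
large = doubled.filter(lambda x: x >= 4)   # still nothing runs
calls_before_action = len(calls)
result = large.collect()                   # action: the whole chain runs now
print(calls_before_action, result)         # prints: 0 [4, 6, 8]
```

Because every transformation returns a new object, `data` and `doubled` remain usable and unchanged, matching the immutability of RDDs.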
Spark – DataFrames
DataFrame API introduced in Spark 1.3
Handles table-like representation with named columns and declared column types
Not to be confused with Python’s Pandas DataFrames
Spark translates DataFrame and SQL operations into low-level RDD operations
Since Spark 2.0, DataFrame is implemented as a special case of Dataset
DataFrames – How to create DFs
1. Converting existing RDDs
2. Running SQL queries
3. Loading external data
Spark SQL
SQL context
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
# Run a SQL statement; returns a DataFrame
students = sqlContext.sql("SELECT name FROM people WHERE occupation = 'student'")
Spark – DataFrames
Source: Spark in Action (book, see literature)
Machine Learning (ML) with Spark
ML project steps
1. Data collection
2. Data cleaning and preparation
3. Data analysis and feature extraction
4. Model training
5. Model evaluation
6. Model application
Source: Spark in Action (book, see literature)
Machine Learning (ML) with Spark
ML with Spark
- Well suited to parallelizable ML algorithms
- A single platform (the same system and the same API) for performing most tasks:
• Collect, prepare, analyze the data
• Train, evaluate, use the model
- Trains and applies ML algorithms on very large datasets
- Offers most of the popular ML algorithms
Machine Learning (ML) with Spark
MLlib
- Spark’s machine learning library
- Provides a generalized API for training and tuning different algorithms in the same way (influenced by scikit-learn)
- Relies on several low-level libraries for performing optimized linear algebra operations:
• Breeze, jblas for Scala and Java
• NumPy for Python
Machine Learning (ML) with Spark
MLlib has two APIs
- RDD-based API (spark.mllib)
• Expected to be removed in Spark 3.0
- DataFrame-based API (spark.ml), which continues to receive new features
• More user-friendly API than RDDs
• A uniform API across ML algorithms and across multiple languages
• Facilitates practical ML Pipelines (feature transformations)
MLlib abstractions
Transformer
- Main method: transform
- Examples:
• ML model
• Feature transformer
Estimator
- Main method: fit
- Example: ML algorithm
Evaluator
- Example: RMSE metric
[Figure: Input dataset → Estimator (fit) → Transformer (transform) → Transformed dataset → Evaluator → Evaluation results]
Source: Spark in Action (book, see literature)
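The three abstractions can be mimicked in a few lines of plain Python. The sketch below is not the MLlib API: rows are plain dictionaries instead of DataFrame rows, and the classes `MeanEstimator`, `MeanModel` and `RmseEvaluator` are hypothetical. It only shows how fit, transform and evaluate relate to each other.

```python
import math


class MeanModel:
    """Transformer: adds a 'prediction' column to every row."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # Returns new rows; the input rows are left untouched.
        return [dict(row, prediction=self.mean) for row in rows]


class MeanEstimator:
    """Estimator: fit() learns from the data and returns a Transformer (the model)."""

    def fit(self, rows):
        labels = [row["label"] for row in rows]
        return MeanModel(sum(labels) / len(labels))


class RmseEvaluator:
    """Evaluator: scores a transformed dataset with the RMSE metric."""

    def evaluate(self, rows):
        squared_errors = [(row["label"] - row["prediction"]) ** 2 for row in rows]
        return math.sqrt(sum(squared_errors) / len(squared_errors))


train = [{"label": 1.0}, {"label": 2.0}, {"label": 3.0}]
test = [{"label": 2.0}, {"label": 4.0}]

model = MeanEstimator().fit(train)             # Estimator -> Transformer
predictions = model.transform(test)            # Transformer -> transformed dataset
rmse = RmseEvaluator().evaluate(predictions)   # Evaluator -> evaluation result
print(model.mean, rmse)
```

The model predicts the training mean (2.0) for every row, so the RMSE on the two test rows is the square root of 2.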
MLlib Pipelines
A pipeline chains multiple Transformers and Estimators together to specify an ML workflow
Example: learn a prediction model using features extracted from text documents
- Training phase
- Test phase
Source: http://spark.apache.org/docs/latest/ml-pipeline.html#properties-of-pipeline-components
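How a pipeline behaves in the training and test phases can be sketched in plain Python. The names `Tokenizer`, `Pipeline` and `PipelineModel` mirror names used in spark.ml, but the classes below are simplified stand-ins that work on lists of dictionaries rather than DataFrames; `KeywordClassifier` and `KeywordModel` are hypothetical and exist only for this illustration.

```python
class Tokenizer:
    """Transformer stage: splits the 'text' column into lower-case words."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class KeywordModel:
    """Transformer produced by KeywordClassifier.fit()."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            dict(row, prediction=int(any(w in self.keywords for w in row["words"])))
            for row in rows
        ]


class KeywordClassifier:
    """Estimator stage: learns the words seen only in positive examples."""

    def fit(self, rows):
        positive = {w for row in rows if row["label"] == 1 for w in row["words"]}
        negative = {w for row in rows if row["label"] == 0 for w in row["words"]}
        return KeywordModel(positive - negative)


class Pipeline:
    """Chains stages; fit() turns every Estimator into a fitted Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator: fit it on the current data
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: contains Transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


training = [
    {"text": "spark is fast", "label": 1},
    {"text": "hadoop mapreduce job", "label": 0},
]
# Training phase: fit the whole chain on labelled documents.
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(training)
# Test phase: apply the fitted chain to unlabelled documents.
scored = model.transform([{"text": "Spark cluster"}, {"text": "mapreduce job"}])
print([row["prediction"] for row in scored])  # prints: [1, 0]
```

The same feature-extraction stage runs in both phases, so training and test documents are guaranteed to be processed identically.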
Case-study: Recommender system for scientific papers
Motivation
- Recommend relevant papers to users
Dataset
- Set of papers (~172 K)
• Textual content: title + abstract
• Attributes: type, journal, pages, year, …
- Set of users (~28 K)
- Ratings (~828 K ratings)
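A back-of-the-envelope calculation with the approximate figures above shows how sparse the rating data is. The numbers are the rounded values from the slide; viewing the ratings as cells of a users × papers matrix is an assumption of this sketch.

```python
# Approximate dataset sizes from the slide (rounded values).
num_papers = 172_000
num_users = 28_000
num_ratings = 828_000

possible_pairs = num_papers * num_users      # cells of the user x paper matrix
density = num_ratings / possible_pairs       # fraction of cells that hold a rating
ratings_per_user = num_ratings / num_users   # average number of ratings per user

print(f"density: {density:.4%}, ratings per user: {ratings_per_user:.1f}")
```

Under these assumptions only about 0.017% of all user-paper pairs carry a rating, roughly 30 ratings per user on average.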
Organization
Team
Educational goals
Requirements
ILIAS
Experiments’ submissions
Assessment
Discussion with the tutors
Schedule
Team
Prof. Georg Lausen
Assistants
- Anas
- Anthony
Tutors
- Polina Koleva
- Matteo Cossu
Educational goals
Distributed programming paradigm
Recommender Systems (use case)
Theoretical and practical training
- Master project and thesis
Data science profile for the job market
Requirements
Mandatory
- Registration via HisInOne
- Attendance at the kick-off meeting
Recommended
- Attendance of the DAQL, SIDS or ML lectures
- Basics of Python programming
ILIAS
Distributed Computing Using Spark - WS1718
https://ilias.uni-freiburg.de/goto.php?target=crs_878841
Access with course password
Forum for clarifying questions about the tasks
- Do not post solutions or suggestions
Experiments’ submissions
6 experiments, 2-3 weeks of working time
Submissions in groups of 2 students (Form your group)
Submissions via ILIAS
Assessment
Each experiment: 50 points. Overall 300 points.
At least 70% of the points required to pass
Corrections done by tutors
Discussion of solutions with tutors
Mandatory attendance
Each member has to be able to explain all tasks!
- Otherwise: 0 points for that task
Copied solutions
- First time: 0 points for that experiment
- Second time: failure of the practical
Schedule
| Experiment | Content | Release | Submission | Discussion |
|---|---|---|---|---|
| 1 | Familiarizing with Tools, Loading Data, and Basic Analysis of Data | 18.10.2017 | 01.11.2017, 11h | 08.11.2017 |
| 2 | Experiment 2 | 01.11.2017 | 15.11.2017, 11h | 22.11.2017 |
| 3 | Experiment 3 | 15.11.2017 | 29.11.2017, 11h | 06.12.2017 |
| 4 | Experiment 4 | 29.11.2017 | 13.12.2017, 11h | 20.12.2017 |
| 5 | Experiment 5 | 13.12.2017 | 10.01.2018, 11h | 17.01.2018 |
| 6 | Experiment 6 | 10.01.2018 | 31.01.2018, 11h | 07.02.2018 |
Literature
Spark in Action [book] by Petar Zečević and Marko Bonaći
Machine Learning with Spark [book] by Nick Pentreath
Apache Spark documentation: http://spark.apache.org/docs/latest