Harnessing Big Data with Spark
TRANSCRIPT
Lawrence Spracklen, Alpine Data
Alpine Data
MapReduce
• Allows the distribution of large data computations across a cluster
• Computations are typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
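The MR model above can be sketched in plain Python: map emits key/value pairs, a sort stands in for the shuffle, and reduce aggregates per key. A word count, the classic example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: sorting groups pairs by key, standing in for the
    # distributed shuffle. Reduce: sum the counts for each key.
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

lines = ["big data with spark", "spark and big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["big"], counts["spark"])  # 2 2
```

In a real MR job, each phase boundary is also a disk boundary — which is exactly the cost discussed on the next slide.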
MR Performance
• Multiple disk interactions are required in EACH MR operation
[Diagram: the Map and Reduce stages, each with disk reads and writes]
Performance Hierarchy
[Chart: read bandwidths across the storage hierarchy — 0.10 GB/s, 0.10 GB/s, 0.60 GB/s, and 80 GB/s: roughly a 100X spread in read bandwidth]
Optimizing MR
• Many companies have significant legacy MR code
  – Either direct MR or indirect usage via Pig
• A variety of techniques exist to accelerate MR
  – Apache Tez
  – Tachyon or Apache Ignite
  – SystemML
Spark
• Several significant advancements over MR
  – Generalizes the two-stage MR model into arbitrary DAGs
  – Enables in-memory dataset caching
  – Improved usability
• Reduced disk reads/writes deliver significant speedups
  – Especially for iterative algorithms, such as ML
Performance comparisons
*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
Spark Tuning
• Increased reliance on memory introduces a greater need for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
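As a rough sizing aid for the caching question above, Spark's unified memory model (introduced in Spark 1.6) reserves a fixed chunk of each executor heap and splits the remainder between execution and storage. The fractions below are the documented defaults; verify them against the Spark version you actually run:

```python
# Rough cache sizing under Spark's unified memory model (Spark 1.6+).
# The constants below are the documented defaults; check them against
# your Spark version's configuration.
RESERVED_MB = 300        # fixed reservation for Spark internals
MEMORY_FRACTION = 0.6    # spark.memory.fraction
STORAGE_FRACTION = 0.5   # spark.memory.storageFraction

def cache_budget_mb(executor_heap_mb):
    """Approximate memory protected for cached (storage) data per executor."""
    unified = (executor_heap_mb - RESERVED_MB) * MEMORY_FRACTION
    return unified * STORAGE_FRACTION

# An 8 GB executor heap leaves roughly 2.3 GB protected for caching.
print(round(cache_budget_mb(8192)))  # 2368
```

If the datasets you intend to cache exceed this budget, blocks spill or get evicted — one concrete reason "getting it right" matters.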
Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
  – Reduced time to insights
  – Reduced cluster size
  – Eliminate subsampling
  – AutoML
AutoML
• Data sets are increasingly large and complex
• Increasingly difficult to intuitively “know” the optimal
  – Feature engineering
  – Choice of algorithm
  – Parameterization of the algorithm(s)
• Significant manual trial and error
• Cult of the algorithm
Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
  – Feature extraction
  – Feature selection
  – Feature construction
  – Feature elimination
• Domain/dataset knowledge is important, but basic automation is feasible
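One piece of basic automation from the list above is feature elimination: near-constant columns carry little signal and are natural removal candidates. A minimal stdlib sketch (the threshold is illustrative, not a recommended value):

```python
def low_variance_features(columns, threshold=1e-9):
    """Flag near-constant columns as candidates for elimination."""
    flagged = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance <= threshold:
            flagged.append(name)
    return flagged

data = {
    "age":      [23, 45, 31, 52],
    "constant": [1, 1, 1, 1],   # zero variance: carries no signal
}
print(low_variance_features(data))  # ['constant']
```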
Algorithm selection
• Select the dependent column
• Indicate classification or regression
• Press “go”
  – Algorithms run in parallel across the cluster
  – Minimally provides a good starting point
  – Significantly reduces “busy work”
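The press-“go” flow can be sketched with stub learners standing in for real algorithms (the names and scores below are hypothetical): each candidate is scored concurrently, then the results are ranked:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real learners: each "trains" on the data
# and returns a validation score for the chosen dependent column.
def logistic_stub(data):
    return 0.81

def tree_stub(data):
    return 0.87

def naive_bayes_stub(data):
    return 0.74

ALGORITHMS = {
    "logistic": logistic_stub,
    "tree": tree_stub,
    "naive_bayes": naive_bayes_stub,
}

def press_go(data):
    """Run every candidate algorithm concurrently and rank by score."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, data) for name, fn in ALGORITHMS.items()}
        scores = {name: f.result() for name, f in futures.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = press_go([])
print(ranking[0][0])  # tree
```

In a real system each submit would launch a Spark job, but the shape of the loop is the same: fan out, score, rank.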
Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust them intelligently?
  – Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
Algorithm tuning
• Gradient boosted tree parameterization, e.g.
  – Number of trees
  – Maximum tree depth
  – Loss function
  – Minimum node split size
  – Bagging rate
  – Shrinkage
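Exhaustive tuning over even this modest list of knobs multiplies quickly. A sketch with illustrative values (not library defaults) shows the combinatorial blow-up:

```python
from itertools import product

# Illustrative grid over the gradient-boosted-tree knobs listed above;
# the candidate values are hypothetical, not library defaults.
grid = {
    "n_trees":        [100, 300, 500],
    "max_depth":      [3, 5, 7],
    "loss":           ["squared", "absolute"],
    "min_split_size": [2, 10],
    "bagging_rate":   [0.5, 0.8],
    "shrinkage":      [0.05, 0.1],
}

def expand(grid):
    """Enumerate every parameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(expand(grid))
print(len(combos))  # 3*3*2*2*2*2 = 144
```

Each combination is a full model fit, which is why smarter search (random or model-based) beats the full grid as the knob count grows.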
AutoML
[Diagram: AutoML pipeline — the data set, after feature engineering, fans out to N candidate algorithms]
1) Investigate N ML algorithms
2) Tune the top-performing algorithms
3) Feature elimination
Spark is for large datasets
*http://datascience.la/benchmarking-random-forest-implementations/
• If your data fits on a single node…
• Other high-performance options exist
*http://haifengl.github.io/smile/index.html
[Chart: run time of random forest implementations]
Data set size
• Large data lakes can consist of many small files
• Memory per node increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
NVDIMMs
• Driving significant increases in node memory
  – Up to a 10X increase in density
• Coming in late 2016…
Hybrid operators
• It is time consuming to maintain multiple ML libraries and manually determine the optimal choice
• Develop hybrid implementations that automatically choose the optimal approach, based on:
  – Data set size
  – Cluster size
  – Cluster utilization
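The dispatch logic might look like the sketch below; the 50%-of-RAM headroom factor and the inputs are illustrative assumptions, not tuned thresholds:

```python
def choose_backend(dataset_gb, node_memory_gb, cluster_free_nodes):
    """Route an ML operator to the cheapest viable implementation.

    Illustrative thresholds only: the 50%-of-RAM headroom factor is an
    assumption, not a tuned value.
    """
    if dataset_gb < 0.5 * node_memory_gb:
        return "single-node"   # fits in one node's RAM with headroom
    if cluster_free_nodes == 0:
        return "queue"         # needs the cluster, but it is saturated
    return "spark"             # distributed path

print(choose_backend(20, 256, 8))    # single-node
print(choose_backend(900, 256, 8))   # spark
```

The point is that the user never makes this choice; the hybrid operator does, per invocation.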
Single-node performance (1/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
Single-node performance (2/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
  – Operationalizing PowerPoint?
  – Hand-rolled scoring flows
PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
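For reference, a PFA document is plain JSON with input, output, and action sections. This minimal engine, adapted from the PFA specification's introductory example, simply adds one to its input:

```json
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": ["input", 1]}
  ]
}
```

Because the scoring logic is data, not code, the same document can be executed by any conformant PFA engine, which is what makes the format portable across training and serving environments.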
Conclusions
• Spark delivers significant performance improvements over MR
  – Can introduce more tuning requirements
• Provides an opportunity for AutoML
  – Automatically determine good solutions
• Understand when it’s appropriate
• Don’t forget about operationalization