Harnessing Big Data with Spark
TRANSCRIPT
Lawrence Spracklen, Alpine Data
Alpine Data
MapReduce
• Allows the distribution of large data computations across a cluster
• Computations are typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
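The MR model above can be sketched in plain Python: map emits key/value pairs, a sort stands in for the shuffle, and reduce aggregates per key. A word count, the classic example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input split.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle: sorting groups pairs by key, standing in for the
    # distributed shuffle. Reduce: sum the counts for each key.
    grouped = groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
    for word, group in grouped:
        yield (word, sum(count for _, count in group))

lines = ["big data with spark", "spark and big data"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts["big"], counts["spark"])  # 2 2
```

In a real MR job, each phase boundary is also a disk boundary — which is exactly the cost discussed on the next slide.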
MR Performance
• Multiple disk interactions are required in EACH MR operation
[Diagram: the Map and Reduce stages, each with disk reads and writes]
Performance Hierarchy
[Chart: read bandwidths across the storage hierarchy — 0.10 GB/s, 0.10 GB/s, 0.60 GB/s, and 80 GB/s: roughly a 100X spread in read bandwidth]
Optimizing MR
• Many companies have significant legacy MR code
  – Either direct MR or indirect usage via Pig
• A variety of techniques exist to accelerate MR
  – Apache Tez
  – Tachyon or Apache Ignite
  – SystemML
Spark
• Several significant advancements over MR
  – Generalizes the two-stage MR model into arbitrary DAGs
  – Enables in-memory dataset caching
  – Improved usability
• Reduced disk reads/writes deliver significant speedups
  – Especially for iterative algorithms, such as ML
Performance comparisons
*http://www.edureka.co/blog/apache-spark-vs-hadoop-mapreduce
Spark Tuning
• Increased reliance on memory introduces a greater need for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with “getting it right”
• Auto-tuning is coming…
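As a rough sizing aid for the caching question above, Spark's unified memory model (introduced in Spark 1.6) reserves a fixed chunk of each executor heap and splits the remainder between execution and storage. The fractions below are the documented defaults; verify them against the Spark version you actually run:

```python
# Rough cache sizing under Spark's unified memory model (Spark 1.6+).
# The constants below are the documented defaults; check them against
# your Spark version's configuration.
RESERVED_MB = 300        # fixed reservation for Spark internals
MEMORY_FRACTION = 0.6    # spark.memory.fraction
STORAGE_FRACTION = 0.5   # spark.memory.storageFraction

def cache_budget_mb(executor_heap_mb):
    """Approximate memory protected for cached (storage) data per executor."""
    unified = (executor_heap_mb - RESERVED_MB) * MEMORY_FRACTION
    return unified * STORAGE_FRACTION

# An 8 GB executor heap leaves roughly 2.3 GB protected for caching.
print(round(cache_budget_mb(8192)))  # 2368
```

If the datasets you intend to cache exceed this budget, blocks spill or get evicted — one concrete reason "getting it right" matters.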
Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
  – Reduced time to insights
  – Reduced cluster size
  – Eliminate subsampling
  – AutoML
AutoML
• Data sets are increasingly large and complex
• Increasingly difficult to intuitively “know” the optimal
  – Feature engineering
  – Choice of algorithm
  – Parameterization of the algorithm(s)
• Significant manual trial and error
• Cult of the algorithm
Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
  – Feature extraction
  – Feature selection
  – Feature construction
  – Feature elimination
• Domain/dataset knowledge is important, but basic automation is feasible
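One piece of basic automation from the list above is feature elimination: near-constant columns carry little signal and are natural removal candidates. A minimal stdlib sketch (the threshold is illustrative, not a recommended value):

```python
def low_variance_features(columns, threshold=1e-9):
    """Flag near-constant columns as candidates for elimination."""
    flagged = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance <= threshold:
            flagged.append(name)
    return flagged

data = {
    "age":      [23, 45, 31, 52],
    "constant": [1, 1, 1, 1],   # zero variance: carries no signal
}
print(low_variance_features(data))  # ['constant']
```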
Algorithm selection
• Select the dependent column
• Indicate classification or regression
• Press “go”
  – Algorithms run in parallel across the cluster
  – Minimally provides a good starting point
  – Significantly reduces “busy work”
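The press-“go” flow can be sketched with stub learners standing in for real algorithms (the names and scores below are hypothetical): each candidate is scored concurrently, then the results are ranked:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real learners: each "trains" on the data
# and returns a validation score for the chosen dependent column.
def logistic_stub(data):
    return 0.81

def tree_stub(data):
    return 0.87

def naive_bayes_stub(data):
    return 0.74

ALGORITHMS = {
    "logistic": logistic_stub,
    "tree": tree_stub,
    "naive_bayes": naive_bayes_stub,
}

def press_go(data):
    """Run every candidate algorithm concurrently and rank by score."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, data) for name, fn in ALGORITHMS.items()}
        scores = {name: f.result() for name, f in futures.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = press_go([])
print(ranking[0][0])  # tree
```

In a real system each submit would launch a Spark job, but the shape of the loop is the same: fan out, score, rank.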
Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust them intelligently?
  – Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search
Algorithm tuning
• Gradient boosted tree parameterization, e.g.
  – Number of trees
  – Maximum tree depth
  – Loss function
  – Minimum node split size
  – Bagging rate
  – Shrinkage
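Exhaustive tuning over even this modest list of knobs multiplies quickly. A sketch with illustrative values (not library defaults) shows the combinatorial blow-up:

```python
from itertools import product

# Illustrative grid over the gradient-boosted-tree knobs listed above;
# the candidate values are hypothetical, not library defaults.
grid = {
    "n_trees":        [100, 300, 500],
    "max_depth":      [3, 5, 7],
    "loss":           ["squared", "absolute"],
    "min_split_size": [2, 10],
    "bagging_rate":   [0.5, 0.8],
    "shrinkage":      [0.05, 0.1],
}

def expand(grid):
    """Enumerate every parameter combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(expand(grid))
print(len(combos))  # 3*3*2*2*2*2 = 144
```

Each combination is a full model fit, which is why smarter search (random or model-based) beats the full grid as the knob count grows.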
AutoML
[Diagram: AutoML pipeline — the data set, after feature engineering, fans out to N candidate algorithms]
1) Investigate N ML algorithms
2) Tune the top-performing algorithms
3) Feature elimination
Spark is for large datasets
*http://datascience.la/benchmarking-random-forest-implementations/
• If your data fits on a single node…
• Other high-performance options exist
*http://haifengl.github.io/smile/index.html
[Chart: run time of random forest implementations]
Data set size
• Large data lakes can consist of many small files
• Memory per node increasing rapidly
*http://www.kdnuggets.com/2015/11/big-ram-big-data-size-datasets.html
NVDIMMs
• Driving significant increases in node memory
  – Up to a 10X increase in density
• Coming in late 2016…
Hybrid operators
• It is time consuming to maintain multiple ML libraries and manually determine the optimal choice
• Develop hybrid implementations that automatically choose the optimal approach, based on:
  – Data set size
  – Cluster size
  – Cluster utilization
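The dispatch logic might look like the sketch below; the 50%-of-RAM headroom factor and the inputs are illustrative assumptions, not tuned thresholds:

```python
def choose_backend(dataset_gb, node_memory_gb, cluster_free_nodes):
    """Route an ML operator to the cheapest viable implementation.

    Illustrative thresholds only: the 50%-of-RAM headroom factor is an
    assumption, not a tuned value.
    """
    if dataset_gb < 0.5 * node_memory_gb:
        return "single-node"   # fits in one node's RAM with headroom
    if cluster_free_nodes == 0:
        return "queue"         # needs the cluster, but it is saturated
    return "spark"             # distributed path

print(choose_backend(20, 256, 8))    # single-node
print(choose_backend(900, 256, 8))   # spark
```

The point is that the user never makes this choice; the hybrid operator does, per invocation.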
Single-node performance (1/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
Single-node performance (2/2)
*http://www.ayasdi.com/blog/LawrenceSpracklen
Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
  – Operationalizing PowerPoint?
  – Hand-rolled scoring flows
PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
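For reference, a PFA document is plain JSON with input, output, and action sections. This minimal engine, adapted from the PFA specification's introductory example, simply adds one to its input:

```json
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": ["input", 1]}
  ]
}
```

Because the scoring logic is data, not code, the same document can be executed by any conformant PFA engine, which is what makes the format portable across training and serving environments.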
Conclusions
• Spark delivers significant performance improvements over MR
  – Can introduce more tuning requirements
• Provides an opportunity for AutoML
  – Automatically determine good solutions
• Understand when it’s appropriate
• Don’t forget about operationalization