H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka
![Page 1: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/1.jpg)
Benchmarking Machine Learning Tools for Scalability, Speed and Accuracy
Szilárd Pafka, PhD
Chief Scientist, Epoch
H2O World Conference, Mountain View, Nov 2015
![Page 2: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/2.jpg)
![Page 3: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/3.jpg)
Disclaimer:
I am not representing my employer (Epoch) in this talk
I cannot confirm or deny whether Epoch is using any of the methods, tools, results etc. mentioned in this talk. The results presented in this talk should not be taken as any indication of whether Epoch is using these methods, tools, results etc.
![Page 4: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/4.jpg)
![Page 5: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/5.jpg)
I usually use other people’s code [...] it is usually not “efficient” (from time budget perspective) to write my own algorithm [...] I can find open source code for what I want to do, and my time is much better spent doing research and feature engineering
-- Owen Zhang
http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/
![Page 6: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/6.jpg)
Data Size for Supervised Learning
# records: <10M, 10M-10B, >10B
![Page 7: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/7.jpg)
Data Size for Non-Linear Supervised Learning
# records: <1M, 1M-100M, >100M
![Page 8: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/8.jpg)
binary classification, 10M records
numeric & categorical features, non-sparse
![Page 9: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/9.jpg)
![Page 10: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/10.jpg)
![Page 11: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/11.jpg)
http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf
![Page 12: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/12.jpg)
http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf
![Page 13: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/13.jpg)
![Page 14: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/14.jpg)
![Page 15: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/15.jpg)
- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib
![Page 16: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/16.jpg)
- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
![Page 17: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/17.jpg)
- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others
![Page 18: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/18.jpg)
- R packages 30%
- Python scikit-learn 40%
- Vowpal Wabbit 8%
- H2O 10%
- xgboost 8%
- Spark MLlib 6%
- a few others
![Page 19: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/19.jpg)
EC2
![Page 20: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/20.jpg)
Distributed computation generally is hard, because it adds an additional layer of complexity and [network] communication overhead. The ideal case is scaling linearly with the number of nodes; that’s rarely the case. Emerging evidence shows that very often, one big machine, or even a laptop, outperforms a cluster.
http://fastml.com/the-emperors-new-clothes-distributed-machine-learning/
![Page 21: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/21.jpg)
![Page 22: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/22.jpg)
n = 10K, 100K, 1M, 10M, 100M
- Training time
- RAM usage
- AUC
- CPU % by core
- read data, pre-process, score test data
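The measurements listed on this slide can be sketched as a small harness. This is a minimal illustration, not the code used for the talk: the synthetic data generator `make_data`, the placeholder learner `train_stub`, and the test-set size are all assumptions made here so the example runs with the standard library only. `tracemalloc` tracks Python-level allocations, so its peak is only a rough proxy for RAM usage.

```python
import random
import time
import tracemalloc


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Tied scores share the average of the ranks they span.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def make_data(n, seed):
    """Hypothetical synthetic data: one numeric feature, noisy binary label."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [1 if x + rng.gauss(0.0, 0.3) > 0.5 else 0 for x in xs]
    return xs, ys


def train_stub(xs, ys):
    """Placeholder learner (assumption): orients the score by the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    direction = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
    return lambda x: direction * x


def benchmark(sizes, n_test=2000):
    """Record training time, peak traced memory and test AUC for each n."""
    test_x, test_y = make_data(n_test, seed=1)
    rows = []
    for n in sizes:
        xs, ys = make_data(n, seed=0)
        tracemalloc.start()
        start = time.perf_counter()
        model = train_stub(xs, ys)
        train_seconds = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        scores = [model(x) for x in test_x]
        rows.append({
            "n": n,
            "train_seconds": train_seconds,
            "peak_bytes": peak_bytes,
            "auc": auc(test_y, scores),
        })
    return rows


if __name__ == "__main__":
    for row in benchmark([1_000, 10_000, 100_000]):
        print(row)
```

Memory tracing slows execution, so in a real comparison timing and memory would more sensibly be measured in separate runs, and a real learner would replace `train_stub`.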
![Page 23: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/23.jpg)
![Page 24: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/24.jpg)
![Page 25: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/25.jpg)
![Page 26: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/26.jpg)
![Page 27: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/27.jpg)
linear tops off
(data size)
(accuracy)
![Page 28: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/28.jpg)
linear tops off
more data & better algo
(data size)
(accuracy)
![Page 29: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/29.jpg)
linear tops off
more data & better algo
random forest on 1% of data beats linear on all data
(data size)
(accuracy)
![Page 30: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/30.jpg)
![Page 31: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/31.jpg)
10x
![Page 32: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/32.jpg)
![Page 33: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/33.jpg)
![Page 34: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/34.jpg)
![Page 35: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/35.jpg)
![Page 36: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/36.jpg)
http://datascience.la/benchmarking-random-forest-implementations/#comment-53599
![Page 37: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/37.jpg)
![Page 38: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/38.jpg)
![Page 39: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/39.jpg)
![Page 40: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/40.jpg)
![Page 41: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/41.jpg)
![Page 42: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/42.jpg)
![Page 43: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/43.jpg)
![Page 44: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/44.jpg)
we will continue to run large [...] jobs to scan petabytes of [...] data to extract interesting features, but this paper explores the interesting possibility of switching over to a multi-core, shared-memory system for efficient execution on more refined datasets [...] e.g., machine learning
http://openproceedings.org/2014/conf/edbt/KumarGDL14.pdf
![Page 45: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/45.jpg)
![Page 46: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/46.jpg)
![Page 47: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/47.jpg)
learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
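The slide gives two parameter settings without naming the algorithm; the presence of a learning rate suggests boosted trees. Since the benchmarked tools spell these parameters differently, a small translation table can keep the settings identical across tools. The sketch below is an assumption about how one might organize this, not the talk's code: the helper `translate` and the tool keys are hypothetical, while the parameter names are those of H2O's GBM (`ntrees`, `learn_rate`, `max_depth`) and of the scikit-learn style estimators in scikit-learn and xgboost (`n_estimators`, `learning_rate`, `max_depth`).

```python
# The two settings from the slide, in the slide's own naming.
SETTINGS = [
    {"learn_rate": 0.1, "max_depth": 6, "n_trees": 300},
    {"learn_rate": 0.01, "max_depth": 16, "n_trees": 1000},
]

# Slide name -> parameter name per tool (tool keys are hypothetical labels).
PARAM_NAMES = {
    "h2o": {
        "learn_rate": "learn_rate",
        "max_depth": "max_depth",
        "n_trees": "ntrees",
    },
    "sklearn_style": {  # scikit-learn and xgboost's scikit-learn wrapper
        "learn_rate": "learning_rate",
        "max_depth": "max_depth",
        "n_trees": "n_estimators",
    },
}


def translate(setting, tool):
    """Rename the keys of one setting to the given tool's parameter names."""
    names = PARAM_NAMES[tool]
    return {names[key]: value for key, value in setting.items()}


if __name__ == "__main__":
    for setting in SETTINGS:
        print(translate(setting, "h2o"))
        print(translate(setting, "sklearn_style"))
```

The resulting dictionaries could then be passed as keyword arguments to the respective estimator constructors.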
![Page 48: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/48.jpg)
![Page 49: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/49.jpg)
![Page 50: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/50.jpg)
![Page 51: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/51.jpg)
![Page 52: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/52.jpg)
![Page 53: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/53.jpg)
![Page 54: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/54.jpg)
Non-Linear Supervised Learning
![Page 55: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/55.jpg)
# records: <1M, 1M-100M, >100M
Non-Linear Supervised Learning
![Page 56: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/56.jpg)
![Page 57: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/57.jpg)
![Page 58: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/58.jpg)
![Page 59: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/59.jpg)
![Page 60: H2O World - Benchmarking Open Source ML Platforms - Szilard Pafka](https://reader031.vdocuments.site/reader031/viewer/2022022413/58ed82ea1a28ab1f1f8b4617/html5/thumbnails/60.jpg)