No-Bullshit Data Science
TRANSCRIPT
Szilárd Pafka, PhD, Chief Scientist, Epoch
Domino Data Science Popup, San Francisco, Feb 2017
Disclaimer:
I am not representing my employer (Epoch) in this talk
I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk
Example #1
[Charts: benchmark of data manipulation tools — aggregation of 100M rows into 1M groups, and join of 100M rows x 1M rows; y-axis: time [s]]
[Chart: largest data analyzed, data size [M]]
[Chart: training time [s] vs data size; ~10x spread]
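The two operations benchmarked above can be sketched as follows. This is a minimal illustration in pandas (one of several tools the talk compares), scaled down from the 100M-row / 1M-group sizes used in the actual benchmark; the column names and sizes here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, g = 1_000_000, 10_000  # scaled down from 100M rows / 1M groups

# Fact table: a group key and a numeric value
df = pd.DataFrame({"grp": rng.integers(0, g, n), "x": rng.random(n)})

# Aggregation: mean of x within each group
agg = df.groupby("grp", as_index=False)["x"].mean()

# Join: fact table joined to a dimension table on the group key
dim = pd.DataFrame({"grp": np.arange(g), "label": np.arange(g) % 7})
joined = df.merge(dim, on="grp", how="inner")
```

Timing these two operations across tools (data.table, dplyr, pandas, Spark, various databases) at full scale is what produces the time-in-seconds charts.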
Gradient Boosting Machines
[Chart build-up: accuracy vs data size]
- linear tops off
- more data & better algo
- random forest on 1% of data beats linear on all data
Example #2
http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf
http://lowrank.net/nikos/pubs/empirical.pdf
- R packages
- Python scikit-learn
- Vowpal Wabbit
- H2O
- xgboost
- Spark MLlib
- a few others
EC2
n = 10K, 100K, 1M, 10M, 100M
Measured: training time, RAM usage, AUC, CPU % by core (also: read data, pre-process, score test data)
10x
Best linear AUC: 71.1
learn_rate = 0.1, max_depth = 6, n_trees = 300
learn_rate = 0.01, max_depth = 16, n_trees = 1000
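The two GBM settings above (a shallow/fast configuration and a deep/slow one) can be expressed in scikit-learn's parameter naming as a sketch; the slide uses H2O-style names (`learn_rate`, `n_trees`), and this mapping to `learning_rate` / `n_estimators` is an assumption, as is the tiny synthetic dataset used here to keep the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Shallow/fast vs deep/slow hyperparameter settings from the slide
fast = dict(learning_rate=0.1, max_depth=6, n_estimators=300)
slow = dict(learning_rate=0.01, max_depth=16, n_estimators=1000)

# Fit only the fast setting on a small synthetic dataset for illustration;
# the deep setting is far more expensive to train
X, y = make_classification(n_samples=2000, random_state=42)
gbm = GradientBoostingClassifier(**fast, random_state=42).fit(X, y)
```

In the benchmark, the deeper, slower configuration is where accuracy differences between tools start to matter, since not every implementation handles deep trees and many boosting rounds equally well.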
...
Summary