From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek

From Kaggle to H2O: The true story of a civil engineer turned data geek
Jo-fai (Joe) Chow, Data Scientist, [email protected], @matlabulous
SV Big Data Science at H2O.ai, 28th February, 2017

Upload: jo-fai-chow

Post on 22-Mar-2017


TRANSCRIPT

Page 1: From Kaggle to H2O - The True Story of a Civil Engineer Turned Data Geek

From Kaggle to H2O
The true story of a civil engineer turned data geek

Jo-fai (Joe) Chow
Data Scientist
[email protected]
@matlabulous

SV Big Data Science at H2O.ai
28th February, 2017

Page 2

About Me
• Civil (Water) Engineer
  2010 – 2015
  • Consultant (UK)
    • Utilities
    • Asset Management
    • Constrained Optimization
  • Industrial PhD (UK)
    • Infrastructure Design Optimization
    • Machine Learning + Water Engineering
    • Discovered H2O in 2014
• Data Scientist
  2015
  • Virgin Media (UK)
  • Domino Data Lab (Silicon Valley)
  2016 – Present
  • H2O.ai (Silicon Valley)

Page 3

Agenda
• My Data Science Journey
  • Life as a Water Engineer
  • Massive Open Online Course
  • Kaggle
  • New Skills
  • Side Projects
  • New Opportunities
  • Discovery of H2O & Domino
• To Kaggle, or not to Kaggle
  • Joy, Pain, Fear, Gain …
  • … and New Friends
• Life as a Data Scientist
  • Using H2O for Kaggle
  • Rossmann Store Sales
  • Santander Products Recommendation
• Conclusions

Page 4

Life as a Water Engineer

Page 5

Joe the Outlier

http://www.h2o.ai/gartner-magic-quadrant/

Joe (Water Engineer)

Joe (2015)

Page 6

Massive Open Online Course (MOOC)

Page 7

My First MOOC Experience
• Introduction to AI (2011)
  • One of the first MOOCs
• Key messages from Sebastian Thrun:
  • “Just dive into it.”
  • “Get your hands dirty.”
• Met new friends
  • Decided to collaborate for fun
  • “How about Kaggle?”
  • “What is Kaggle?”

Page 8

About Kaggle
• World’s biggest data mining competition platform
• Competition Types:
  • Featured (w/ Prize)
  • Recruitment
  • Playground
  • Beginner (101)

Page 9

My Very First Kaggle Experience
• Predict Bond Trade Price
  • No domain knowledge
  • Lots of numbers (I couldn’t open the CSV in Excel)
• Regression Models
  • Random Forest
  • Support Vector Machine
  • Neural Networks
• Black Magic or Data Science?
  • Still, I wasn’t so sure

Page 10

Teamwork
• Problems
  • “Hey Joe, you are a nice guy.”
  • “… but we can’t work together.”
  • “Okay, wait … why?”
  • “You love MATLAB so much.”
  • “You even have a fanboy Twitter handle!”
• Problems
  • “We prefer open source tools like R or Python.”
  • “Wait … you guys can use Octave.”
  • “Thanks, but no thanks …”
• Solution
  • I kept using MATLAB
  • Lone wolf
  • ZERO collaboration

Page 11

Page 12

Adapt or Die
If you can’t change the world your friends, change yourself.

Page 13

Identifying Skill Gaps
• Obvious Skill Gaps
  • Open-source Programming Languages
  • Machine Learning Techniques
  • Big Data
  • Collaboration
• Kind of Related
  • Data Visualization
  • Explaining Results
• Where to Start?

https://www.r-bloggers.com/

Page 14

Cool Things People Created with R

http://www.jofaichow.co.uk/2014_03_11_LondonR/

Page 15

Learn
• More MOOCs
  • Machine Learning
    • Andrew Ng (Coursera)
    • MATLAB / Octave
  • Data Analysis
    • Jeff Leek (Coursera)
    • R
  • Intro to Programming
    • Dave Evans (Udacity)
    • Python
• Kaggle Forums
  • Tricks you can’t learn from schools/books
• Skills I also picked up
  • Linux – Ubuntu*
  • Git (I mean Git with GUI)
  • Cloud
  • HTML / CSS

*Ubuntu is an ancient African word that means “I can’t configure Debian.”

Page 16

Side Project #1 – Crime Data Visualization

https://github.com/woobe/rApps/tree/master/crimemap
http://insidebigdata.com/2013/11/30/visualization-week-crimemap/

Page 17

Side Project #2 – Data Visualization Contest

https://github.com/woobe/rugsmaps
http://blog.revolutionanalytics.com/2014/08/winner-for-revolution-analytics-user-group-map-contest.html

Page 18

Side Project #3 – Color Extraction #TheDress

https://github.com/woobe/rPlotter
http://blog.revolutionanalytics.com/2015/03/color-extraction-with-r.html

Page 19

Side Project #4 – World Cup 2014 Prediction
• Joe (Machine Learning) vs. Friends
• Correct Results (WDL)
  • ML: 35 / 64 (55%)
  • Friends (Avg): 29 / 64 (46%)
• Correct Score
  • ML: 10 / 64 (16%)
  • Friends (Avg): 4 / 64 (6%)

https://github.com/woobe/wc2014

Page 20

Open Up Myself

Page 21

New Opportunities
R Community, H2O & Domino Data Lab

Page 22

LondonR 2013 & useR! 2014

http://www.jofaichow.co.uk/2014_03_11_LondonR/
https://github.com/woobe/useR_2014

Page 23

useR! 2014

Ramnath Vaidyanathan
htmlwidgets

DataRobot

Nick @ DominoDataLab

H2O.ai & John Chambers!

rOpenSci

RStudio

Matt Dowle
data.table (now at H2O.ai)

Page 24

R + Domino + H2O

https://blog.dominodatalab.com/using-r-h2o-and-domino-for-a-kaggle-competition/

Page 25

Dear Kaggle
Joy, Pain, Fear, Gain … and New Friends

Page 26

Kaggle – The Joy

Page 27

Kaggle – The Pain & The Fear

Page 28

Kaggle – The Gain
• New Skills
  • Exploratory Data Analysis
  • Machine Learning Algorithms
  • Feature Engineering
  • Model Stacking
  • Communication
• THE FEAR OF OVERFITTING!
• New Friends
  • London Kaggle Meetup

Mickael, Joe

Page 29

Life as a Data Scientist

Page 30

Toy (In-Class) vs. Kaggle vs. Real-World Data

Page 31

Story Telling

Page 32

Story Telling with One Single Slide

Yup. This much space.

Page 33

Using H2O for Kaggle

Page 34

Page 35

Rossmann Store Sales
• Stuck at top 10% for a long time
• Final Breakthrough (Mickael)
  • Added external data – weather in different cities
  • 48 hours left
• Model Stacking (Joe)
  • H2O Deep Learning
  • xgboost
  • Manual process (life before h2oEnsemble / Stacked Ensembles in H2O)
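
The slide does not spell out how the manual stacking was done, so the following is only a minimal sketch of the general technique, under stated assumptions. Two toy base models (a least-squares line and a k-nearest-neighbour regressor, both hypothetical stand-ins for H2O Deep Learning and xgboost) produce out-of-fold predictions, and a small linear blend is then fitted on those predictions. None of the function names below come from H2O or xgboost, and the data is synthetic.

```python
import math
import random

def fit_linear(xs, ys):
    """Closed-form least-squares line; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regressor on a single feature."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def out_of_fold(fit, xs, ys, n_folds=5):
    """Predict every row with a model that never saw that row."""
    preds = [0.0] * len(xs)
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        held = [i for i in range(len(xs)) if i % n_folds == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in held:
            preds[i] = model(xs[i])
    return preds

def fit_blend(p1, p2, ys):
    """Least-squares weights (w1, w2) for y ~ w1*p1 + w2*p2, no intercept."""
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    d = sum(v * v for v in p2)
    e = sum(u * y for u, y in zip(p1, ys))
    f = sum(v * y for v, y in zip(p2, ys))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det

def rmse(preds, ys):
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys))

# Synthetic data (an assumption for illustration): linear trend plus a wiggle.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2.0 * x + 3.0 * math.sin(2 * x) + random.gauss(0, 0.5) for x in xs]
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

# Level 1: out-of-fold predictions from each base model on the training set.
oof_lin = out_of_fold(fit_linear, train_x, train_y)
oof_knn = out_of_fold(fit_knn, train_x, train_y)

# Level 2: fit the blend weights on the out-of-fold predictions only.
w_lin, w_knn = fit_blend(oof_lin, oof_knn, train_y)

# Refit the base models on all training rows, then blend their test predictions.
lin = fit_linear(train_x, train_y)
knn = fit_knn(train_x, train_y)
stacked = [w_lin * lin(x) + w_knn * knn(x) for x in test_x]

print("weights:", w_lin, w_knn)
print("test RMSE (stacked):", rmse(stacked, test_y))
```

Fitting the blend on out-of-fold predictions rather than in-sample predictions is the step that keeps the second level from simply rewarding whichever base model overfits most.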

Page 36

Santander Product Recommendation
• Predict new products that customers will add in the future
• Reframed as a Multiclass Classification (see next slide)
• Feature Engineering
  • Basic (Everyone)
  • Advanced (ZFTurbo, Yifan, Anokas)
  • Also see Yifan’s slides
• Models
  • xgboost (ZFTurbo)
  • H2O GBM (Joe) – Single Best Model
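
The slide with the details of the reframing did not survive in this transcript, so what follows is a sketch of one common way to turn a product recommendation task into multiclass classification, not necessarily the team's exact recipe. Each newly added product becomes one training row whose label is that product, and recommendations are produced by ranking predicted class probabilities while skipping products the customer already owns. The field names, product names, and probabilities are all hypothetical.

```python
def to_multiclass_rows(snapshots):
    """One training row per newly added product: (customer_id, features, label)."""
    rows = []
    for snap in snapshots:
        added = sorted(set(snap["owned_now"]) - set(snap["owned_before"]))
        for product in added:
            rows.append((snap["customer_id"], snap["features"], product))
    return rows

def recommend(class_probs, owned, k=2):
    """Rank products by predicted probability, skipping ones already owned."""
    candidates = [(p, prob) for p, prob in class_probs.items() if p not in owned]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [p for p, _ in candidates[:k]]

# Hypothetical month-over-month snapshots of product ownership.
snapshots = [
    {"customer_id": 1, "features": {"age": 35},
     "owned_before": {"current_account"},
     "owned_now": {"current_account", "credit_card"}},
    {"customer_id": 2, "features": {"age": 52},
     "owned_before": set(),
     "owned_now": {"pension", "credit_card"}},
    {"customer_id": 3, "features": {"age": 41},
     "owned_before": {"pension"},
     "owned_now": {"pension"}},  # nothing added, so no training row
]

rows = to_multiclass_rows(snapshots)
for row in rows:
    print(row)

# Hypothetical class probabilities, as a multiclass model might output them.
probs = {"credit_card": 0.5, "pension": 0.3, "current_account": 0.2}
top = recommend(probs, owned={"credit_card"}, k=2)
print(top)
```

Customers who add several products contribute several rows, and customers who add none contribute no rows, which is the main consequence of this framing for the size and balance of the training set.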

Page 37

Page 38

Page 39

Page 40

When I Kaggle …

Page 41

Conclusions

Page 42

To Kaggle, or not to Kaggle?

Page 43

New Skills, New Friends & New Opportunities

Giphy is your friend when you don’t have enough time for bullet points.

Page 44

Differences between Kaggle & Data Science

Quote from Littleboat’s AMA on Kagglenoobs Slack Channel

Page 45

• People who have helped me along the way
  • Kaggle Friends
  • H2O.ai
  • Domino Data Lab
  • Mango Solutions
• Slides
  • bit.ly/h2o_meetups
• Contact
  • [email protected]
  • @matlabulous
  • github.com/woobe

Thanks!

Making Machine Learning Accessible to Everyone

Photo credit: Virgin Media