Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile

Upload: continuum-analytics

Post on 12-Apr-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

DATA SCIENCE TEAM COLLABORATION

FORGET ABOUT MEETING ME HALFWAY, TAKE ME THE LAST MILE

Page 2: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

#OpenDataScienceMeans #AnacondaCON Ian.Stokes-Rees @ijstokes

Page 3: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

OGT molecular dynamics simulation
Protein “mouth” opening, 1 µs

Page 4: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

CERN computing facility
Geneva, Switzerland

Page 5: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

Page 6: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SUCCESS COMES FROM TEAM WORK

http://bit.ly/ac17-collab

Page 7: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SUCCESS COMES FROM TEAM WORK

Page 8: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

IAN: ENGINEER, PHYSICIST, BIOLOGIST?

• Ian Stokes-Rees, @ijstokes
• Product Marketing Manager
• Computational Scientist
• Passionate advocate of Open Data Science
• Educator and evangelist for use of Python and Anaconda

Page 9: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

FIRST TASTE OF “BIG DATA” COMPUTING

• 100,000 acoustic tri-phone models
• 100 parameters per model
• 10 million parameters to estimate
• adaptation = real-time adjustment
• computation = tricky!
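
The counts on this slide multiply out as follows. This is a minimal sketch of the arithmetic only; the variable names are illustrative and do not come from the talk.

```python
# Figures are taken from the slide; the variable names are illustrative only.
n_models = 100_000        # acoustic tri-phone models
params_per_model = 100    # parameters per model

total_params = n_models * params_per_model
print(f"{total_params:,} parameters to estimate")
```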

Page 10: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

PhD on CERN LHCb COMPUTING TEAM

Distributed computing infrastructure
• 1000s of concurrent users
• 100s of federated computing centers
• no centralized control
• 1M+ servers with software installed
• 20+ year life span
• 20 GB of data per second
• 14 hours per day
• 7 days a week
• 7 months of the year

March 26, 2010 LHCb first physics at 3.5 TeV
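
The data-rate figures on this slide can be combined into a rough volume estimate. The sketch below assumes decimal units and 30-day months, neither of which is stated in the talk, and it treats the quoted rate as sustained for every running hour.

```python
# Back-of-the-envelope data volume from the slide's figures.
# Assumptions (not from the talk): decimal units and 30-day months.
GB = 10**9
PB = 10**15

rate_bytes_per_s = 20 * GB       # 20 GB of data per second
seconds_per_day = 14 * 3600      # 14 hours per day
days_per_year = 7 * 30           # 7 days a week for 7 months of the year

bytes_per_day = rate_bytes_per_s * seconds_per_day
bytes_per_year = bytes_per_day * days_per_year

print(f"{bytes_per_day / PB:.3f} PB per running day")
print(f"{bytes_per_year / PB:.1f} PB per running year")
```

This is the volume implied by the quoted rate alone; the slide does not say how much of it is retained after processing.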

Page 11: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

HOW DO CERN PHYSICISTS DO THIS?

• Some smart people over there
  • Who brought us the Web, HTTP, and HTML?
• Big Data
  • Multi-PB per year
• Large collaborating teams
  • 1000s of people accessing systems
• Computation critical
  • Or there is no way to make sense of the data
  • And discover new physics

December 2, 2016: LHCb proton-lead collisions

Page 12: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate

Page 13: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

HOW WOULD YOU DO IT?

Custom hardware: CMS L0 muon trigger ASIC

Giant compute and storage clusters

Wicked fast algorithms written in Fortran and C

Python: the Swiss army knife for computational physics

Page 14: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

PYTHON: LINGUA FRANCA FOR DATA SCIENCE

• Human readable
• Easy to learn
• Object oriented
• Cleanly wraps C and Fortran
• Amazing foundation of high quality data science libraries
• Suitable for scripting, algorithms, data processing and applications

Page 15: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

THE CALCULUS OF NEWTON AND LEIBNIZ

Page 16: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SOMETIMES ESOTERIC IS OK

Page 17: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

HERMITS AND HIGH PRIESTS

NPS, Richard Proenneke 1985

Page 18: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

MOLECULAR BIOLOGY: FROM PROTONS TO PROTEINS

• It takes 3-9 months in the wet lab to prepare protein samples

• Once prepared, it takes only a few days to “image” those samples and produce digitized representations

• However, the “images” aren’t yet 3D atomic models

• That takes from weeks to months to complete, sitting behind a computer

• You may know it as protein folding

Nature, 2011, PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker

Page 19: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

HOW DO WE ACCELERATE THE TIME TO INSIGHT?

Page 20: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SUCCESS COMES FROM TEAM WORK

Page 21: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

WHAT DOES “HALF WAY” LOOK LIKE?

Today’s “good” data science environment:
• Provide high performance computing resources
  • For example, Hadoop infrastructure
• Deploy a wide selection of the most popular analysis software
• Training and documentation
• Technical support

Page 22: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

FISH OUT OF WATER

• Why would we take an expert biochemist and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great biochemist?

Page 23: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

FISH OUT OF WATER

• Why would we take an expert business analyst and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great business analyst?

Page 24: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SUCCESS COMES FROM TEAM WORK

Page 25: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

TAKE ME THE LAST MILE

• DevOps engineer pre-configures scalable computation
  • Laptop to server to cluster
  • DevOps team is a partner, not a service provider
• Software engineer creates and customizes software for the task, project or individual
  • Avoiding generic, static software setups
• Data scientist composes workflow
• Analyst is provided simple high level interface
  • With option to “drill down”
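
One way to picture a simple high level interface with the option to drill down is a result object that leads with a summary and still exposes the detail behind it. The sketch below is purely illustrative: the `AnalysisResult` class, its methods and the sample data are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AnalysisResult:
    """Hypothetical result object: a short summary up front, full detail on request."""

    scores: dict  # per-item detail, e.g. {"region_a": 0.5, ...}

    def summary(self) -> dict:
        """High-level view intended for the analyst."""
        best = max(self.scores, key=self.scores.get)
        return {
            "items": len(self.scores),
            "best_item": best,
            "mean_score": round(mean(self.scores.values()), 3),
        }

    def drill_down(self, item: str) -> float:
        """Direct access to the underlying value for one item."""
        return self.scores[item]


# Made-up sample data, for illustration only.
result = AnalysisResult(scores={"region_a": 0.5, "region_b": 0.75, "region_c": 0.25})
print(result.summary())
print(result.drill_down("region_b"))
```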

Page 26: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

WHAT ABOUT THOSE PROTEINS?

• Normally it takes 10-200 hours of computing time to match a “template” protein fragment to the imaging data
• There are 100k templates (known protein “folds”) to choose from
• “Be stupid” and just try them all – sometimes you’ll be surprised!
• I spent 18 months working with biochemists and IT sys admins across the country to create a sensible parallel & distributed workflow
• 4-40 hours wall clock time to run 2k-20k hour parallel computation
• Real-time updates of results
• Web based interface to access summary and detailed data viz
• Analysis performed in Jupyter Notebook, allowing customization
• File-system based to enable “drill down” and direct access
• 6M hours per year (~700 years), peak parallelism 20k cores
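
The "try them all" idea is an embarrassingly parallel search: score every template independently and keep the best, reporting progress as results come in. The sketch below shows only that pattern and is not the workflow described in the talk. It assumes a local thread pool as a stand-in for a distributed cluster, and `score_template`, `try_all_templates` and the toy scoring rule are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_template(template_id, imaging_data):
    """Hypothetical stand-in for one template match (10-200 hours in the talk).

    Toy score: higher is better, and template 7 fits best.
    """
    return -abs(template_id - 7)


def try_all_templates(template_ids, imaging_data, max_workers=4):
    """Score every template in parallel and report the best as results arrive."""
    best = None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the template it is scoring.
        futures = {
            pool.submit(score_template, t, imaging_data): t for t in template_ids
        }
        for future in as_completed(futures):
            template_id, score = futures[future], future.result()
            if best is None or score > best[1]:
                best = (template_id, score)
                # Progress message printed as soon as a better match is found.
                print(f"new best so far: template {template_id}, score {score}")
    return best


best_template, best_score = try_all_templates(range(20), imaging_data=None)
print(best_template, best_score)
```

Each template is scored without reference to any other, which is what allows thousands of hours of computation to be spread across many cores and finished in far less wall clock time.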

Page 27: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

DATA SCIENCE PATTERN

• How is it done today?
• What is the opportunity for improvement?
• Prototype and evaluate – is it better? Rinse and repeat
• Standardize and automate the workflow/model
• Scale the workflow/model
• Preprocess and distribute the data
• Instrument execution and set quality metrics
• Establish easy access interface
• Create programmatic APIs
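
As one possible reading of "instrument execution and set quality metrics", a workflow step can be wrapped so that its run time is recorded and its output is checked against a threshold. This is a minimal sketch under that assumption; `instrumented`, `drop_missing`, `completeness`, the sample data and the threshold are all hypothetical.

```python
import time
from functools import wraps

run_log = []  # hypothetical in-memory record of step timings


def instrumented(step):
    """Record the wall-clock time of each call to a workflow step."""
    @wraps(step)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = step(*args, **kwargs)
        run_log.append({"step": step.__name__, "seconds": time.perf_counter() - start})
        return result
    return wrapper


@instrumented
def drop_missing(values):
    """Example workflow step: remove missing readings."""
    return [v for v in values if v is not None]


def completeness(raw, cleaned):
    """Example quality metric: fraction of records that survive cleaning."""
    return len(cleaned) / len(raw) if raw else 0.0


raw = [1.0, None, 3.0, 4.0]          # made-up input
cleaned = drop_missing(raw)
quality = completeness(raw, cleaned)
MIN_COMPLETENESS = 0.5               # assumed threshold, for illustration only
print(run_log[0]["step"], f"completeness={quality:.2f}", quality >= MIN_COMPLETENESS)
```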

FIN

Page 28: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

SUCCESS COMES FROM TEAM WORK

Remember the footnote?
Collaborative cross-functional teams

Page 29: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

BREAKING DATA SCIENCE OPEN

Page 30: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

ANACONDA & COLLABORATION

Page 31: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 1: ANACONDA

http://continuum.io/downloads

Page 32: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

Page 33: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

Page 34: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

Page 35: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

NOTEBOOKS FOR DATA SCIENCE COLLABORATION

Do you understand why notebooks are so popular?
There are many angles to this, but my take:

• Visual record of the data science process
• They tell a story, and support rich hyperlinked prose
• Data can be embedded
• Algorithms or analysis techniques are captured
• Interactive visualizations are inline
• Sharable
• Reproducible*

Page 36: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 2: ANACONDA CLOUD

http://anaconda.org

Page 37: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 2: ANACONDA CLOUD

Page 38: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 2: (MY) ANACONDA CLOUD

http://anaconda.org/ijstokes

Page 39: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 2: (MY) ANACONDA CLOUD

Page 40: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 2: (MY) ANACONDA CLOUD

Page 41: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 3: ANACONDA ENTERPRISE (TODAY)

Page 42: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

STEP 3: ANACONDA ENTERPRISE (COMING SOON)

Page 43: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

ANACONDA: GIVING SUPERPOWERS TO THE PEOPLE WHO CHANGE THE WORLD

TEAMS

Page 44: Data Science Team Collaboration: Forget About Meeting Me Halfway, Take Me the Last Mile | AnacondaCON 2017

THANK YOU! QUESTIONS?

Ian Stokes-Rees @ijstokes

http://bit.ly/ac17-collab