Data science team collaboration: forget about meeting me halfway, take me the last mile

Post on 12-Apr-2017

Category:

Data & Analytics

DATA SCIENCE TEAM COLLABORATION

FORGET ABOUT MEETING ME HALFWAY, TAKE ME THE LAST MILE

#OpenDataScienceMeans #AnacondaCON Ian.Stokes-Rees @ijstokes

OGT molecular dynamics simulation
Protein “mouth” opening, 1 µs

CERN computing facility
Geneva, Switzerland

SUCCESS COMES FROM TEAM WORK

http://bit.ly/ac17-collab

IAN: ENGINEER, PHYSICIST, BIOLOGIST?

• Ian Stokes-Rees, @ijstokes
• Product Marketing Manager
• Computational Scientist
• Passionate advocate of Open Data Science
• Educator and evangelist for use of Python and Anaconda

FIRST TASTE OF “BIG DATA” COMPUTING

• 100,000 acoustic tri-phone models
• 100 parameters per model
• 10 million parameters to estimate
• adaptation = real-time adjustment
• computation = tricky!

PhD on CERN LHCb COMPUTING TEAM

Distributed computing infrastructure
• 1000s of concurrent users
• 100s of federated computing centers
• No centralized control
• 1M+ servers with software installed
• 20+ year life span
• 20 GB of data per second
• 14 hours per day
• 7 days a week
• 7 months of the year

March 26, 2010 LHCb first physics at 3.5 TeV

HOW DO CERN PHYSICISTS DO THIS?

• Some smart people over there
  • Who brought us the Web, HTTP, and HTML?
• Big Data
  • Multi-PB per year
• Large collaborating teams
  • 1000s of people accessing systems
• Computation critical
  • Or there is no way to make sense of the data
  • And discover new physics

December 2, 2016: LHCb proton-lead collisions

CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate

HOW WOULD YOU DO IT?

Custom hardware: CMS L0 muon trigger ASIC

Giant compute and storage clusters

Wicked fast algorithms written in Fortran and C

Python: the Swiss army knife for computational physics

PYTHON: LINGUA FRANCA FOR DATA SCIENCE

• Human readable
• Easy to learn
• Object oriented
• Cleanly wraps C and Fortran
• Amazing foundation of high quality data science libraries
• Suitable for scripting, algorithms, data processing and applications

THE CALCULUS OF NEWTON AND LEIBNIZ

SOMETIMES ESOTERIC IS OK

http://bit.ly/ac17-collab

HERMITS AND HIGH PRIESTS

NPS, Richard Proenneke 1985

MOLECULAR BIOLOGY: FROM PROTONS TO PROTEINS

• It takes 3-9 months in the wet lab to prepare protein samples
• Once prepared, it takes only a few days to “image” those samples and produce digitized representations
• However, the “images” aren’t yet 3D atomic models
• That step takes weeks to months to complete, sitting behind a computer
• You may know it as protein folding

Lazarus, Nam, Jiang, Sliz, Walker. Nature, 2011. PMID: 21240259

HOW DO WE ACCELERATE THE TIME TO INSIGHT?

http://bit.ly/ac17-collab

SUCCESS COMES FROM TEAM WORK

http://bit.ly/ac17-collab

WHAT DOES “HALF WAY” LOOK LIKE?

Today’s “good” data science environment:
• Provide high performance computing resources
  • For example, Hadoop infrastructure
• Deploy a wide selection of the most popular analysis software
• Training and documentation
• Technical support

FISH OUT OF WATER

• Why would we take an expert biochemist and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great biochemist?

FISH OUT OF WATER

• Why would we take an expert business analyst and force them to be
  • A software engineer?
  • An IT system administrator?
  • A statistician?
• What can we do to let them focus on being a great business analyst?

SUCCESS COMES FROM TEAM WORK

http://bit.ly/ac17-collab

TAKE ME THE LAST MILE

• DevOps engineer pre-configures scalable computation
  • Laptop to server to cluster
  • DevOps team is a partner, not a service provider
• Software engineer creates and customizes software for the task, project or individual
  • Avoiding generic, static software setups
• Data scientist composes workflow
• Analyst is provided simple high level interface
  • With option to “drill down”

WHAT ABOUT THOSE PROTEINS?

• Normally it takes 10-200 hours of computing time to match a “template” protein fragment to the imaging data
• There are 100k templates (known protein “folds”) to choose from
• “Be stupid” and just try them all – sometimes you’ll be surprised!
• I spent 18 months working with biochemists and IT sys admins across the country to create a sensible parallel & distributed workflow
• 4-40 hours wall clock time to run 2k-20k hour parallel computation
• Real-time updates of results
• Web based interface to access summary and detailed data viz
• Analysis performed in Jupyter Notebook, allowing customization
• File-system based to enable “drill down” and direct access
• 6M hours per year (~700 years), peak parallelism 20k cores
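
The "try them all" idea can be sketched in a few lines of Python. This is an illustration only, not the workflow described on the slide: `score_template` and `scan_templates` are hypothetical names, the scorer is a deterministic placeholder rather than a real template-matching code, and a thread pool on one machine stands in for the distributed cluster. The sketch shows the shape of the approach: submit every template, report each new best result as soon as it arrives, and sanity-check the compute figures quoted above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random


def score_template(template_id: int) -> float:
    """Hypothetical stand-in for fitting one template to the imaging data.

    A real scorer would run for hours; this returns a deterministic
    pseudo-random score so the sketch runs instantly.
    """
    return random.Random(template_id).random()


def scan_templates(template_ids, max_workers=8, report=print):
    """Score every template in parallel and report each new best as it arrives."""
    best_id, best_score = None, float("-inf")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(score_template, t): t for t in template_ids}
        for future in as_completed(futures):  # results arrive in completion order
            template_id, score = futures[future], future.result()
            if score > best_score:
                best_id, best_score = template_id, score
                report(f"new best: template {best_id} scored {best_score:.4f}")
    return best_id, best_score


# Back-of-envelope checks using the figures quoted on the slide.
cpu_years_per_year = 6_000_000 / (24 * 365)  # about 685 years of compute
avg_cores_small = 2_000 / 4                  # 2k CPU-hours in 4 wall-clock hours
avg_cores_large = 20_000 / 40                # 20k CPU-hours in 40 wall-clock hours

best_id, best_score = scan_templates(range(1_000))
print(f"best template: {best_id}; compute per year: {cpu_years_per_year:.0f} years")
```

Both ends of the quoted range work out to an average of about 500 busy cores per run, well below the 20k-core peak. For CPU-bound scoring a real setup would more likely use processes or a cluster scheduler than threads; threads are used here only to keep the sketch portable.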

DATA SCIENCE PATTERN

• How is it done today?
• What is the opportunity for improvement?
• Prototype and evaluate – is it better? Rinse and repeat
• Standardize and automate the workflow/model
• Scale the workflow/model
• Preprocess and distribute the data
• Instrument execution and set quality metrics
• Establish easy access interface
• Create programmatic APIs
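
One possible reading of "instrument execution" is sketched below, assuming a workflow built from plain Python functions. The `instrumented` decorator, the `METRICS` dictionary and the toy `preprocess` step are hypothetical names introduced for illustration; they are not part of the talk or of any particular library.

```python
import time
from functools import wraps


def instrumented(metrics: dict):
    """Return a decorator that records call counts and wall-clock seconds per step."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the timing even if the step raises.
                entry = metrics.setdefault(func.__name__, {"calls": 0, "seconds": 0.0})
                entry["calls"] += 1
                entry["seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


METRICS = {}


@instrumented(METRICS)
def preprocess(values):
    """Toy workflow step: drop missing values."""
    return [v for v in values if v is not None]


cleaned = preprocess([1, None, 3])
print(cleaned, METRICS["preprocess"]["calls"])
```

Keeping the metrics in an ordinary dictionary makes them easy to expose later through whatever access interface or programmatic API the team settles on.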

FIN

SUCCESS COMES FROM TEAM WORK

Remember the footnote?
Collaborative cross-functional teams

http://bit.ly/ac17-collab

BREAKING DATA SCIENCE OPEN

ANACONDA & COLLABORATION

http://bit.ly/ac17-collab

STEP 1: ANACONDA

http://continuum.io/downloads

NOTEBOOKS FOR DATA SCIENCE COLLABORATION

Do you understand why notebooks are so popular?
There are many angles to this, but my take:

• Visual record of the data science process
• They tell a story, and support rich hyperlinked prose
• Data can be embedded
• Algorithms or analysis techniques are captured
• Interactive visualizations are inline
• Sharable
• Reproducible*

STEP 2: ANACONDA CLOUD

http://anaconda.org

STEP 2: (MY) ANACONDA CLOUD

http://anaconda.org/ijstokes

STEP 3: ANACONDA ENTERPRISE (TODAY)

STEP 3: ANACONDA ENTERPRISE (COMING SOON)

ANACONDA: GIVING SUPERPOWERS TO THE PEOPLE WHO CHANGE THE WORLD

TEAMS

http://bit.ly/ac17-collab

THANK YOU! QUESTIONS?

Ian Stokes-Rees @ijstokes

http://bit.ly/ac17-collab
