journey to open data science
TRANSCRIPT
© 2016 Continuum Analytics - Proprietary
JOURNEY TO OPEN DATA SCIENCE
Michele Chambers, CMO & VP Products
Christine Doig, Sr. Data Scientist & Product Marketing Manager
Continuum Analytics
About Us
• Michele Chambers (@mcAnalytics)
CMO & VP Product, Continuum Analytics
M.B.A., Duke University; B.S. Computer Engineering
Author
– Big Data, Big Analytics (Wiley)
– Modern Analytics Methodologies: Driving Business Value with Analytics (Pearson FT Press)
– Advanced Analytics Methodologies: Driving Business Value with Analytics (Pearson FT Press)
About Us
• Christine Doig (@ch_doig)
Senior Data Scientist & Product Marketing Manager, Continuum Analytics
M.S. in Industrial Engineering, Polytechnic University of Catalonia
Open Source advocate and speaker: PyData, EuroPython, SciPy, PyCon
5+ years in advanced analytics, operations research, machine learning in energy, manufacturing & banking
WHAT’S OPEN DATA SCIENCE?
Data Science is …
“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms”
Wikipedia
Data Science is not just Machine Learning…
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
Data Science is Interdisciplinary…
• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores
Open Data Science is …
an inclusive movement
that makes the open source tools of data science (data, analytics, & computation) work together easily as a connected ecosystem
Open Data Science Means Open…
• Availability
• Innovation
• Interoperability
• Transparency
For everyone in the data science team
OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION
Why are major corporations moving to Modern Analytics & Open Data Science?
• Large Investment Banks: “How can I create and deploy timely risk models?”
• Global CPG Manufacturers: “How can I possibly identify the root causes of my complex problem and remediate early enough to create revenue assurance?”
• Major Upstream Oil & Gas: “How can I take advantage of all this new sensor information now?”
Industry Leaders Trusting Open Data Science
Open Data Science Community
• Python Community: 30M+
• R Community: 16M+
• Spark Python Usage: 50%
• Anaconda Downloads: 3M+
Open Source Communities Create Powerful Technology for Data Science
Numba
dask
xlwings
Airflow
Blaze
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
Python is the Common Language
Numba
dask
xlwings
Airflow
Blaze
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
Python Trusted by Industry Leaders
Python is Everywhere
“Everyone at JPMorgan now needs to know Python and there are around 5,000 developers using it at Bank of America. There are close to 10 million lines of Python code in Quartz and we got close to 3,000 commits a day. It’s a good scripting language and easily integrated into both the front and back ends, which was one of the reasons we chose it in the first place.”
Kirat Singh, Former Global Head of Risk Systems, Bank of America Merrill Lynch
Journey to Open Data Science
Why Companies are Migrating to ODS…
Large Investment Bank
Before
• Proprietary Technology
– Variety of DBs & DWs
– Excel, SQL, Custom Code, SAS
• Problem
– Hard to find people to create proprietary risk assessment models
– Takes months and years to deploy
After
• Open Data Science Technology
– Python & Anaconda
– NumPy, SciPy, PyData stack
• Results
– Create and deploy risk models in days not years
– Easier to find and hire data scientists
Why Companies are Migrating to ODS…
Global CPG Manufacturer
Before
• Proprietary Technology
– Matlab, Custom Fortran
– Perl, SQL
• Problem
– Complex model and simulation required with disparate internal and external data
After
• Open Data Science Technology
– Anaconda – Repository, PyData, Fortran
• Results
– Integrated multiple data feeds
– Created full lifecycle predictive model and simulation for revenue assurance
Why Companies are Migrating to ODS…
Major Upstream Oil & Gas
Before
• Proprietary Technology
– Industry specific visualization
• Problem
– Unable to ingest Big Data from sensors to proactively monitor oil well holes
After
• Open Data Science Technology
– Streaming visualization with Bokeh
• Results
– Created novel visualizations and predictive models using sensor data
– Gained insights into oil hole issues in weeks not years to detect issues earlier and increase profitability
Python’s Not the Only One…
SQL
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
But it’s also a Great Glue Language
SQL
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
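As a rough illustration of the "glue language" idea, the sketch below lets SQL do the grouping and then hands the result to Python for a follow-on calculation. It uses only the standard library; the table name, column names, and sensor values are hypothetical and are not taken from the slides.

```python
import sqlite3
import statistics

# Hypothetical sensor readings; names and values are illustrative only.
readings = [
    ("well_a", 101.5),
    ("well_a", 99.0),
    ("well_b", 250.0),
    ("well_b", 254.0),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE readings (well TEXT, pressure REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)

# SQL handles the grouping and aggregation ...
rows = conn.execute(
    "SELECT well, AVG(pressure) FROM readings GROUP BY well ORDER BY well"
).fetchall()
conn.close()

# ... and Python takes over for the follow-on analysis.
averages = dict(rows)
spread = statistics.pstdev(list(averages.values()))

print(averages)
print(spread)
```

The same pattern applies when the SQL side is a data warehouse rather than SQLite: the query stays in SQL, and Python connects it to the rest of the analysis.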
Anaconda is the Open Data Science Platform Bringing Technology Together…
Numba
dask
Airflow
SQL
xlwings
Blaze
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
But Most Importantly Empowering Everyone on the Data Science Team
Data Scientist | Biz Analyst | Data Engineer | Developer | DevOps
Deploy & Operate
Explore & Analyze
Collaborate & Publish
How are Modern Roles Different from Traditional Roles?
Traditional Roles: Individual | Silo
Modern Roles: Team | Collaborative
Modern Data Science Teams use…
Data Scientist
• Hadoop / Spark
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
Biz Analyst
• Spreadsheets
• Visualization
• Notebooks
• Analytic Development Environment
Data Engineer
• Database / Data Warehouse
• ETL
Developer
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
DevOps
• Database / Data Warehouse
• Middleware
• Programming Languages
RIGHT TECHNOLOGY FOR THE PROBLEM
Modern Data Science Teams Want
Collaboration
• Iterate on analysis
• Share discoveries with team
• Interact with teams across the globe
Interactivity
• Interact with data
• Build high performance models
• Visualize results in context
Integration
• Work with open source and legacy data systems
• Leverage data science languages: Python, R, Matlab, SAS, SPSS, Excel, Java, C/C++, C#, .NET, FORTRAN and more
Predict
Share
Deploy
with Open Data Science
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
Anaconda is… the leading Open Data Science platform powered by Python, the fastest growing open data science language
ACCELERATE Time-to-Value
INNOVATE faster through managed agile experimentation
MOVE from analysis to deployment immediately
DELIVER high performance analytics processing
CONNECT Data, Analytics & Compute
LEVERAGE innovative open source analytics to extract value from data
MAXIMIZE your computational power to easily analyze all your data
CONNECT and integrate all your data sources for predictive models
EMPOWER Data Science Teams
ITERATE quickly to create powerful analysis and predictive models
COLLABORATE and share with your data science team
PUBLISH interactive results to the business
Introducing Anaconda
The Open Data Science Platform Powered by Python
Enterprise Ready Platform
– Simplify administration
– Use open data science
– Collaborate with entire team
– Leverage modern architectures
– Integrate data sources
– Accelerate performance
OPERATIONS
DATA SCIENCE LANGUAGES
APPLICATIONS
DATA
HARDWARE
ANALYTICS
Model Building
Analytics Development
Data Exploration
SOFTWARE DEVELOPMENT
HIGH PERFORMANCE
Cloud On-premises
Business Analyst
Data Scientist
Developer
Data Engineer
DevOps
Data Science Team
BSD Licensed
Support Indemnification
Training
Consulting
DEMOS
• Anaconda Enterprise Notebooks: A collaborative environment for Data Science teams
• Anaconda for Excel: Bringing Advanced Analytics and Interactive Visualizations to MS Excel
ANACONDA ENTERPRISE NOTEBOOKS
A COLLABORATIVE ENVIRONMENT FOR DATA SCIENCE TEAMS
Search projects by tag and collaborator
Manage contributors
Manage collaborative projects
Organize notebooks, scripts and other files in projects
Manage teams’ collaborators
Save favorite projects
Data lineage
Access to collaborative executable notebooks
Interactive Visualizations
Advanced notebook extensions
Use advanced notebook extensions for enhanced collaboration
• Publishing to Anaconda Repository integration
• Revision control, commit and notebook diff comparison
• Collaborative locking
• Advanced interactive presentations editor
Easily publish and share your results with Business Leaders and Analysts
Leverage revision control, commit and diff comparison in notebooks
Notebook version tracking
Notebook changes diff comparison
Commit your work so you can return to earlier revisions and compare changes against them
Collaborate with notebook locking features
Transform a notebook into an Interactive Presentation with an advanced editor
Edit slide layout and content
Edit slide theme
Present your slides with embedded interactive visualizations
ANACONDA FOR EXCEL
BRINGING ADVANCED ANALYTICS AND INTERACTIVE VISUALIZATIONS TO MS EXCEL
Create browser-based Interactive Visualizations directly from your spreadsheet
Write your visualization directly into the formula
Access a powerful interactive toolbox
Enhance exploration with a customizable hover tool
Interactively explore your spreadsheet data with the crossfilter app
Select the variables to plot and the color, palette, and size of the points
Immediately view your updates in the visualization
Access advanced Machine Learning models to cluster your data
Simple formulas for advanced modeling applications
Easily input variables into algorithms with interactive widgets
Access a wide range of modeling algorithms
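To make "cluster your data" concrete, here is a minimal k-means sketch in plain Python. It is an illustration of the general technique only, not how Anaconda for Excel implements clustering; the function name `kmeans`, the sample points, and the choice of starting centers are all assumptions made for this example.

```python
import math


def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given centers."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical spreadsheet rows, each a pair of numeric columns.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

# Starting centers are picked by hand here to keep the run deterministic.
centers, labels = kmeans(points, centers=[points[0], points[3]])

print(labels)
print(centers)
```

In practice a library implementation would normally be used instead; the sketch only shows the assign-then-update loop that a clustering formula hides from the spreadsheet user.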
Enterprise Ready Open Data Science platform
– Interactive Visualization for Fast Exploration
– Data Science Team Collaboration
– Publishing & Sharing of Data Science Results
– Scale Up & Scale Out Advanced Analytics
– Governance, Provenance, & Security
Without the proprietary vendor cost & lock-in
ANACONDA GIVES SUPERPOWERS TO PEOPLE WHO CHANGE THE WORLD
Modern Data Science Teams Love Anaconda
Data Scientist
• Hadoop / Spark
• Programming Languages
• Analytic Libraries
• Notebooks
• Visualization
• IDE
Biz Analyst
• Spreadsheets
• Visualization
• Notebooks
• Analytic Development Environment
Data Engineer
• Database / Data Warehouse
• ETL
Developer
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
DevOps
• Database / Data Warehouse
• Middleware
• Programming Languages
Anaconda Trusted by Industry Leaders
• Financial Services: Risk Mgmt, Quant modeling, Data exploration and processing, algorithmic trading, compliance reporting
• Government: Fraud detection, data crawling, web & cyber data analytics, statistical modeling
• Healthcare & Life Sciences: Genomics data processing, cancer research, natural language processing for health data science
• High Tech: Customer behavior, recommendations, ad bidding, retargeting, social media analytics
• Retail & CPG: Engineering simulation, supply chain modeling, scientific analysis
• Oil & Gas: Pipeline monitoring, noise logging, seismic data processing, geophysics
Anaconda Subscriptions
Next Steps
• Open Data Science Journey Assessment [email protected] to schedule assessment
• Download Anaconda https://www.continuum.io/downloads
• Migrate your first model to ODS [email protected] to schedule a POC
Thank You
Michele Chambers
Twitter: @mcAnalytics
Christine Doig
Twitter: @ch_doig
Email: [email protected]
Twitter: @ContinuumIO
221 W. 6th Street, Suite #1550, Austin, TX 78701
+1 512.222.5440
CONTINUUM ANALYTICS
We Empower Data Science Teams to Change the World