why open data science matters
TRANSCRIPT
© 2016 Continuum Analytics - Confidential & Proprietary 1
Why Open Data Science Matters
How Open Data Science is Eating the World
About Me
Travis Oliphant
• Co-Founder, President, Chief Data Scientist, Continuum Analytics (just stepped down from CEO role)
• PhD in Biomedical Engineering at Mayo Clinic
• MS/BS in EE and Math
• Professor of EE (Inverse Problems)
• Creator of SciPy
• Author of NumPy
• Founding Chair of NumFOCUS / PyData
• Previous Python Software Foundation Director
• Co-creator of Anaconda
PEP 357, PEP 3118
Business Intelligence & Predictive Analytics
Using Data for Insight & Human-in-the-Loop actions
Cognitive Intelligence
Using Data & Deep Learning to Make Recommendations
Neural network with several layers trained with ~130,000 images.
Matched trained dermatologists with 91% area under sensitivity-specificity curve.
Keys:
• Access to Data
• Access to Software
• Access to Compute
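The "91% area under sensitivity-specificity curve" figure above is the kind of number commonly reported as ROC AUC. A minimal sketch of how such an area can be computed from labels and classifier scores, using the pairwise-ranking formulation (the fraction of positive/negative pairs in which the positive example scores higher). The function name and the toy labels and scores are made up for illustration and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical toy data: 1 = diseased, 0 = healthy
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based version would be the usual choice for large evaluation sets.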
Ensuring you never break down
OPPORTUNITY
• Car manufacturer “talks” to millions of vehicles every day on ignition to collect information on the battery, fuel pump, starter, etc.
• Hundreds of data points are transmitted over a long period of time.
• More cars are being added each day, and more sensors are being added over the cell-phone network with 4G LTE coming.
• Much more information will be collected in the coming years.
• What do we do with all of this data?
SOLUTION
• Data is fed to a “logistic regression” predictive model on machines at the manufacturer’s offices to predict if your car will break down.
• The company would like to make more real-time predictions to preempt even more equipment failures with more data.
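To make the “logistic regression” step concrete, here is a minimal sketch, not the manufacturer's model. It fits a logistic regression by plain gradient descent using only the standard library. The two features (battery voltage and starter crank time), the synthetic data, and every function name are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def breakdown_probability(w, b, x):
    """Predicted probability that a vehicle with features x will fail."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Synthetic, made-up training data: two standardized sensor features.
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    voltage, crank_time = rng.gauss(0, 1), rng.gauss(0, 1)
    # Assumed ground truth: low voltage and slow cranking raise failure risk.
    failed = 1 if (-2.0 * voltage + 1.5 * crank_time + rng.gauss(0, 0.5)) > 0 else 0
    X.append([voltage, crank_time])
    y.append(failed)

w, b = fit_logistic(X, y)
risky = breakdown_probability(w, b, [-2.0, 2.0])    # weak battery, slow starter
healthy = breakdown_probability(w, b, [2.0, -2.0])  # strong battery, fast starter
```

In practice a library implementation would replace the hand-written loop; the point here is only that the model maps sensor readings to a failure probability that can be thresholded or ranked.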
Reducing adverse police events
Most police officers are rational, caring, and well-trained professionals, but excessive force and other “adverse” events between officers and the public still occur in a few cases.
Can we predict when it will happen? What are contributing factors?
Initial analysis of police dispatch data in one county showed that:
1) travel-time to the event
2) recent response to “traumatic” cases (such as suicide and violence)
were strongly predictive of future adverse events in a logistic regression model.
Open Data Science
Connecting Data, Analytics & Computation
Data Science is…
“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms.”
The Past vs. Present
Proprietary Software: Decreasing Use
• Vendor lock in
• High costs
• Lack of integration
• Inability to easily deploy
• Skills gap
Open Source Software: Accelerating Adoption
• Avoids vendor lock in
• Reduces cost
• Open APIs and connectors
• Eliminates chasm between build & deploy
• Accessible to tomorrow’s talent
Evolving Technology
Proprietary Software: Status Quo
• Limited Data Sources
• Legacy Compute Engines
• On-premises only
Open Source Software: Next Generation
• Big Data
• Modern Analytics
• Distributed Computing
• High Performance Computing
• Hybrid Cloud + On-premises
• Streaming
Evolving Roles
Proprietary Software: Status Quo
• Analyst
• Programmer
• IT admin
Open Source Software: Next Generation
• Data Science Teams
• Business Analyst
• Quantitative Developer
• Data Scientist
• Developer
• Data Engineer
• DevOps
Open Data Science is…
an inclusive movement that makes open source tools of data science (data, analytics, & computation) easily work together as a connected ecosystem
Open Data Science means…
Availability | Innovation | Interoperability | Transparency
For everyone in the data science team
OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION
Data Science is not just Machine Learning…
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
Data Science is Interdisciplinary…
• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores
Open Source Communities Create Powerful Technology for Data Science
Numba, dask, xlwings, Airflow, Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
Python is the common language
Numba, dask, xlwings, Airflow, Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
Python’s Not the Only One…
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
SQL
Machine Learning / Statistics
But it’s also a Great Glue Language
Distributed Systems
Business Intelligence
Machine Learning / Statistics
Web
Scientific Computing / HPC
SQL
Anaconda is the Open Data Science Platform Bringing Technology Together…
Numba, dask, xlwings, Airflow, Blaze
Distributed Systems
Business Intelligence
Web
Scientific Computing / HPC
Machine Learning / Statistics
ANACONDA SPEEDS UP OPEN SOURCE POLICY MODELING 100X
Arms citizen data scientists with power to evaluate tax policies
“We’ve enabled citizens to have a much more direct impact on policy outcomes; the numbers that policy makers are seeing when considering different policies are now generated by tools that can be improved by anyone with the skills and passion to do so.”
Matthew Jensen
CHALLENGE
Working with such a large dataset, the team at TaxBrain needed technology that would allow their mathematically intensive economic simulation models to be fast, efficient and easy to access for open source contributors.
SOLUTION
TaxBrain is able to hit performance goals and maintain stronger relationships with open source contributors through the use of the Anaconda platform for development, hosting, package management and high performance speed ups.
ANACONDA & BOKEH COMBINE RIGHT SCIENCE, DATA, EXPLORATION FOR FASTER DRUG THERAPY DISCOVERIES
Empowers scientists to discover drug therapies at the intersection of biology and artificial intelligence
“Biologists need to know more than just what a healthy and diseased cell image looks like. Our platform, powered by Bokeh on Anaconda, combines contextual cell images together with all our data to empower the discovery of potential drug remedies for rare genetic diseases faster than any other time in history.”
Blake Borgeson, CTO & co-founder, Recursion Pharmaceuticals
CHALLENGE
Recursion Pharmaceuticals needs to enable biologists to interactively discover potential drug remedies for rare genetic diseases with a new and innovative drug assay platform. This platform needs to easily show the impact of drug therapies on human cells with cell mutations that cause the loss or gain of cell functions from the rare genetic disease.
SOLUTION
Anaconda and Bokeh power the innovative drug discovery assay platform that combines biology, bioinformatics and machine learning with a self-service, interactive image explorer that makes it easy for biologists to identify crucial cell differences to assess drug efficacy and discover new therapeutic remedies for rare genetic diseases.
Racial Data vs. Congressional Districts
https://anaconda.org/jbednar/census-hv-dask/notebook
Empowering the Data Science Team
Modern Data Science Teams use…
RIGHT TECHNOLOGY FOR THE PROBLEM
Data Scientist
• Hadoop / Spark
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
Biz Analyst
• Spreadsheets
• Visualization
• Notebooks
• Analytic Development Environment
Data Engineer
• Database / Data Warehouse
• ETL
Developer
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization
DevOps
• Database / Data Warehouse
• Middleware
• Programming Languages
Modern Data Science Teams Want…
DATA SCIENCE COLLABORATION
SELF-SERVICE DATA SCIENCE
DATA SCIENCE DEPLOYMENT
OPEN DATA SCIENCE
…is the leading Open Data Science platform powered by Python, the fastest growing data science language
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by a high performance open data science platform
CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models
EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business
Open Data Science Platform
ACCELERATE. CONNECT. EMPOWER.
Anaconda Gives Superpowers To People Who Change The World
Open Data Science
Vibrant and Growing Community
Python Community: 30M+
Packages in Anaconda: 720+
R Community: 16M+
Spark Python Usage: 50%+
Anaconda Downloads: 12M+
Anaconda and Open Data Science Growth
Anaconda …is Trusted by Industry Leaders
Financial Services
• Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting
Government
• Fraud detection, data crawling, web & cyber data analytics, statistical modeling
Healthcare & Life Sciences
• Genomics data processing, cancer research, natural language processing for health data science
High Tech
• Customer behavior, recommendations, ad bidding, retargeting, social media analytics
Retail & CPG
• Engineering simulation, supply chain modeling, scientific analysis
Oil & Gas
• Pipeline monitoring, noise logging, seismic data processing, geophysics
Bottom Line
10-100X faster performance
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching
Bottom Line
• Leverage Python & R with Spark
[Diagram labels: YARN; JVM; HDFS; Ibis; Impala; PySpark & SparkR; Python & R ecosystem; Dask + TensorFlow; Batch Processing; Interactive Processing; High Performance, Interactive, Batch Processing; Native read & write; NumPy, Pandas, … 720+ packages]
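The “parallelism using directed acyclic graphs (DAGs)” idea can be illustrated with a toy task graph. This sketch is loosely modeled on the dict-of-tasks style used by graph schedulers and is not any library's actual scheduler. The names `is_task` and `run_graph` are hypothetical, and the tasks run one at a time, whereas a real scheduler would dispatch independent tasks to different workers.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from operator import add, mul

def is_task(value):
    """A task is a tuple whose first element is callable: (func, *args)."""
    return isinstance(value, tuple) and bool(value) and callable(value[0])

def run_graph(graph, target):
    """Evaluate every node of a dict-based task graph and return `target`.

    Arguments that are keys of `graph` refer to other nodes; anything
    else is passed through as a literal. Arguments must be hashable.
    """
    dependencies = {
        key: {arg for arg in value[1:] if arg in graph} if is_task(value) else set()
        for key, value in graph.items()
    }
    results = {}
    # static_order() yields each node only after all of its dependencies.
    for key in TopologicalSorter(dependencies).static_order():
        value = graph[key]
        if is_task(value):
            func, *args = value
            results[key] = func(*(results[a] if a in graph else a for a in args))
        else:
            results[key] = value
    return results[target]

graph = {
    "x": 1,
    "y": 2,
    "total": (add, "x", "y"),       # depends on x and y
    "scaled": (mul, "total", 10),   # depends on total
}
answer = run_graph(graph, "scaled")
```

Because the graph records which results each task needs, any two tasks with no path between them are safe to run at the same time; that is the property a distributed scheduler exploits.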
Journey to Open Data Science
What are typical barriers to enterprises adopting Open Data Science?
1. Reproducibility
2. Governance
3. Open source assurance
Embrace Innovation Without Anarchy
From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft
Reproducibility
Embrace Innovation Without Anarchy
Controlled access to data science assets
Governance
Embrace Innovation Without Risk
Open Source Assurance
Mitigate legal risk through selection of an appropriate OSS license and vendor-backed open source assurance
Reaching Full Potential with Open Data Science
• Make “Code to Data” connection seamless and easy
• Amplify learning
• Shrink time from idea to production
• Choose the right algorithms for your data and question
• Cultivate and participate in community mission
Code to Data
Data Silos are everywhere
Python is great glue to connect the Silos
The same data must be stored twice in memory for different languages because there are no common data descriptions!
Blaze project still working to solve this!
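Within Python itself, the buffer protocol (PEP 3118, listed on the “About Me” slide) is one existing way to share memory together with a description of its layout. The sketch below is an illustration of that idea, not the Blaze approach: a `memoryview` exposes the element format, item size and shape of an `array`, and writes through the view land in the original buffer without a copy.

```python
from array import array

values = array("d", [1.0, 2.0, 3.0, 4.0])   # typed, contiguous buffer of doubles
view = memoryview(values)                   # PEP 3118 view onto the same memory

# The view carries a description of the data: element format, size, shape.
description = (view.format, view.itemsize, view.shape)

# Writing through the view changes the original array, so memory is shared.
view[0] = 10.0
tail = view[2:]          # slicing a memoryview does not copy either
tail[0] = 30.0

shared = values.tolist()
```

Any consumer that understands the same description can read the buffer in place; the problem the slide points at is that no equally common description exists across languages.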
Code to Data
https://www.enterprisetech.com/2017/02/10/ibm-extends-mainframe-support-anaconda/
A lot of the world’s data is still on Mainframe.
Don’t pay to move it. Run analytics directly on mainframe with Open Data Science and Anaconda.
Amplify Learning
Filling this gap with better tools
And streamlining education to be a data scientist
Skills Assessment and Data Science Placement Program
Amplify Learning
Skills Assessment and Data Science Placement Program
Formal “post-graduate” independent-study program with “on-the-job” learning and mentoring that takes people from where they are and improves their data-science, data-engineering, and quantitative programming skills.
Contact me if you:
1. want to enroll
2. want to “contract-to-hire” someone with Anaconda capabilities
Shrinking Time from Idea to Production
• collaboration
• automation
• common platforms
• shared abstractions
• versioning
• authentication
• governance
• recognize it is iterative
ANACONDA FUSION
ANACONDA ENTERPRISE
ANACONDA
ANACONDA NARRATIVES
i.e., the 5.x version of Anaconda products!
Shrinking Time from Idea to Production
Using notebooks as source for dashboards, apps, services, …
Right algorithms
• Supervised Learning: uses “labeled” data to train a model
  • Regression: predicted variable is continuous
  • Classification: predicted variable is discrete
• Unsupervised Learning
  • Clustering: discover categories in the data
  • Density Estimation: determine representation of data
  • Dimensionality Reduction: represent data with fewer variables or feature vectors
• Reinforcement Learning: “goal-oriented” learning (e.g. drive a car)
• Deep Learning: neural networks with many layers
• Semi-supervised Learning: uses some labeled data for training
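The regression/classification distinction in the list above comes down to the type of the predicted variable. A minimal sketch with one feature and made-up toy data: a least-squares line predicts a continuous value, while a nearest-centroid rule predicts a discrete label. The function names are hypothetical and both methods were chosen only because they fit in a few lines.

```python
from statistics import mean

def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for a continuous target."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def fit_centroids(xs, labels):
    """Classification: mean feature value of each class."""
    return {c: mean(x for x, l in zip(xs, labels) if l == c) for c in set(labels)}

def classify(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

# Regression on toy data: the target is a number.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
predicted_value = slope * 5 + intercept

# Classification on toy data: the target is a category.
centroids = fit_centroids([1.0, 1.2, 3.8, 4.0], ["low", "low", "high", "high"])
predicted_label = classify(centroids, 3.0)
```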
Right algorithms
There is no magic solution to your problem!
Right algorithms
Practical Parallelism for Scale
[Figure: Average Credit Card Purchase in USD, by Year]
~2300 CSV files with >9 million transactions over 8 years from the Ashley Madison “hack”
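# ddf is the distributed data-frame built from the CSV files with read_csv
ddf = ddf.repartition(npartitions=100)
ndf = ddf.set_index('DATE')
# persist() returns a new collection, so keep the result to hold it in memory
ndf = ndf.persist()
ndf.AMOUNT.resample('1M').mean().compute().plot()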
Use read_csv and some transformations in parallel to build a distributed data-frame (with ~2300 partitions, one for each file).
4 nodes with 4 cores each
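The pattern behind a partitioned mean like the one above can be sketched without a cluster: each partition is reduced independently to per-month (total, count) pairs, and the partial results are then merged. This is a standard-library illustration of the idea, not how Dask implements it. The function names and the two tiny partitions are made up, and threads are used only to show the shape of the computation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_sums(partition):
    """Per-partition step: accumulate [total, count] for each month key."""
    acc = defaultdict(lambda: [0.0, 0])
    for date, amount in partition:
        month = date[:7]           # 'YYYY-MM' prefix of an ISO date string
        acc[month][0] += amount
        acc[month][1] += 1
    return acc

def monthly_mean(partitions, workers=4):
    """Combine step: merge the partial sums, then divide to get each mean."""
    totals = defaultdict(lambda: [0.0, 0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for acc in pool.map(partial_sums, partitions):
            for month, (total, count) in acc.items():
                totals[month][0] += total
                totals[month][1] += count
    return {month: total / count for month, (total, count) in sorted(totals.items())}

# Hypothetical partitions, standing in for one CSV file each.
partitions = [
    [("2015-01-03", 10.0), ("2015-01-20", 30.0)],
    [("2015-01-25", 20.0), ("2015-02-01", 50.0)],
]
result = monthly_mean(partitions)
```

Carrying totals and counts rather than per-partition means is what keeps the merged answer exact when a month spans several files.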
Dask feels modern.
Flexible parallelism
• machine learning
• advanced analytics and modeling
• advanced data munging
all an import away
Right algorithms
CULTIVATION of COMMUNITY
Great works are started by a small group (usually 1-3 people).
CULTIVATION of COMMUNITY
Python community Code of Conduct:
A member of the Python community is:
• Open
• Considerate
• Respectful
CULTIVATION of COMMUNITY
Continuum Analytics
We empower data science teams to make the world a better place
We Empower Data Science Teams to Make the World Better