Why Open Data Science Matters

Upload: continuum-analytics

Post on 21-Feb-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: Why Open Data Science Matters

© 2016 Continuum Analytics - Confidential & Proprietary 1

Why Open Data Science Matters

How Open Data Science is Eating the World

Page 2: Why Open Data Science Matters

About Me

Travis Oliphant
Co-Founder, President, Chief Data Scientist, Continuum Analytics (just stepped down from CEO role)

• PhD in Biomedical Engineering at Mayo Clinic
• MS/BS in EE and Math
• Professor of EE (Inverse Problems)
• Creator of SciPy
• Author of NumPy
• Founding Chair of NumFOCUS / PyData
• Previous Python Software Foundation Director
• Co-creator of Anaconda

PEP 357, PEP 3118, SciPy

Page 3: Why Open Data Science Matters

Business Intelligence & Predictive Analytics

Using Data for Insight & Human-in-the-Loop actions

Page 4: Why Open Data Science Matters

Cognitive Intelligence

Using Data & Deep Learning to Make Recommendations

Page 5: Why Open Data Science Matters

Page 6: Why Open Data Science Matters

Page 7: Why Open Data Science Matters

Neural network with several layers trained with ~130,000 images.

Matched trained dermatologists with 91% area under sensitivity-specificity curve.

Keys:
• Access to Data
• Access to Software
• Access to Compute

Page 8: Why Open Data Science Matters

Ensuring you never break down

OPPORTUNITY
• A car manufacturer “talks” to millions of vehicles every day on ignition to collect information on the battery, fuel pump, starter, etc.
• Hundreds of data points are transmitted over a long period of time.
• More cars are being added each day, and more sensors are being added over the cell-phone network with 4G LTE coming.
• Much more information will be collected in the coming years.
• What do we do with all of this data?

SOLUTION
• Data is fed to a “logistic regression” predictive model on machines at the manufacturer’s offices to predict if your car will break down.
• The company would like to make more real-time predictions to preempt even more equipment failures with more data.
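
As a minimal sketch of the kind of model named on this slide, the following fits a logistic regression by gradient descent on a tiny made-up dataset. The feature names (battery voltage, starter crank time), the data values, and the helpers `fit_logistic` and `predict_proba` are all hypothetical illustrations; the slides do not describe the manufacturer's actual features or tooling.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this row (predicted probability minus label).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_proba(weights, bias, x):
    """Estimated probability of a breakdown for one reading."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical standardized readings: (battery voltage, starter crank time).
readings = [(1.2, -0.8), (0.9, -1.1), (0.7, -0.3),
            (-0.6, 0.9), (-1.1, 1.3), (-0.9, 0.4)]
broke_down = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(readings, broke_down)
healthy_risk = predict_proba(weights, bias, (1.0, -1.0))
failing_risk = predict_proba(weights, bias, (-1.0, 1.0))
```

The output is a probability rather than a yes/no answer, which is what would let a manufacturer choose its own threshold for contacting a driver.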

Page 9: Why Open Data Science Matters

Reducing adverse police events

Most police officers are rational, caring, and well-trained professionals, but excessive force and other “adverse” events between officers and the public still occur in a few cases.

Can we predict when it will happen? What are contributing factors?

Initial analysis of police dispatch data in one county showed that:
1) travel time to the event
2) recent response to “traumatic” cases (such as suicide and violence)
were strongly predictive of future adverse events in a logistic regression model.

Page 10: Why Open Data Science Matters

Open Data Science

Connecting Data, Analytics & Computation

Page 11: Why Open Data Science Matters

Data Science is…

“An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms.”

Page 12: Why Open Data Science Matters

The Past vs. Present

Proprietary Software (Decreasing Use)
• Vendor lock-in
• High costs
• Lack of integration
• Inability to easily deploy
• Skills gap

Open Source Software (Accelerating Adoption)
• Avoids vendor lock-in
• Reduces cost
• Open APIs and connectors
• Eliminates chasm between build & deploy
• Accessible to tomorrow’s talent

Page 13: Why Open Data Science Matters

Evolving Technology

Proprietary Software (Status Quo)
• Limited Data Sources
• Legacy Compute Engines
• On-premises only

Open Source Software (Next Generation)
• Big Data
• Modern Analytics
• Distributed Computing
• High Performance Computing
• Hybrid Cloud + On-premises
• Streaming

Page 14: Why Open Data Science Matters

Evolving Roles

Proprietary Software (Status Quo)
• Analyst
• Programmer
• IT admin

Open Source Software (Next Generation)
• Data Science Teams
• Business Analyst
• Quantitative Developer
• Data Scientist
• Developer
• Data Engineer
• DevOps

Page 15: Why Open Data Science Matters

Open Data Science is…

an inclusive movement that makes open source tools of data science (data, analytics, & computation) easily work together as a connected ecosystem

Page 16: Why Open Data Science Matters

Open Data Science means…

Availability | Innovation | Interoperability | Transparency
For everyone in the data science team

OPEN DATA SCIENCE IS THE FOUNDATION TO MODERNIZATION

Page 17: Why Open Data Science Matters

Data Science is not just Machine Learning…

Distributed Systems

Business Intelligence

Machine Learning / Statistics

Web

Scientific Computing / HPC

Page 18: Why Open Data Science Matters

Data Science is Interdisciplinary…

• Distributed Systems: Hadoop, Spark
• Business Intelligence: Data warehouse, querying, reporting
• Machine Learning / Statistics: Classification, deep learning, regression, PCA
• Web: Web crawling, scraping, 3rd party data & API providers, predictive services & APIs
• Scientific Computing / HPC: GPUs, multi-cores

Page 19: Why Open Data Science Matters

Open Source Communities Create Powerful Technology for Data Science

Numba, dask, xlwings, Airflow, Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 20: Why Open Data Science Matters

Python is the common language

Numba, dask, xlwings, Airflow, Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 21: Why Open Data Science Matters

Python’s Not the Only One…

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

SQL

Machine Learning / Statistics

Page 22: Why Open Data Science Matters

But it’s also a Great Glue Language

Distributed Systems

Business Intelligence

Machine Learning / Statistics

Web

Scientific Computing / HPC

SQL

Page 23: Why Open Data Science Matters

Anaconda is the Open Data Science Platform Bringing Technology Together…

Numba, dask, xlwings, Airflow, Blaze

Distributed Systems

Business Intelligence

Web

Scientific Computing / HPC

Machine Learning / Statistics

Page 24: Why Open Data Science Matters

ANACONDA SPEEDS UP OPEN SOURCE POLICY MODELING 100X

Arms citizen data scientists with power to evaluate tax policies

“We’ve enabled citizens to have a much more direct impact on policy outcomes; the numbers that policy makers are seeing when considering different policies are now generated by tools that can be improved by anyone with the skills and passion to do so.”
Matthew Jensen

CHALLENGE
Working with such a large dataset, the team at TaxBrain needed technology that would allow their mathematically intensive economic simulation models to be fast, efficient and easy to access for open source contributors.

SOLUTION
TaxBrain is able to hit performance goals and maintain stronger relationships with open source contributors through the use of the Anaconda platform for development, hosting, package management and high performance speed ups.

Page 25: Why Open Data Science Matters

ANACONDA & BOKEH COMBINE RIGHT SCIENCE, DATA, EXPLORATION FOR FASTER DRUG THERAPY DISCOVERIES

Empowers scientists to discover drug therapies at the intersection of biology and artificial intelligence

“Biologists need to know more than just what a healthy and diseased cell image looks like. Our platform, powered by Bokeh on Anaconda, combines contextual cell images together with all our data to empower the discovery of potential drug remedies for rare genetic diseases faster than any other time in history.”
Blake Borgeson, CTO & co-founder, Recursion Pharmaceuticals

CHALLENGE
Recursion Pharmaceuticals needs to enable biologists to interactively discover potential drug remedies for rare genetic diseases with a new and innovative drug assay platform. This platform needs to easily show the impact of drug therapies on human cells with cell mutations that cause the loss or gain of cell functions from the rare genetic disease.

SOLUTION
Anaconda and Bokeh power the innovative drug discovery assay platform that combines biology, bioinformatics and machine learning with a self-service, interactive image explorer that makes it easy for biologists to identify crucial cell differences to assess drug efficacy and discover new therapeutic remedies for rare genetic diseases.

Page 26: Why Open Data Science Matters

Racial Data vs. Congressional Districts

https://anaconda.org/jbednar/census-hv-dask/notebook

Page 27: Why Open Data Science Matters

Empowering the Data Science Team

Page 28: Why Open Data Science Matters

Modern Data Science Teams use…

Data Scientist
• Hadoop / Spark
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization

Biz Analyst
• Spreadsheets
• Visualization
• Notebooks
• Analytic Development Environment

Data Engineer
• Database / Data Warehouse
• ETL

Developer
• Programming Languages
• Analytic Libraries
• IDE
• Notebooks
• Visualization

DevOps
• Database / Data Warehouse
• Middleware
• Programming Languages

RIGHT TECHNOLOGY FOR THE PROBLEM

Page 29: Why Open Data Science Matters

Modern Data Science Teams Want…

DATA SCIENCE COLLABORATION

SELF-SERVICE DATA SCIENCE

DATA SCIENCE DEPLOYMENT

OPEN DATA SCIENCE

Page 30: Why Open Data Science Matters

• Accelerate Time-to-Value

• Connect Data, Analytics & Compute

• Empower Data Science Teams

…is the leading Open Data Science platform powered by Python, the fastest growing data science language

Page 31: Why Open Data Science Matters

ACCELERATE Time-to-Value
• INNOVATE faster through managed agile experimentation
• MOVE from analysis to deployment immediately
• DELIVER powerful results backed by high performance open data science platform

CONNECT Data, Analytics & Compute
• LEVERAGE innovative open source analytics to extract value from data
• MAXIMIZE your computational power to easily analyze all data
• CONNECT and integrate all your data sources for predictive models

EMPOWER Data Science Teams
• ITERATE quickly to create powerful analysis and predictive models
• COLLABORATE and share with your data science team
• PUBLISH interactive results to the business

Page 32: Why Open Data Science Matters

Open Data Science Platform

ACCELERATE. CONNECT. EMPOWER.

Page 33: Why Open Data Science Matters

Anaconda Gives Superpowers To People Who Change The World

Page 34: Why Open Data Science Matters

Open Data Science: Vibrant and Growing Community

• Python Community: 30M+
• Packages in Anaconda: 720+
• R Community: 16M+
• Spark Python Usage: 50%+
• Anaconda Downloads: 12M+

Page 35: Why Open Data Science Matters

Anaconda and Open Data Science Growth

Page 36: Why Open Data Science Matters

Anaconda is Trusted by Industry Leaders

Financial Services
• Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting

Government
• Fraud detection, data crawling, web & cyber data analytics, statistical modeling

Healthcare & Life Sciences
• Genomics data processing, cancer research, natural language processing for health data science

High Tech
• Customer behavior, recommendations, ad bidding, retargeting, social media analytics

Retail & CPG
• Engineering simulation, supply chain modeling, scientific analysis

Oil & Gas
• Pipeline monitoring, noise logging, seismic data processing, geophysics

Page 37: Why Open Data Science Matters

[Diagram labels: YARN, JVM]

Bottom Line: 10-100X faster performance
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching
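
To make the DAG idea concrete, here is a toy evaluator for a task graph written as a plain dictionary, in which every task names the tasks it depends on. This is an illustrative sketch using only the Python standard library, not Dask's scheduler or API; the graph keys, the `run_graph` helper, and the numbers are made up. Tasks whose inputs are ready at the same time are submitted to a thread pool together.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # Python 3.9+

def run_graph(graph):
    """Run tasks of the form key -> (function, [dependency keys]) in dependency order."""
    sorter = TopologicalSorter({key: deps for key, (_, deps) in graph.items()})
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # tasks whose inputs are all computed
            futures = {
                key: pool.submit(graph[key][0], *(results[d] for d in graph[key][1]))
                for key in ready
            }
            for key, future in futures.items():
                results[key] = future.result()
                sorter.done(key)  # unlocks tasks that depend on this one
    return results

# Hypothetical graph: two independent loads, two sums, one final combine.
graph = {
    "load_a": (lambda: [1, 2, 3], []),
    "load_b": (lambda: [10, 20], []),
    "sum_a": (sum, ["load_a"]),
    "sum_b": (sum, ["load_b"]),
    "total": (lambda a, b: a + b, ["sum_a", "sum_b"]),
}

results = run_graph(graph)
```

Because the two branches share no dependencies, nothing forces them to run one after the other; that freedom is what a DAG-based scheduler exploits.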

Bottom Line
• Leverage Python & R with Spark

[Diagram labels: HDFS; Ibis; Impala; PySpark & SparkR; Python & R ecosystem; Dask + TensorFlow; Batch Processing; Interactive Processing; High Performance, Interactive, Batch Processing; Native read & write; NumPy, Pandas, … 720+ packages]

Page 38: Why Open Data Science Matters

Journey to Open Data Science

Page 39: Why Open Data Science Matters

What are typical barriers to enterprises adopting Open Data Science?

1. Reproducibility
2. Governance
3. Open source assurance

Page 40: Why Open Data Science Matters

Embrace Innovation Without Anarchy

From http://www.slideshare.net/RevolutionAnalytics/r-at-microsoft

Reproducibility

Page 41: Why Open Data Science Matters

Embrace Innovation Without Anarchy

Controlled access to data science assets

Governance

Page 42: Why Open Data Science Matters

Embrace Innovation Without Risk

Mitigate legal risk through selection of an appropriate OSS license and vendor-backed open source assurance

Open Source Assurance

Page 43: Why Open Data Science Matters

Page 44: Why Open Data Science Matters

Reaching Full Potential with Open Data Science

• Make “Code to Data” connection seamless and easy
• Amplify learning
• Shrink time from idea to production
• Choose the right algorithms for your data and question
• Cultivate and participate in community mission

Page 45: Why Open Data Science Matters

Code to Data

Data Silos are everywhere

Python is great glue to connect the Silos

The same data must be stored twice in memory for different languages because there are no common data descriptions!

The Blaze project is still working to solve this!
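
Within Python itself, the buffer protocol (PEP 3118, listed on the About Me slide) is one existing way for two consumers to share a single block of memory along with a description of its layout. The sketch below uses only the standard library to show what "one copy plus a common description" looks like; the variable names and values are illustrative, and it does not show how Blaze approaches the cross-language case.

```python
from array import array

# One block of memory holding four C doubles.
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the same memory together with its description
# (item format, item size, shape) without copying it.
view = memoryview(samples)
description = (view.format, view.itemsize, view.shape)

# A second consumer reinterprets the same memory as raw bytes; still no copy.
raw_bytes = view.cast("B")

# Writing through the view changes the original array, which shows
# that both names refer to one stored copy of the data.
view[0] = 99.0
first_value = samples[0]
byte_count = len(raw_bytes)
```

The description tuple is the important part: a consumer that understands the format string can read the data in place instead of asking for its own converted copy.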

Page 47: Why Open Data Science Matters

Amplify Learning

Filling this gap with better tools and streamlining education to become a data scientist

Skills Assessment and Data Science Placement Program

Page 48: Why Open Data Science Matters

Amplify Learning

Skills Assessment and Data Science Placement Program

Formal “post-graduate” independent-study program with “on-the-job” learning and mentoring that takes people from where they are and improves their data-science, data-engineering, and quantitative programming skills.

Contact me if you:
1. want to enroll
2. want to “contract-to-hire” someone with Anaconda capabilities

Page 49: Why Open Data Science Matters

Shrinking Time from Idea to Production

• collaboration
• automation
• common platforms
• shared abstractions
• versioning
• authentication
• governance
• recognize it is iterative

ANACONDA FUSION
ANACONDA ENTERPRISE
ANACONDA
ANACONDA NARRATIVES

i.e., the 5.x version of Anaconda products!

Page 50: Why Open Data Science Matters

Shrinking Time from Idea to Production

Using notebooks as source for dashboards, apps, services, …

Page 51: Why Open Data Science Matters

Right algorithms

• Supervised Learning: uses “labeled” data to train a model
  • Regression: predicted variable is continuous
  • Classification: predicted variable is discrete
• Unsupervised Learning
  • Clustering: discover categories in the data
  • Density Estimation: determine representation of data
  • Dimensionality Reduction: represent data with fewer variables or feature vectors
• Reinforcement Learning: “goal-oriented” learning (e.g. drive a car)
• Deep Learning: neural networks with many layers
• Semi-supervised Learning: use some labeled data for training
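
As a small illustration of the unsupervised case in the list above, the sketch below clusters one-dimensional values with a bare-bones k-means loop. No labels are supplied; the categories emerge from the data. The `kmeans_1d` helper, the starting centers, and the values are hypothetical, and a real project would more likely use a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            groups[nearest].append(value)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements that fall into two obvious clumps.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
```

A supervised method would instead be handed the category of each value up front and learn a rule for predicting it.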

Page 52: Why Open Data Science Matters

Right algorithms

There is no magic solution to your problem!

Page 53: Why Open Data Science Matters

Right algorithms

Practical Parallelism for Scale

[Plot: Average Credit Card Purchase in USD (y-axis) by Year (x-axis)]

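~2300 CSV files with >9 million transactions over 8 years from the Ashley Madison “hack”

Use read_csv and some transformations in parallel to build a distributed data-frame (with ~2300 partitions, one for each file).

# Reduce ~2300 small partitions to 100 larger ones
ddf = ddf.repartition(npartitions=100)
# Index by date so time-based operations are available
ndf = ddf.set_index('DATE')
# persist() returns a new collection held in cluster memory; keep the reference
ndf = ndf.persist()
# Monthly mean purchase amount, computed in parallel and then plotted
ndf.AMOUNT.resample('1M').mean().compute().plot()

4 nodes with 4 cores each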
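
A per-month mean parallelizes well because each partition can be reduced independently to small per-month (sum, count) pairs, and only those pairs need to be combined at the end. The following standard-library sketch illustrates that idea on made-up data; it is not Dask's implementation, and the partition contents and the `partial_sums` and `combine` helpers are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_sums(partition):
    """Reduce one partition of (date, amount) rows to {month: [sum, count]}."""
    totals = defaultdict(lambda: [0.0, 0])
    for date, amount in partition:
        month = date[:7]  # 'YYYY-MM' prefix of an ISO date string
        totals[month][0] += amount
        totals[month][1] += 1
    return dict(totals)

def combine(partials):
    """Merge per-partition results, then divide to get the mean per month."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for month, (total, count) in partial.items():
            merged[month][0] += total
            merged[month][1] += count
    return {month: total / count for month, (total, count) in sorted(merged.items())}

# Hypothetical partitions, standing in for one chunk per CSV file.
partitions = [
    [("2014-01-03", 20.0), ("2014-01-19", 40.0)],
    [("2014-01-25", 60.0), ("2014-02-02", 10.0)],
    [("2014-02-14", 30.0)],
]

with ThreadPoolExecutor(max_workers=4) as pool:
    monthly_mean = combine(pool.map(partial_sums, partitions))
```

Averaging the per-partition means directly would give the wrong answer when partitions hold different numbers of rows, which is why the sketch carries sums and counts separately until the final step.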

Page 54: Why Open Data Science Matters

Dask feels modern.

Flexible parallelism

• machine learning
• advanced analytics and modeling
• advanced data munging

all an import away

Right algorithms

Page 55: Why Open Data Science Matters

CULTIVATION of COMMUNITY

Great works are started by a small group (usually 1-3 people).

Page 56: Why Open Data Science Matters

CULTIVATION of COMMUNITY

Python Community Code of Conduct:

A member of the Python community is:
• Open
• Considerate
• Respectful

Page 57: Why Open Data Science Matters

CULTIVATION of COMMUNITY

Page 58: Why Open Data Science Matters

Continuum Analytics

We empower data science teams to make the world a better place