Easy Data Science Deployment with the Anaconda Platform

© 2016 Continuum Analytics - Confidential & Proprietary

Upload: continuum-analytics

Post on 11-Apr-2017


Category: Data & Analytics



TRANSCRIPT

Page 1: Easy Data Science Deployment with the Anaconda Platform


Anaconda – Open Data Science Platform

Page 2: Easy Data Science Deployment with the Anaconda Platform


Continuum Analytics is the company behind Anaconda and offers:

– Open-Source Software

– Commercial Software

– Training

– Consulting

Anaconda is the leading Open Data Science platform powered by Python, the fastest-growing open data science language

Accelerate, Connect & Empower

Page 3: Easy Data Science Deployment with the Anaconda Platform


Quickly Engage with Your Data

Anaconda: Modern, Open Data Science Platform powered by Python

– 730+ Popular Python & R packages

– Compiled for Windows, Mac, and Linux

– Package Distribution is free for everyone

– Foundation of our Enterprise Platform

– Extensible via Conda Package Manager

– Easily sandbox and deploy packages & analytical computing environments

Page 4: Easy Data Science Deployment with the Anaconda Platform


Anaconda is Trusted by Industry Leaders

Financial Services

• Risk management, quant modeling, data exploration and processing, algorithmic trading, compliance reporting

Government

• Fraud detection, data crawling, web & cyber data analytics, statistical modeling

Healthcare & Life Sciences

• Genomics data processing, cancer research, natural language processing for health data science

High Tech

• Customer behavior, recommendations, ad bidding, retargeting, social media analytics

Retail & CPG

• Engineering simulation, supply chain modeling, scientific analysis

Oil & Gas

• Pipeline monitoring, noise logging, seismic data processing, geophysics

Page 5: Easy Data Science Deployment with the Anaconda Platform


Conda: Package and Environment Management

conda (Windows, Mac OS X, Linux)

– Install packages

– Update packages

– Create sandboxes: Conda environments

– Conda environments: Critical for reproducibility, collaboration & scale

Env 1: Python 2.7

Env 2: Python 3.4, Pandas v0.18, Jupyter

Env 3: R, R Essentials

Other package versions shown in the diagram: NumPy v1.11, NumPy v1.10, Pandas v0.16
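The environments on this slide (Python 2.7, Python 3.4 with Jupyter, and R) could each be written down as a conda environment file. The sketch below renders such files as text. The environment names `env1` to `env3`, the helper `environment_yaml`, and the assignment of the NumPy and Pandas versions to particular environments are assumptions made for illustration; the slide does not state which version belongs to which environment.

```python
def environment_yaml(name, dependencies):
    """Render a minimal conda environment file (name plus dependencies)."""
    lines = ["name: " + name, "dependencies:"]
    lines.extend("  - " + dep for dep in dependencies)
    return "\n".join(lines) + "\n"


# Assumed mapping of the slide's package versions to environments;
# the names env1..env3 are hypothetical.
ENVIRONMENTS = {
    "env1": ["python=2.7", "numpy=1.10", "pandas=0.16"],
    "env2": ["python=3.4", "numpy=1.11", "pandas=0.18", "jupyter"],
    "env3": ["r-essentials"],
}

if __name__ == "__main__":
    # Print one environment file per sandbox.
    for env_name, deps in ENVIRONMENTS.items():
        print(environment_yaml(env_name, deps))
```

Each rendered text could be saved to a file such as `env1.yml` and passed to `conda env create -f env1.yml`, which keeps the sandboxes reproducible across machines.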

Page 6: Easy Data Science Deployment with the Anaconda Platform


Continuum Sponsored Open-Source Projects

• Bokeh – Interactive web visualizations and applications

• Dask – Painless distributed and parallel computations in Python

• Numba – JIT for Python applications

• Jupyter, Spyder – Notebooks and IDE for data science

• Pandas, Datashader, Blaze, …

Page 7: Easy Data Science Deployment with the Anaconda Platform


Anaconda Enterprise Components

Layers shown on the slide: OPEN DATA SCIENCE, DATA SCIENCE GOVERNANCE, DATA SCIENCE COLLABORATION, DATA SCIENCE FOR BIG DATA

Anaconda

• High performance Python & R

• 720+ data science packages

• Cross-platform package, dependency & environments

• Community driven package repository collaboration

Anaconda Navigator

• Desktop Portal & Installer

Anaconda Repository

• Storage & sharing of packages, environments, notebooks

• On-premise governance

• Enterprise authentication

Anaconda

• Deep Learning: Theano, Tensorflow, Caffe, Keras, Neon, Lasagne

• Natural Language Processing: NLTK, spaCy

• Machine Learning: Scikit-learn

• GPU enablement

Anaconda Enterprise Notebooks

• Collaborative project based workflows for Python & R

• Enterprise authentication & permissioning

• Notebook sharing, versioning, search, differencing

Anaconda

• Interactive browser based dashboards & visualizations with Bokeh

• Bokeh apps using Python, R, Scala

Anaconda Scale

• Hadoop & Spark integration

• Scalable distributed processing framework

• Integration with resource management & data stores

• Distributed package, dependency & environments

Anaconda Fusion

• Integration of Open Data Science with Microsoft Excel®

• Big Data querying & transformations

Page 8: Easy Data Science Deployment with the Anaconda Platform


Anaconda Enterprise

Open Source Without Anxiety: Governance and Scalability

On-premises package repository

– Governance for your analytics environment

– Empower your data scientists within the structure of enterprise IT

Enterprise notebook collaboration

– Easily replicate and share analysts’ environments

– Centrally store proprietary libraries and manage versioning

Scalable analytics computations

– Scale up: leverage GPU and parallel-optimized libraries

– Scale out: easily manage Anaconda across your Hadoop/Spark cluster

– Scale up and out with Python and R

Enterprise data science deployment

– Encapsulate and deploy data science projects

– Deploy live notebooks, dashboards, interactive applications, and models with REST APIs

Page 9: Easy Data Science Deployment with the Anaconda Platform


Scaling Out with Anaconda

Page 10: Easy Data Science Deployment with the Anaconda Platform


Anaconda - Scaled Out Open Data Science

Application and Visualization: Jupyter Notebook, Matplotlib, seaborn, Bokeh, etc.

Analytics: pandas, NumPy, SciPy, Numba, scikit-learn, NLTK, scikit-image, PIL, and more

Computation: PySpark, SparkR, Dask, Distributed

Data and Resource Management: HDFS, NFS, YARN, SGE, SLURM

Servers: Bare-metal or Cloud-based Cluster

Diagram side labels: Anaconda, Cluster

Page 11: Easy Data Science Deployment with the Anaconda Platform


Spectrum of Parallelization

From explicit control (fast but low-level) to implicit control (restrictive but easy):

• Threads, Processes

• MPI, ZeroMQ

• Dask

• Hadoop, Spark

• SQL: Hive, Pig, Impala
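The two ends of this spectrum can be illustrated with the Python standard library alone. The sketch below computes the same sum of squares twice: once with hand-managed threads and a lock (explicit control), and once by handing a function to a thread pool (implicit control). The function names are hypothetical, and the example is about how much coordination the programmer writes by hand, not about speed; in CPython, threads do not accelerate CPU-bound work like this.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(n):
    return n * n


def sum_squares_explicit(numbers, workers=4):
    """Explicit control: create threads, split the work, guard shared state."""
    total = [0]
    lock = threading.Lock()

    def work(chunk):
        partial = sum(square(n) for n in chunk)
        with lock:  # protect the shared accumulator
            total[0] += partial

    # Round-robin split of the input into one chunk per thread.
    chunks = [numbers[i::workers] for i in range(workers)]
    threads = [threading.Thread(target=work, args=(chunk,)) for chunk in chunks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total[0]


def sum_squares_implicit(numbers, workers=4):
    """Implicit control: give the pool a function and let it schedule the calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(square, numbers))
```

Both functions return the same value for the same input; the second leaves chunking, scheduling, and result collection to the library, which is the trade-off the slide describes.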

Page 12: Easy Data Science Deployment with the Anaconda Platform


Scaling Out with Anaconda and Spark

• Using Anaconda with Spark is:

– Extensible: Use libraries from Anaconda with PySpark and SparkR jobs

– Integrated: Use interactive notebooks with data in HDFS and on YARN clusters

– Secure: Works with Kerberized Hadoop clusters

– Scalable: Map pandas, NumPy, SciPy jobs on large clusters and data sets

– Seamless: Works with Cloudera CDH, Hortonworks HDP, and other enterprise Hadoop distributions

• Anaconda dramatically simplifies the installation and management of popular Python and R packages and their dependencies.

Page 13: Easy Data Science Deployment with the Anaconda Platform


Other ways to Scale Out with Anaconda

• Anaconda integrates with:

– Spark (PySpark, SparkR) and other Hadoop components, including YARN, HDFS, Hive, Impala, and more

– Dask, Distributed, knit, dask-ec2, hdfs3, fastparquet

– CSV, SQL, JSON, HDF5, Parquet, etc.

– Amazon Web Services, Microsoft Azure, Google Cloud Platform

– Streaming analytics: Streamparse for Apache Storm, Spark Streaming, Kafka, Python integration with ELK

• Anaconda Technology Partners:

– Cloudera

– Hortonworks

– IBM

– H2O

– Docker

– … and more

Page 14: Easy Data Science Deployment with the Anaconda Platform


Scaling Out with Anaconda

Without Anaconda Scale

Head Node

1. Manually install Python, packages & dependencies

2. Manually install R, packages & dependencies

Compute Nodes

1. Manually install Python, packages & dependencies

2. Manually install R, packages & dependencies

With Anaconda Scale

Head Node and Compute Nodes

Easily install Anaconda with performance optimized Python and R packages and manage environments across all nodes in a cluster

Page 15: Easy Data Science Deployment with the Anaconda Platform


Scaling Out with Anaconda – Example Use Cases

Analyzing text, tabular, or array data using Dask

• Use Pandas dataframes or NumPy arrays at scale

• Work with data in different formats and data stores

Distributed natural language processing with text data using PySpark

• Explore data using a distributed memory cluster

• Interactively query and analyze data using libraries from Anaconda

Distributed machine learning workflows with Dask, Spark, H2O, Tensorflow, and more

• Work interactively and collaboratively in notebooks

• Simplify installation and management of ML libraries and dependencies

Handle custom code and workflows using Dask

• Work with custom data formats

• Construct complex pipelines including ETL and flexible computations
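The distributed text-processing use case above follows a split, map, and merge pattern. The sketch below shows that pattern in plain Python for a word count. It does not use PySpark or Dask; it is a single-machine stand-in, with hypothetical function names, for the steps those frameworks would run across the nodes of a cluster.

```python
from collections import Counter


def partition(lines, n_partitions):
    """Split the input into partitions, as a cluster would across workers."""
    return [lines[i::n_partitions] for i in range(n_partitions)]


def count_words(lines):
    """Map step: count lowercase, whitespace-separated tokens in one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(counters):
    """Reduce step: combine per-partition counts into one result."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


def word_count(lines, n_partitions=3):
    """Run the map step on every partition, then merge the partial counts."""
    return merge(count_words(part) for part in partition(lines, n_partitions))
```

Because each partition is counted independently and the merge is order-insensitive, the result does not depend on how many partitions are used, which is what makes this kind of job easy to spread over many workers.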

Page 16: Easy Data Science Deployment with the Anaconda Platform


Productionizing and Deploying Data Science Projects

Page 17: Easy Data Science Deployment with the Anaconda Platform


Productionizing Data Science Projects

Page 18: Easy Data Science Deployment with the Anaconda Platform


Deploying Data Science Projects - Notebooks

Page 19: Easy Data Science Deployment with the Anaconda Platform


Deploying Data Science Projects - Dashboards

Page 20: Easy Data Science Deployment with the Anaconda Platform


Deploying Data Science Projects – Interactive Applications

Page 21: Easy Data Science Deployment with the Anaconda Platform


Deploying Data Science Projects – Models with REST APIs

Machine Learning Pipeline: Load Data, Clean Data, Anomaly Detection, Regression, Clustering

Deployed Applications: Models with REST APIs, Dashboards, Reports, Interactive Applications

Developers and data scientists can build additional layers of visualizations, dashboards, or interactive applications that consume data from API endpoints.
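To make the idea of a model behind an API endpoint concrete, here is a minimal sketch using only the Python standard library. It is not how Anaconda Enterprise deploys models; the `/predict` path, the JSON shape, and the toy `predict` function (standing in for a trained regression model) are all assumptions for illustration.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def predict(x):
    """Hypothetical stand-in for a trained regression model: y = 2x + 1."""
    return 2.0 * x + 1.0


def model_app(environ, start_response):
    """WSGI application exposing the model at POST /predict."""
    if environ["REQUEST_METHOD"] != "POST" or environ["PATH_INFO"] != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))
    body = json.dumps({"prediction": predict(payload["x"])}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_in_process(x):
    """Call the endpoint without a network socket, as a test client would."""
    raw = json.dumps({"x": x}).encode("utf-8")
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI keys
    environ.update({
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/predict",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    })
    statuses = []
    chunks = model_app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks).decode("utf-8"))
```

To serve the same application over HTTP, it could be passed to `wsgiref.simple_server.make_server("127.0.0.1", 8000, model_app)` followed by `serve_forever()`. A dashboard or interactive application would then consume the JSON response from that endpoint.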

Page 22: Easy Data Science Deployment with the Anaconda Platform


Scalable and Deployable Data Science

… with Anaconda and Anaconda Enterprise, including:

• Scaled-up Analytics: Develop and deploy the same code/environments on your local machine and a cluster

• Environment management: Dynamically manage Python, R, dependencies and other conda packages and environments across a cluster

• Collaboration: Easily share versioned notebooks and projects across users and replicate analysts’ environments for different jobs/users/groups

• Hadoop integration: Support for Hadoop, Spark and other distributed workflows; compatible with enterprise Hadoop distributions

Page 23: Easy Data Science Deployment with the Anaconda Platform


Data Science Deployment in Anaconda Enterprise

Page 24: Easy Data Science Deployment with the Anaconda Platform


With Anaconda Enterprise life just got a whole lot easier…

• ENSURES AVAILABILITY, UPTIME, & MONITORING

• PROVISIONS COMPUTE RESOURCES

• MANAGES DEPENDENCIES & ENVIRONMENTS

• SHARE COMPUTE RESOURCES

• SECURE NETWORK COMMUNICATIONS & SSL

• SECURE DATA & NETWORK CONNECTIVITY

• ENGINEER FOR SCALABILITY

• MANAGE AUTHENTICATION & ACCESS CONTROL

• SCHEDULE REGULAR EXECUTION OF JOBS

Learn more: https://www.continuum.io/blog/developer-blog/productionizing-deploying-data-science-projects

Page 25: Easy Data Science Deployment with the Anaconda Platform


Anaconda Fusion

Page 26: Easy Data Science Deployment with the Anaconda Platform


Anaconda Fusion brings Open Data Science to Microsoft Excel


• BRING interactive visualizations, machine learning and ETL to Excel

• BRIDGE Excel Data to Python & R through notebooks

• ACCESS all the power of Python and Big Data, natively embedded inside Excel

Page 27: Easy Data Science Deployment with the Anaconda Platform


Empowering Business Analysts & Data-driven Employees

• Anaconda Fusion is a Microsoft Excel® Add-in that enables a unique and simple link between Excel and Python without writing code

• Anaconda Fusion is targeted to Business Analysts who want “No Code” Data Science

Page 28: Easy Data Science Deployment with the Anaconda Platform

Analysts and Data Scientists can keep using their preferred tools

Page 29: Easy Data Science Deployment with the Anaconda Platform


“No Code” Data Science – Data Loading Example

1. Select Anaconda Fusion Notebook and click “Upload”

2. Select function you wish to run

3. Click “Run”

4. Data is loaded into spreadsheet

Page 30: Easy Data Science Deployment with the Anaconda Platform


Just change one line of code in your notebook

Page 31: Easy Data Science Deployment with the Anaconda Platform

Anaconda Fusion Use Cases

• Extract data – pull data directly into Excel to perform analysis

• Machine Learning – use trained models created by Data Scientists and plug them into your spreadsheet data

• Interactive Visualizations – create custom advanced interactive graphs, charts and plots from Excel data

• Big Data – analyze, transform, model and query data stored in Hadoop and Spark

Figure: Anaconda Fusion on Mac