Continuum Analytics and Python
Travis Oliphant, CEO & Co-Founder, Continuum Analytics

Upload: travis-oliphant

Post on 17-Jan-2017


TRANSCRIPT

Page 1: Continuum Analytics and Python

Continuum Analytics and Python

Travis Oliphant CEO, Co-Founder Continuum Analytics

Page 2: Continuum Analytics and Python

ABOUT CONTINUUM

2

Page 3: Continuum Analytics and Python

Travis Oliphant - CEO

3

• PhD 2001 from Mayo Clinic in Biomedical Engineering
• MS/BS degrees in Elec. Comp. Engineering
• Creator of SciPy (1999-2009)
• Professor at BYU (2001-2007)
• Author of NumPy (2005-2012)
• Started Numba (2012)
• Founding Chair of NumFOCUS / PyData
• Previous PSF Director

SciPy

Page 4: Continuum Analytics and Python

Started as a Scientist / Engineer

4

Images from BYU CERS Lab

Page 5: Continuum Analytics and Python

Science led to Python

5

Raja Muthupillai

Armando Manduca

Richard Ehman Jim Greenleaf

1997

\rho_0 (2\pi f)^2 U_i(a, f) = [C_{ijkl}(a, f)\, U_{k,l}(a, f)]_{,j}, \qquad \Xi = \nabla \times U

Page 6: Continuum Analytics and Python

“Distractions” led to my calling

6

Page 7: Continuum Analytics and Python

7

Data Science

Page 8: Continuum Analytics and Python

8

• Data volume is growing exponentially within companies. Most don't know how to harvest its value or how to really compute on it.

• Growing mess of tools, databases, and products. New products increase integration headaches, instead of simplifying.

• New hardware & architectures are tempting, but are avoided or sit idle because of software challenges.

• Promises of the "Big Red Solve" button continue to disappoint. (If someone can give you this button, they are your competitor.)

Data Scientist Challenges

Page 9: Continuum Analytics and Python

Our Solution

9

• A language-based platform is needed. No simple point-and-click app is enough to solve these business problems, which require advanced modeling and data exploration.

• That language must be powerful, yet still be accessible to domain experts and subject matter experts.

• That language must leverage the growing capability and rapid innovation in open-source.

Anaconda Platform: Enterprise platform for Data Exploration, Advanced Analytics, and Rapid Application Development and Deployment (RADD)

Harnesses the exploding popularity of the Python ecosystem that our principals helped create.

Page 10: Continuum Analytics and Python

Why Python?

10

Analyst
• Uses graphical tools
• Can call functions, cut & paste code
• Can change some variables
Gets paid for: Insight
Tools: Excel, VB, Tableau, ... Python

Analyst / Data Developer
• Builds simple apps & workflows
• Used to be "just an analyst"
• Likes coding to solve problems
• Doesn't want to be a "full-time programmer"
Gets paid (like a rock star) for: Code that produces insight
Tools: SAS, R, Matlab, ... Python

Programmer
• Creates frameworks & compilers
• Uses IDEs
• Degree in CompSci
• Knows multiple languages
Gets paid for: Code
Tools: C, C++, Java, JS, ... Python

(Python serves all three roles.)

Page 11: Continuum Analytics and Python

Python Is Sweeping Education

11

Page 12: Continuum Analytics and Python

Tools used for Data

12

Source: O’Reilly Strata attendee survey 2012 and 2013

Page 13: Continuum Analytics and Python

Python for Data Science

13

http://readwrite.com/2013/11/25/python-displacing-r-as-the-programming-language-for-data-science

Page 14: Continuum Analytics and Python

Python is the top language in schools!

14

Page 15: Continuum Analytics and Python

OUR CUSTOMERS & OUR MARKET

15

Page 16: Continuum Analytics and Python

Some Users

16

Page 17: Continuum Analytics and Python

Anaconda: Game-changing Python distribution

17

"Hands down, the best Scientific Python distribution products for analytics and machine learning."

"Thanks for such a great product, been dabbling in python for years (2002), always knew it was going to be massive as it hit the sweet spot in so many ways, with llvm-numba its magic!!!"

"It was quick and easy and had everything included, even a decent IDE. I'm so grateful it brings tears to my eyes."

"I love Anaconda more than the sun, moon, and stars…"

Page 18: Continuum Analytics and Python

Anaconda: Game-changing Python distribution

18

• 2 million downloads in last 2 years
• 200k / month and growing
• conda package manager serves up 5 million packages per month
• Recommended installer for IPython/Jupyter, Pandas, SciPy, Scikit-learn, etc.

Page 19: Continuum Analytics and Python

Conferences & Community

19

• PyData: London, Berlin, New York, Bay Area, Seattle
• Strata: Tutorials, PyData Track
• PyCon, SciPy, EuroSciPy, EuroPython, PyCon Brasil...
• Spark Summit
• JSM, SIAM, IEEE Vis, ICML, ODSC, SuperComputing

Page 20: Continuum Analytics and Python

Observations & Off-the-record

20

• Hype is out of whack with reality
• Dashboards are "old school" BI, but still important for narrative/confirmatory analysis
• Agility of data engineering and exploration is critical
  • Get POCs out faster; iterate faster on existing things
• Need cutting-edge tools, but production is hard
• Notebooks, reproducibility, provenance - all matter

Page 21: Continuum Analytics and Python

http://tuulos.github.io/sf-python-meetup-sep-2013/#/

Page 22: Continuum Analytics and Python

Data Science Platforms

22

• All-in-one new "platform" startups are walled gardens
• Cloud vendor native capabilities are all about lock-in: "Warehouse all your data here!"
• Machine Learning and Advanced Analytics is too early, and disrupting too fast, to place bets on any single walled garden.
  • Especially since most have no experience with "exotic" regulatory and security requirements

Page 23: Continuum Analytics and Python

Good News

23

You can have a modern, advanced analytics system that integrates well with your infrastructure.

Bad News
It's not available as a SKU from any vendor.

META-PLATFORM CONSTRUCTION KIT

Page 24: Continuum Analytics and Python

Great News

24

• If done well, it adds deep, fundamental business capability
• Many Wall Street banks and firms are using this.
• All major Silicon Valley companies know this: Facebook, LinkedIn, Uber, Tesla, SpaceX, Netflix, ...

Page 25: Continuum Analytics and Python

EXAMPLE PROJECTS

25

Page 26: Continuum Analytics and Python

26

Bitcoin Dataset
Bitcoin is a digital currency invented in 2008 that operates on a peer-to-peer system for transaction validation. This decentralized currency attempts to mimic physical currencies: there is a limited supply of Bitcoins in the world, each Bitcoin must be "mined", and each transaction can be verified for authenticity. Bitcoins are used to exchange everyday goods and services, but they also have known ties to black markets, illicit drugs, and illegal gambling transactions. The dataset is also strongly inclined toward anonymization of behavior, though true anonymization is rarely achieved.

The Bitcoin Dataset
The Bitcoin dataset was obtained from http://compbio.cs.uic.edu/data/bitcoin/ and captures transaction-level information. For each transaction, there can be multiple senders and multiple receivers, as detailed here: https://en.bitcoin.it/wiki/Transactions. This dataset presents a challenge in that multiple addresses are usually associated with a single entity or person. However, some initial work has been done to associate keys with a single user by looking at transactions that are related to each other (for example, if a transaction has multiple public keys as input, then a single user owns all of the corresponding private keys). The dataset provides these known associations by grouping addresses together under a single UserId (which then maps to the set of all associated addresses).

Key Challenge Questions

1. Can we detect bulk Bitcoin thefts by hackers? Can we track where the money went after thefts?
2. Can we detect illicit transactions based on Bitcoin transaction behavior? What sort of graph patterns emerge?
3. Can we detect attempts at money laundering (called a "mixing service" in Bitcoin)?
   a. Can we detect money laundering attempts and the people who use them? Note: current Bitcoin mixing services tend to mix Bitcoins amongst all the people who bother to use a mixing service - so does the mixing service actually obfuscate anything?
   b. Can we trace back the originator of these laundering attempts?
4. Can we detect currency manipulation (hackers trying to destabilize Bitcoin currency exchanges to deflate prices)?
5. Is Bitcoin gaining or losing traction among the regular population for use as a regular digital currency?
6. It is Bitcoin best practice to generate and use a new address with every transaction. Is this practice followed? If not, what can we learn from this?
7. Can we identify and extract organizational behavior amidst the Bitcoin transactions?
8. Can we determine which Bitcoin addresses belong to a single entity? While the initial pass over the data has yielded some resolution of entities, can we further improve this mapping?

Bitcoin Data Set Overview (May 15, 2013)
# Transactions: 15.8 Million+
# Edges: 37.4 Million+
# Senders: 5.4 Million+
# Receivers: 6.3 Million+
# Bitcoins Transacted: 1.4 Million+

Figure 1: Bitcoin Transactions Over Time

Bitcoin Blockchain
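The graph questions above all start from the same representation: transactions expanded into sender-to-receiver edges. A minimal sketch in plain Python, using entirely hypothetical rows (the real dataset's schema and UserIds differ):

```python
from collections import defaultdict

# Hypothetical rows: (tx_id, sender_user_id, receiver_user_id, amount_btc).
# Real transactions can have many senders and receivers, so one transaction
# expands into several edges, as in the dataset described above.
transactions = [
    (1, "u1", "u2", 0.5),
    (1, "u1", "u3", 0.2),
    (2, "u2", "u3", 0.4),
    (3, "u3", "u1", 0.1),
]

# Build a directed multigraph as an adjacency list of (receiver, amount) pairs.
graph = defaultdict(list)
for tx_id, sender, receiver, amount in transactions:
    graph[sender].append((receiver, amount))

# Simple per-user features: out-degree and total BTC sent -- the kind of
# starting point the mixing-service and bulk-theft questions build on.
out_degree = {u: len(edges) for u, edges in graph.items()}
total_sent = {u: sum(a for _, a in edges) for u, edges in graph.items()}
```

At dataset scale (37.4M+ edges) the same structure would live in an out-of-core store rather than a Python dict, but the features are computed the same way.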

Page 27: Continuum Analytics and Python

Microcap Stock Fraud

27

Page 28: Continuum Analytics and Python

Memex Dark/Deep Web Analytics

28

Page 29: Continuum Analytics and Python

TECH & DEMOS

29

Page 30: Continuum Analytics and Python

Data Science @ NYT

30

@jakevdp

eSciences Institute, Univ. Washington

Page 31: Continuum Analytics and Python

31

• conda: cross-platform, multi-language package & container tool
• bokeh: interactive web plotting for Python, R; no JS/HTML required
• numba: JIT compiler for Python & NumPy, using LLVM; supports GPUs
• blaze: deconvolve data, expression, and computation; data-web
• dask: lightweight, fast, Pythonic scheduler for medium data
• xray: easily handle heterogeneously-shaped dense arrays
• holoviews: slice & view dense cubes of data in the browser
• seaborn: easy, beautiful, powerful statistical plotting
• beaker: polyglot alternative Notebook-like project

Page 32: Continuum Analytics and Python

32

• Databricks Canvas • Graphlab Create • Zeppelin • Beaker • Microsoft AzureML • Domino • Rodeo? Sense? • H2O, DataRobot, ...

Notebooks Becoming Table Stakes

Page 33: Continuum Analytics and Python

33

Page 34: Continuum Analytics and Python

34

"With more than 200,000 Jupyter notebooks already on GitHub we're excited to level-up the GitHub-Jupyter experience."

Page 35: Continuum Analytics and Python

Anaconda

35

✦ Centralized analytics environment • browser-based interface • deploys on existing infrastructure

✦ Collaboration • cross-functional teams using same data and software

✦ Publishing • code • data • visualizations

Page 36: Continuum Analytics and Python

Bokeh

36

http://bokeh.pydata.org

• Interactive visualization • Novel graphics • Streaming, dynamic, large data • For the browser, with or without a server • No need to write Javascript
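Assuming Bokeh is installed, a minimal sketch of the "no JavaScript" point: a plot defined entirely in Python and saved as self-contained HTML (the output file name here is arbitrary).

```python
from bokeh.plotting import figure
from bokeh.io import output_file, save

# A minimal static plot; the same figure objects can also be served live
# by bokeh-server for streaming/dynamic data.
output_file("lines.html")  # arbitrary output path for this sketch
p = figure(title="Simple line example", x_axis_label="x", y_axis_label="y")
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
save(p)  # writes self-contained HTML -- no JavaScript written by hand
```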

Page 37: Continuum Analytics and Python

Versatile Plots

37

Page 38: Continuum Analytics and Python

Novel Graphics

38

Page 39: Continuum Analytics and Python

Previous: Javascript code generation

39

[Diagram: server.py renders plot.js.template into a JS string ("<d3.js> <highchart.js> <etc.js>") that ships to the browser's app model as HTML, targeting libraries like D3, Highcharts, Flot, Crossfilter, etc. One-shot; no MVC interaction; no data streaming.]

Page 40: Continuum Analytics and Python

bokeh.py & bokeh.js

40

[Diagram: server.py builds a bokeh.py object graph; bokeh-server serializes it to JSON; BokehJS reconstructs the object graph in the browser and renders it into the app model.]

Page 41: Continuum Analytics and Python

41

Page 42: Continuum Analytics and Python

42

4GB Interactive Web Viz

Page 43: Continuum Analytics and Python

rBokeh

http://hafen.github.io/rbokeh

Page 44: Continuum Analytics and Python

44

Page 47: Continuum Analytics and Python

47

http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00 - intro.ipynb#Interaction

Page 48: Continuum Analytics and Python

Additional Demos & Topics

48

• Airline flights • Pandas table • Streaming / Animation • Large data rendering

Page 49: Continuum Analytics and Python

49

Latest Cosmological Theory

Page 50: Continuum Analytics and Python

50

Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional data store

Page 51: Continuum Analytics and Python

51

Dark Data: CSV, hdf5, npz, logs, emails, and other files in your company outside a traditional data store

Page 52: Continuum Analytics and Python

52

Database Approach

Data Sources

Data Store

Data Sources

Clients

Page 53: Continuum Analytics and Python

53

Bring the Database to the Data

Data Sources

Data Sources

Clients
Blaze (datashape, dask)
NumPy, Pandas, SciPy, sklearn, etc. (for analytics)

Page 54: Continuum Analytics and Python

Anaconda — portable environments

54

PYTHON & R OPEN SOURCE ANALYTICS

NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages

conda

Easy to install Quick & agile data exploration Powerful data analysis Simple to collaborate Accessible to all

Page 55: Continuum Analytics and Python

55

• Infrastructure for meta-data, meta-compute, and expression graphs/dataflow
• Data glue for scale-up or scale-out
• Generic remote computation & query system
• (NumPy+Pandas+LINQ+OLAP+PADL).mashup()

Blaze is an extensible interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning.

http://blaze.pydata.org

Blaze
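The "expressions enable high-level reasoning" idea can be sketched without Blaze at all: build a symbolic expression tree first, then evaluate it later against a concrete backend. This toy is illustrative only and is not the real Blaze API.

```python
# Illustrative only: a toy deferred-expression tree in plain Python, showing
# the "expressions first, execution later" pattern that Blaze generalizes
# across backends (NumPy, Pandas, SQL, Spark, ...).

class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Expr("add", self, other)
    def __mul__(self, other):
        return Expr("mul", self, other)

class Symbol(Expr):
    def __init__(self, name):
        super().__init__("sym", name)

def compute(expr, scope):
    """Walk the expression tree against a concrete backend (here: a dict)."""
    if expr.op == "sym":
        return scope[expr.args[0]]
    left, right = (compute(a, scope) for a in expr.args)
    return left + right if expr.op == "add" else left * right

t = Symbol("t")
result = compute(t * t + t, {"t": 3})   # 3*3 + 3 = 12
```

Because the tree exists before execution, a real system can inspect it, optimize it, and dispatch it to whichever runtime holds the data.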

Page 56: Continuum Analytics and Python

56

Blaze

?

Page 57: Continuum Analytics and Python

57

Expressions, Metadata, Runtime

Blaze

Page 58: Continuum Analytics and Python

58

Blaze
• Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
• Metadata: datashape, dtype, shape, stride
• Storage formats: hdf5, json, csv, xls, protobuf, avro, ...
• Backends: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...

Page 59: Continuum Analytics and Python

59

Data Runtime Expressions
• Expressions: blaze
• Metadata: datashape
• Data / runtime: numpy, pandas, sql DB, spark
• Storage: castra, bcolz (moved via odo)
• Parallel / optimized: dask, numba, DyND

Page 60: Continuum Analytics and Python

60

Data Runtime Expressions
• APIs, syntax, language: blaze
• Metadata: datashape
• Storage/containers: odo
• Compute: dask (parallelize, optimize, JIT)

Page 61: Continuum Analytics and Python

61

Page 62: Continuum Analytics and Python

62

Blaze Server
Provide a RESTful web API over any data supported by Blaze.

Server side:

>>> my_spark_rdd = …
>>> from blaze import Server
>>> Server(my_spark_rdd).run()
Hosting computation on localhost:6363

Client side:

$ curl -H "Content-Type: application/json" \
    -d '{"expr": {"op": "sum", "args": [ ... ]}}' \
    my.domain.com:6363/compute.json

• Quickly share local data to collaborators on the web.

• Expose any system (Mongo, SQL, Spark, in-memory) simply

• Share local computation as well, sending computations to server to run remotely.

• Conveniently drive remote server with interactive Blaze client
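The shape of the request/response cycle above can be sketched with a toy evaluator for the JSON expression payload. This is illustrative only; the real Blaze server wire format is richer than a flat op/args pair.

```python
import json

# Toy evaluator for JSON expression payloads like the curl example above:
# the client ships an expression, the server computes it over local data.
OPS = {
    "sum": sum,
    "max": max,
    "min": min,
}

def compute_json(payload: str):
    expr = json.loads(payload)["expr"]
    return OPS[expr["op"]](expr["args"])

result = compute_json('{"expr": {"op": "sum", "args": [1, 2, 3, 4]}}')  # 10
```

In the real system the args would name server-side datasets (Mongo, SQL, Spark, in-memory) rather than carrying literal values.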

Page 63: Continuum Analytics and Python

63

Dask: Out-of-Core Scheduler
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python

Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler coming
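Dask's core representation is a plain dict mapping keys to either values or (function, dependency-keys...) tuples. A minimal sketch of that idea with a tiny recursive "scheduler" (dask's real schedulers add ordering, caching, and parallel execution on top of essentially this structure):

```python
from operator import add, mul

# A task graph as a plain dict: values, or (function, arg-keys...) tuples.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),      # z = x + y
    "w": (mul, "z", "z"),      # w = z * z
}

def get(dsk, key):
    """Tiny single-threaded scheduler: resolve a key by recursing on deps."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *deps = task
        return func(*(get(dsk, d) for d in deps))
    return task

result = get(graph, "w")   # (1 + 2) * (1 + 2) = 9
```

Blocked algorithms (next slides) just generate larger graphs of this same shape, one task per chunk.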

Page 64: Continuum Analytics and Python

Example: Ocean Temp Data

64

• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html

• Every 1/4 degree, 720x1440 array each day

Page 65: Continuum Analytics and Python

Bigger data...

65

36 years: 720 x 1440 x 12341 x 4 bytes = 51 GB uncompressed.
If you don't have this much RAM...

... better start chunking.
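"Chunking" here means a blocked reduction: visit one chunk at a time and combine partial results, never holding the full 51 GB in RAM. A small sketch with NumPy (tiny stand-in arrays instead of the daily 720 x 1440 grids); this is what dask.array automates.

```python
import numpy as np

def chunked_mean(make_chunk, n_chunks):
    """Blocked mean: accumulate (sum, count) per chunk, combine at the end."""
    total, count = 0.0, 0
    for i in range(n_chunks):
        chunk = make_chunk(i)          # e.g. one day's 720 x 1440 grid
        total += chunk.sum()
        count += chunk.size
    return total / count

# Small stand-in for the daily grids: day i is filled with the value i.
days = [np.full((4, 6), float(i)) for i in range(10)]
mean = chunked_mean(lambda i: days[i], len(days))   # mean of 0..9 = 4.5
```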

Page 66: Continuum Analytics and Python

DAG of Computation

66

Page 67: Continuum Analytics and Python

Simple Architecture

67

Page 68: Continuum Analytics and Python

Core Concepts

68

Page 69: Continuum Analytics and Python

dask.array: OOC, parallel, ND array

69

• Arithmetic: +, *, ...
• Reductions: mean, max, ...
• Slicing: x[10:, 100:50:-2]
• Fancy indexing: x[:, [3, 1, 2]]
• Some linear algebra: tensordot, qr, svd
• Parallel algorithms (approximate quantiles, topk, ...)
• Slightly overlapping arrays
• Integration with HDF5
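dask.array deliberately mirrors the NumPy interface, so the operations listed above look identical on a plain NumPy array (dask only adds chunking and a final .compute()). Assuming NumPy is installed:

```python
import numpy as np

x = np.arange(200 * 200).reshape(200, 200).astype(float)

y = x + 1                       # arithmetic
m = x.mean()                    # reductions
s = x[10:, 100:50:-2]           # slicing
f = x[:, [3, 1, 2]]             # fancy indexing
t = np.tensordot(x, x, axes=1)  # some linear algebra
q, r = np.linalg.qr(x)
```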

Page 70: Continuum Analytics and Python

dask.dataframe: OOC, parallel dataframe

70

• Elementwise operations: df.x + df.y
• Row-wise selections: df[df.x > 0]
• Aggregations: df.x.max()
• groupby-aggregate: df.groupby(df.x).y.max()
• Value counts: df.x.value_counts()
• Drop duplicates: df.x.drop_duplicates()
• Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
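dask.dataframe implements a pandas-compatible subset, so each operation above has a direct pandas counterpart (dask only adds partitioning and .compute()). Assuming pandas is installed, with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, -2, 3, 3], "y": [10, 20, 30, 40]})

elementwise = df.x + df.y                 # elementwise operations
selected = df[df.x > 0]                   # row-wise selections
max_x = df.x.max()                        # aggregations
grouped = df.groupby(df.x).y.max()        # groupby-aggregate
counts = df.x.value_counts()              # value counts
deduped = df.x.drop_duplicates()          # drop duplicates

df2 = pd.DataFrame({"z": [100, 200]}, index=[0, 1])
joined = pd.merge(df, df2, left_index=True, right_index=True)  # join on index
```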

Page 71: Continuum Analytics and Python

More Complex Graphs

71

cross validation

Page 72: Continuum Analytics and Python

72

http://continuum.io/blog/xray-dask

Page 73: Continuum Analytics and Python

73

from dask import dataframe as dd

columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')

starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]

locs = dd.compute(starbucks.Longitude, starbucks.Latitude,
                  dunkin.Longitude, dunkin.Latitude)

# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)

m = draw_USA()
# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')

# Plot the values in Starbucks green and Dunkin' Donuts orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True, label="Starbucks",
          color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True, label="Dunkin' Donuts",
          color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);

Page 74: Continuum Analytics and Python

74

• Dynamic, just-in-time compiler for Python & NumPy
• Uses LLVM
• Outputs x86 and GPU code (CUDA, HSA)
• (Premium version is in the Accelerate product)

http://numba.pydata.org

Numba

Page 75: Continuum Analytics and Python

Python Compilation Space

75

Ahead Of Time Just In Time

Relies on CPython / libpython

Cython Shedskin

Nuitka (today) Pythran

NumbaHOPE

Theano

Replaces CPython / libpython

Nuitka (future) Pyston PyPy

Page 76: Continuum Analytics and Python

Example

76

Numba

Page 77: Continuum Analytics and Python

77

@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result

~1500x speed-up

Page 78: Continuum Analytics and Python

Features

78

• Windows, OS X, and Linux
• 32- and 64-bit x86 CPUs and NVIDIA GPUs
• Python 2 and 3
• NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user's system
• < 70 MB to install
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available)

Page 79: Continuum Analytics and Python

How Numba Works

79

[Diagram: a Python function's bytecode plus its runtime argument types flow through Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code, which is cached and then executed.

@jit
def do_math(a, b): ...
>>> do_math(x, y)]
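The first stage of that pipeline, bytecode analysis, can be peeked at with only the standard library: the `dis` module exposes the CPython bytecode that Numba starts from.

```python
import dis

# Numba begins by analyzing a function's CPython bytecode; `dis` shows
# exactly what that raw material looks like.
def do_math(a, b):
    return a * b + 1

instructions = [ins.opname for ins in dis.Bytecode(do_math)]
# e.g. load instructions, binary ops, and a final RETURN_VALUE
```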

Page 80: Continuum Analytics and Python

THE ANACONDA PLATFORM

80

Page 81: Continuum Analytics and Python

Anaconda — portable environments

81

PYTHON & R OPEN SOURCE ANALYTICS

NumPy, SciPy, Pandas, Scikit-learn, Jupyter / IPython, Numba, Matplotlib, Spyder, Numexpr, Cython, Theano, Scikit-image, NLTK, NetworkX, IRKernel, dplyr, shiny, ggplot2, tidyr, caret, nnet and 330+ packages

conda

Easy to install Quick & agile data exploration Powerful data analysis Simple to collaborate Accessible to all

Page 82: Continuum Analytics and Python

82

• cross-platform package manager
• can create sandboxes ("environments"), akin to Windows Portable Applications or WinSxS
• "un-container" for deploying data science/data processing workflows

http://conda.pydata.org

Conda

Page 83: Continuum Analytics and Python

System Package Managers

83

yum (rpm)

apt-get (dpkg)

Linux OSX

macports

homebrew

fink

Windows

chocolatey

npackd

Cross-platform

conda

Page 84: Continuum Analytics and Python

84

• Excellent support for “system-level” environments — like having mini VMs but much lighter weight than docker (micro containers)

• Minimizes code-copies (uses hard/soft links when possible)
• Simple format: binary tarball + metadata
• Metadata allows static analysis of dependencies
• Easy to create multiple "channels", which are repositories for packages
• User-installable (no root privileges needed)
• Integrates very well with pip
• Cross-platform

Conda features

Page 85: Continuum Analytics and Python

Anaconda Cloud: analytics repository

85

• Commercial long-term support
• Licensed for redistribution
• Private, on-premises available
• Proprietary tools for building custom distributions, like Anaconda
• Enterprise tools for managing custom packages and environments
• http://anaconda.org

Page 86: Continuum Analytics and Python

Anaconda Cluster: Anaconda + Hadoop + Spark

86

For data scientists:
• Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner
• Manage Python, R, Java, JS packages across the cluster

For operations & IT:
• Robustly manage runtime state across the cluster (outside the scope of rpm, chef, puppet, etc.)
• Isolate/sandbox packages & libraries for different jobs or groups of users, without introducing the complexity of Docker / virtualization
• Cross-platform: same tooling for laptops, workstations, servers, clusters

Page 87: Continuum Analytics and Python

Cluster Creation

87

$ conda cluster create mycluster --profile=spark_profile
$ conda cluster submit mycluster mycode.py
$ conda cluster destroy mycluster

spark_profile:
  provider: aws_east
  num_nodes: 4
  node_id: ami-3c994355
  node_type: m1.large

aws_east:
  secret_id: <aws_access_key_id>
  secret_key: <aws_secret_access_key>
  keyname: id_rsa.pub
  location: us-east-1
  private_key: ~/.ssh/id_rsa
  cloud_provider: ec2
  security_group: all-open

http://continuumio.github.io/conda-cluster/quickstart.html

Page 88: Continuum Analytics and Python

88

$ conda cluster manage mycluster list
... info -e
... install python=3 pandas flask
... set_env
... push_env <local> <remote>

$ conda cluster ssh mycluster
$ conda cluster run.cmd mycluster "cat /etc/hosts"

Package & environment management:

Easy SSH & remote commands:

http://continuumio.github.io/conda-cluster/manage.html

Cluster Management

Page 89: Continuum Analytics and Python

Anaconda Cluster & Spark

89

# example.py
conf = SparkConf()
conf.setMaster("yarn-client")
conf.setAppName("MY APP")
sc = SparkContext(conf=conf)
# analysis
sc.parallelize(range(1000)).map(lambda x: (x, x % 2)).take(10)

$ conda cluster submit MY_CLUSTER /path/to/example.py

Page 90: Continuum Analytics and Python

Python & Spark in Practice

90

Challenges of real-world usage
• Package management (perennial popular topic in Python)
• Python (& R) are outside the "normal" Java build toolchain
  • bash scripts, spark jobs to pip install or conda install <foo>
  • Kind of OK for batch; terrible for interactive use
• Rapid iteration
• Production vs dev/test clusters
• Data scientist needs vs Ops/IT concerns

Page 91: Continuum Analytics and Python

Fix it twice…

91

PEP 3118: Revising the buffer protocol

Basically the "structure" of NumPy arrays as a protocol in Python itself, establishing a memory-sharing standard between objects. It makes possible a heterogeneous world of powerful array-like objects outside of NumPy that can communicate with each other.

Falls short in not defining a general data description language (DDL).

http://python.org/dev/peps/pep-3118/
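The buffer protocol can be demonstrated with the standard library alone: `array.array` exports its buffer and `memoryview` consumes it, so two objects share one block of memory along with the NumPy-style format/shape metadata PEP 3118 standardized.

```python
import array

# Two stdlib objects sharing one memory block via the PEP 3118 protocol.
a = array.array("d", [1.0, 2.0, 3.0])
m = memoryview(a)            # zero-copy view over the array's buffer
m[0] = 99.0                  # writing through the view mutates the array

# The view carries the "structure of NumPy arrays" metadata:
fmt, itemsize, shape = m.format, m.itemsize, m.shape   # "d", 8, (3,)
```

NumPy consumes the same protocol, which is how `numpy.frombuffer` and similar zero-copy interchange work.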

Page 92: Continuum Analytics and Python

Putting Rest of NumPy in Std Lib

92

• memtype
  • dtype system on memory-views
  • extensible with Numba and C
  • extensible with Python
• gufunc
  • generalized low-level function dispatch on memtype
  • extensible with Numba and C
  • usable by any Python

Working on this now with a (small) team — could use funding

Page 93: Continuum Analytics and Python

93

• Python has had a long and fruitful history in Data Analytics
• It will have a long and bright future with your help!
• Contribute to the PyData community and make the world a better place!

The Future of Python

Page 94: Continuum Analytics and Python

© 2015 Continuum Analytics - Confidential & Proprietary

Thanks

October 1, 2015

• SIG for hosting tonight and inviting me to come
• DARPA XDATA program (Chris White and Wade Shen), which helped fund Numba, Blaze, Dask and Odo
• Investors of Continuum
• Clients and customers of Continuum who help support these projects
• NumFOCUS volunteers
• PyData volunteers