pydata boston 2013

Continuum Analytics: What we are doing and why Travis Oliphant, PhD Continuum Analytics, Inc

Upload: travis-oliphant

Post on 27-Jan-2015


DESCRIPTION

A description of Continuum Analytics projects focusing on the open source work. Includes descriptions of Wakari, Numba, Conda, Bokeh, Blaze, and CDX.

TRANSCRIPT

Page 1: PyData Boston 2013

Continuum Analytics: What we are doing and why

Travis Oliphant, PhD
Continuum Analytics, Inc

Page 2: PyData Boston 2013

This talk will be (mostly) about the free and/or open source software we are building.

Enterprise Python

• Scientific Computing
• Data Processing
• Data Analysis
• Visualisation
• Scalable Computing

• Products
• Training
• Support
• Consulting

Page 3: PyData Boston 2013

Began operations in January of 2012 with 5 people with big dreams

Python

Page 4: PyData Boston 2013

We are big backers of NumFOCUS and organizers of PyData

Spyder

Page 5: PyData Boston 2013

Our Team

Scientists, Developers, Business

+5 contractors, +4 interns, +1 part-time

Page 6: PyData Boston 2013

Big Picture

expertise

insights

Page 7: PyData Boston 2013

NumPy and SciPy are quite successful

Thanks to the large and diverse community around them (Matplotlib, IPython, SymPy, Pandas, etc.)

I estimate 1.5 million to 2 million users

Only incremental improvements are possible with these projects at this point.

Thus, we needed to start new projects...

Page 8: PyData Boston 2013

Related Open Source Projects

Blaze: High-performance Python library for modern vector computing, distributed and streaming data

Numba: Vectorizing Python compiler for multicore and GPU, using LLVM

Bokeh: Interactive, grammar-based visualization system for large datasets

Common Thread: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation

Page 9: PyData Boston 2013

Conda and Anaconda

• Cross-platform package management
• Multiple environments allow you to have multiple versions of packages installed on one system
• Easy app deployment
• Taming open source

Free for all users

Enterprise support available!

Page 10: PyData Boston 2013

Why Conda?

• Linux users stopped complaining about Python deployment

• I made major mistakes in management of NumPy/SciPy:
    • too much in SciPy (SciPy as distribution) --- scikits model is much better (tighter libraries developed by smaller teams)
    • gave in to community desire for binary ABI compatibility in NumPy, motivated by difficulty of reproducing a Python install

• Need for a cross-platform way to install major Python extensions (many with dependencies on large C or C++ libraries)

• Python can’t be ubiquitous if people struggle to just get it and then manage it.

Page 11: PyData Boston 2013

Why conda

Information Technology

Page 12: PyData Boston 2013

What is Conda

• Full package management (like yum or apt-get) but cross-platform

• Control over environments (using hard-link farms) --- better than virtualenv. virtualenv today is like distutils and setuptools of several years ago (great at first, but you will end up hating it)

• Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS)

• SAT solver to manage dependencies

• User-definable repositories
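The "SAT solver to manage dependencies" bullet can be illustrated with a toy resolver. This is a brute-force stand-in and not conda's implementation: the `INDEX` contents, the version numbers, and the `resolve` function are all hypothetical. A real SAT solver would encode the same constraints as boolean clauses instead of enumerating every combination.

```python
from itertools import product

# Hypothetical package index (toy data, not real requirements):
# name -> version -> {dependency name: set of acceptable versions}
INDEX = {
    "app": {"1.0": {"numpy": {"1.6"}, "scipy": {"0.12", "0.13"}}},
    "scipy": {
        "0.12": {"numpy": {"1.6", "1.7"}},
        "0.13": {"numpy": {"1.7"}},
    },
    "numpy": {"1.6": {}, "1.7": {}},
}


def resolve(index):
    """Pick one version per package so that every dependency constraint holds.

    Tries candidate combinations in a fixed order (naive string ordering of
    versions, highest first) and returns the first consistent one, or None
    if no combination satisfies all constraints.
    """
    names = sorted(index)
    candidates = (sorted(index[name], reverse=True) for name in names)
    for versions in product(*candidates):
        chosen = dict(zip(names, versions))
        consistent = all(
            chosen[dep] in allowed
            for name, version in chosen.items()
            for dep, allowed in index[name][version].items()
        )
        if consistent:
            return chosen
    return None


solution = resolve(INDEX)
print(solution)
```

In this toy index the newest scipy cannot be used, because `app` pins the older numpy while the newest scipy requires the newer one. The search therefore falls back to the older scipy.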

Page 13: PyData Boston 2013

New Features and Binstar

• Build command from recipe --- many recipes here: https://github.com/ContinuumIO/conda-recipes
• Upload recipes to Binstar (last mile in binary package hosting and deployment for any language).
• “binstar in beta” is the beta code
• Personal conda repositories --- https://conda.binstar.org/travis
• Free Continuum Anaconda repo will be on binstar.org
• Private packages and behind-the-firewall satellites available
• *Free build queue on Linux (Mac and Windows coming soon) for hosted conda recipes

Page 14: PyData Boston 2013

Demo

create Python 3 environment with IPython and scipy

create new recipe from PyPI (yunomi)

Page 15: PyData Boston 2013

Packaging and Distribution Solved

• conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions).

• they are supported solutions that can remove the technology pain of managing Python

• some problems, though, are people

Page 16: PyData Boston 2013

Anaconda (open)

Free enterprise-ready Python distribution of open-source tools for large-scale data processing, predictive analytics, and scientific computing

Page 17: PyData Boston 2013

Anaconda Add-Ons (paid-for)

• Revolutionary Python to GPU compiler
• Extends Numba to take a subset of Python to the GPU (program CUDA in Python)
• CUDA FFT / BLAS interfaces

Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files.

NumPy, SciPy, scikit-learn, NumExpr compiled against Intel’s Math Kernel Library (MKL)

Page 18: PyData Boston 2013

Launcher

Page 19: PyData Boston 2013

Why Numba?

• Python is too slow for loops
• Most people are not learning C/C++/Fortran today
• Cython is an improvement (but still verbose and needs a C compiler)
• NVIDIA using LLVM for the GPU
• Many people working with large typed containers (NumPy arrays)
• We want to take high-level, array-oriented expressions and compile them to fast code

Page 20: PyData Boston 2013

NumPy + Mamba = Numba

[Diagram labels: Python Function; LLVMPY; LLVM Library; Machine Code. Vendors: Intel, Nvidia, Apple, AMD, ARM. Technologies: OpenCL, ISPC, CUDA, CLANG, OpenMP.]

Page 21: PyData Boston 2013

Example

Numba

Page 22: PyData Boston 2013

Numba

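from numba import jit

@jit('void(f8[:,:],f8[:,:],f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    # Visit only the pixels where the whole filter window fits in the image
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            # Accumulate the weighted sum over the filter window
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2,j+l-n//2]*filt[k, l]
            output[i,j] = result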

~1500x speed-up
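To make the filter slide concrete, here is a minimal sketch of calling the same loop nest and checking its output. Several things are assumptions that do not come from the slides: the name `filter2d`, the shifted-slice reference function, the test data, and the fallback that runs the loop uncompiled when Numba is not installed. The sketch checks correctness only and does not attempt to reproduce the quoted speed-up.

```python
import numpy as np

try:
    from numba import njit  # compiled path, if Numba is installed
except ImportError:
    def njit(func):
        """Fallback decorator: run the function uncompiled."""
        return func


@njit
def filter2d(image, filt, output):
    # Same loop nest as the slide; renamed to avoid shadowing the builtin filter()
    M, N = image.shape
    m, n = filt.shape
    for i in range(m // 2, M - m // 2):
        for j in range(n // 2, N - n // 2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i + k - m // 2, j + l - n // 2] * filt[k, l]
            output[i, j] = result


def filter2d_reference(image, filt):
    """Shifted-slice NumPy version, used only to check the loop."""
    M, N = image.shape
    m, n = filt.shape
    rows, cols = M - 2 * (m // 2), N - 2 * (n // 2)
    out = np.zeros_like(image)
    for k in range(m):
        for l in range(n):
            out[m // 2:M - m // 2, n // 2:N - n // 2] += (
                filt[k, l] * image[k:k + rows, l:l + cols]
            )
    return out


image = np.arange(30.0).reshape(5, 6)   # small test image
filt = np.full((3, 3), 1.0 / 9.0)       # 3x3 box (averaging) filter
output = np.zeros_like(image)
filter2d(image, filt, output)
print(np.allclose(output, filter2d_reference(image, filt)))
```

Border pixels are never written, so they keep whatever value `output` held before the call (zero here).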

Page 23: PyData Boston 2013

Numba changes the game!

[Diagram labels: source languages C, C++, Fortran, Python; LLVM IR; targets x86, ARM, PTX]

Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++

Page 24: PyData Boston 2013

Laplace Example

from numba import jit

@jit('void(double[:,:], double, double)')
def numba_update(u, dx2, dy2):
    nx, ny = u.shape
    # Sweep the interior points; boundary values stay fixed
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                      (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

Adapted from http://www.scipy.org/PerformancePython originally by Prabhu Ramachandran

@jit('void(double[:,:], double, double)')
def numbavec_update(u, dx2, dy2):
    # Same stencil written as one array expression over the interior
    u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                    (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
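One detail worth checking when comparing the two Laplace variants: in plain NumPy the loop form updates `u` in place, so later points see already-updated neighbours. The sliced form evaluates the whole right-hand side from the old values before assigning. The sketch below runs one sweep of each without Numba; the function names, the 4x4 grid, and the boundary values are illustrative assumptions.

```python
import numpy as np


def loop_update(u, dx2, dy2):
    # In-place sweep: u[i-1, j] and u[i, j-1] were already updated this sweep
    nx, ny = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = ((u[i + 1, j] + u[i - 1, j]) * dy2 +
                       (u[i, j + 1] + u[i, j - 1]) * dx2) / (2 * (dx2 + dy2))


def vec_update(u, dx2, dy2):
    # The right-hand side is computed entirely from old values, then assigned
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))


def make_grid():
    u = np.zeros((4, 4))
    u[0, :] = 1.0  # fixed boundary value along the first row
    return u


u_loop, u_vec = make_grid(), make_grid()
loop_update(u_loop, 1.0, 1.0)
vec_update(u_vec, 1.0, 1.0)
print(u_loop[1, 2], u_vec[1, 2])
```

After a single sweep the interior values differ between the two forms, although both leave the boundary untouched. A timing comparison of the two is therefore a comparison per sweep, not of identical arithmetic.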

Page 25: PyData Boston 2013

Results of Laplace example

Version         Time   Speed Up
NumPy           3.19   1.0
Numba           2.32   1.38
Vect. Numba     2.33   1.37
Cython          2.38   1.34
Weave           2.47   1.29
Numexpr         2.62   1.22
Fortran Loops   2.30   1.39
Vect. Fortran   1.50   2.13

https://github.com/teoliphant/speed.git
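The speed-up column appears to be the NumPy time divided by each version's time. A short sketch that recomputes it from the times in the table (the variable names are illustrative, and the slide does not state the time units):

```python
# (version, time, speed-up as listed on the slide)
table = [
    ("NumPy", 3.19, 1.0),
    ("Numba", 2.32, 1.38),
    ("Vect. Numba", 2.33, 1.37),
    ("Cython", 2.38, 1.34),
    ("Weave", 2.47, 1.29),
    ("Numexpr", 2.62, 1.22),
    ("Fortran Loops", 2.30, 1.39),
    ("Vect. Fortran", 1.50, 2.13),
]

baseline = table[0][1]  # the NumPy time is the reference
speedups = {name: baseline / time for name, time, _ in table}

for name, time, listed in table:
    print(f"{name:<14} {time:5.2f} {speedups[name]:5.2f} (slide: {listed})")
```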

Page 26: PyData Boston 2013

LLVMPy worth looking at

LLVM (via LLVMPy) has done much heavy lifting

LLVMPy = Compilers for everybody

Page 27: PyData Boston 2013

What is wrong with NumPy?

• Dtype system is difficult to extend
• Many Dtypes needed (missing data, enums, variable-length strings)
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Standard structure of arrays representation...
• Missing multi-methods
• Optimization / minimal support for multi-core / GPU
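The "immediate mode creates huge temporaries" point can be seen with a three-array expression. In the sketch below (array names and sizes are arbitrary), the first form lets NumPy allocate a full-size intermediate for `b * c` before adding `a`. The second form reuses one preallocated buffer through the ufunc `out=` argument.

```python
import numpy as np

n = 100_000
a = np.arange(n, dtype=np.float64)
b = np.full(n, 2.0)
c = np.full(n, 3.0)

# Immediate mode: b * c is materialized as a temporary array of n elements,
# then a second array of n elements is allocated for the sum.
eager = a + b * c

# Manual alternative: one buffer holds the product and then the final result.
out = np.empty_like(a)
np.multiply(b, c, out=out)
np.add(a, out, out=out)

print(np.array_equal(eager, out))
```

Writing every expression this way by hand is tedious, which is the motivation the slide gives for tools that evaluate whole expressions at once.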

Page 28: PyData Boston 2013

Now What?

After watching NumPy and SciPy get used all over Wall Street and by many scientists / engineers in industry --- what would we do differently?

Page 29: PyData Boston 2013

New Project

Blaze

Out of Core, Distributed and Optimized NumPy

Page 30: PyData Boston 2013

Blaze Array or Table

[Diagram labels: Data Descriptor; Data Buffer; Index; Operation; NumPy; BLZ; Persistent Format; RDBMS; CSV; Data Stream]

Page 31: PyData Boston 2013

Blaze Deferred Arrays

[Diagram: expression tree for A + B*C, with "+" at the root over A and "*", and "*" over B and C]

A + B*C

• Symbolic objects which build a graph
• Represents deferred computation

Usually what you have when you have a Blaze Array
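The idea of symbolic objects that build a graph can be sketched in a few lines of Python. This is not Blaze's API: the `Node` class and its `evaluate` method are hypothetical and only show how `A + B*C` can be recorded as a tree and computed later.

```python
class Node:
    """One node of a deferred expression graph (hypothetical, not Blaze)."""

    def __init__(self, op=None, children=(), name=None):
        self.op = op              # "+" or "*"; None marks a leaf
        self.children = children  # operand nodes
        self.name = name          # leaf name, looked up at evaluation time

    def __add__(self, other):
        return Node("+", (self, other))

    def __mul__(self, other):
        return Node("*", (self, other))

    def __repr__(self):
        if self.op is None:
            return self.name
        left, right = self.children
        return f"({left!r} {self.op} {right!r})"

    def evaluate(self, env):
        """Walk the graph and compute a value from the bindings in env."""
        if self.op is None:
            return env[self.name]
        left, right = (child.evaluate(env) for child in self.children)
        return left + right if self.op == "+" else left * right


A, B, C = (Node(name=n) for n in "ABC")
expr = A + B * C          # builds the graph; nothing is computed yet
print(expr)
result = expr.evaluate({"A": 1, "B": 2, "C": 3})
print(result)
```

Because the whole expression is available before any work happens, an evaluator is free to fuse operations or choose where to run them, instead of computing `B * C` eagerly.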

Page 32: PyData Boston 2013

DataShape Type System

• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility

Shape DType

DataShape
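As a rough illustration of the Shape and DType pairing on this slide, the sketch below bundles the two into one object. `ToyDataShape` and the `ITEMSIZE` table are hypothetical and say nothing about Blaze's actual datashape grammar. They only show that one description carrying both pieces is enough to derive things like total storage size.

```python
from dataclasses import dataclass
from math import prod

# Bytes per element for a few fixed-size element types (illustrative subset)
ITEMSIZE = {"int32": 4, "float32": 4, "float64": 8}


@dataclass(frozen=True)
class ToyDataShape:
    """Hypothetical pairing of a shape with an element type."""
    shape: tuple
    dtype: str

    @property
    def nbytes(self):
        # Number of elements times the size of one element
        return prod(self.shape) * ITEMSIZE[self.dtype]


ds = ToyDataShape(shape=(3, 4), dtype="int32")
total = ds.nbytes
print(ds, total)
```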

Page 33: PyData Boston 2013

Blaze

Database

GPU Node

Array Server

NFS

Array Server

Array Server

Blaze Client

Synthesized Array/Table view

array+sql://

array://

file:// array://

Python REPL, Scripts

Viz Data Server

C, C++, FORTRAN

JVM languages

Page 34: PyData Boston 2013

Progress

• Basic calculations work out-of-core (via Numba and LLVM)

• Hard dependency on dynd and dynd-python (a dynamic C++-only multi-dimensional library like NumPy but with many improvements)

• Persistent arrays from BLZ
• Basic array-server functionality for layering over CSV files
• 0.2 release in 1-2 weeks. 0.3 within a month after that (first usable release)

Page 35: PyData Boston 2013

DARPA providing help

DARPA-BAA-12-38: XDATA

TA-1: Scalable analytics and data processing technology

TA-2: Visual user interface technology

Page 36: PyData Boston 2013

Bokeh Plotting Library

• Interactive graphics for the web
• Designed for large datasets
• Designed for streaming data
• Native interface in Python
• Fast JavaScript component
• DARPA funded
• v0.1 release imminent

Page 37: PyData Boston 2013

Reasons for Bokeh

1. Plotting must happen near the data too
2. Quick iteration is essential => interactive visualization
3. Interactive visualization on remote data => use the browser
4. Almost all web plotting libraries are either:
    1. Designed for JavaScript programmers
    2. Designed to output static graphs
5. We designed Bokeh to be dynamic graphing in the web for Python programmers
6. Will include “Abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)

Page 38: PyData Boston 2013

Wakari

• Browser-based data analysis and visualization platform

• Wordpress / YouTube / Github for data analysis

• Full Linux environment with Anaconda Python

• Can be installed on internal clusters & servers

Page 39: PyData Boston 2013

Why Wakari?

• Data is too big to fit on your desktop
• You need compute power but don’t have easy access to a large cluster (cloud is sitting there with lots of power)
• Configuration of software on a new system stinks (especially a cluster).
• Collaborative Data Analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring they do painful configuration to see your results)
• IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data).

Page 40: PyData Boston 2013

Wakari

• Free account has 512 MB RAM / 10 GB disk and shared multi-core CPU

• Easily spin up map-reduce clusters (Disco and Hadoop)
• Use IPython Parallel on many nodes in the cloud
• Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari, based on the full power of scientific Python --- complex technical workflows (IPython notebook for now)

Page 41: PyData Boston 2013

Basic Data Explorer

Page 42: PyData Boston 2013

Continuum Data Explorer (CDX)

• Open Source
• Goal is interactivity
• Combination of IPython REPL, Bokeh, and tables
• Tight integration between GUI elements and REPL
• Current features
    - Namespace viewer (mapped to IPython namespace)
    - DataTable widget with group-by, computed columns, advanced filters
    - Interactive Plots connected to tables

Page 43: PyData Boston 2013

CDX

Page 44: PyData Boston 2013

Conclusion

Projects circle around giving tools to experts (occasional programmers or domain experts) to enable them to move their expertise to the data to get insights (keep data where it is and move high-level but performant code).

Join us or ask how we can help you!