
Python in Hadoop Ecosystem Blaze and Bokeh

Presented by: Andy R. Terrel

Python and Hadoop with Blaze and Bokeh, SC14 / PyHPC 2014

About Continuum Analytics

Intro Large scale data analytics Interactive data visualization A practical example

http://continuum.io/

We build technologies that enable analysts and data scientists to answer questions from the data all around us.

Areas of Focus
• Software solutions
• Consulting
• Training

Committed to Open Source
• Anaconda: Free Python distribution
• Numba, Conda, Blaze, Bokeh, dynd
• Sponsor


About Andy

Andy R. Terrel (@aterrel)
Chief Scientist, Continuum Analytics
President, NumFOCUS

Background:
• High Performance Computing
• Computational Mathematics
• President, NumFOCUS foundation

Experience analyzing diverse datasets:
• Finance
• Simulations
• Web data
• Social media


About this talk

Visualizing Data with Blaze and Bokeh

Objective: Introduction to large-scale data analytics and interactive visualization

Structure:
1. Discussion of Hadoop
2. Large scale data analytics - Blaze
3. Interactive data visualization - Bokeh


Large scale data analytics - An Overview

[Figure: tool landscape across four areas: BI / DB, DM / Stats / ML, Scientific Computing, Distributed Systems. Tool labels recovered from the figure: Numba, bcolz, RHadoop.]


Base Hadoop Stack


Berkeley Data Science Stack


Where is Python?

• Hadoop Streaming
• PySpark
• Pig
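As a rough illustration of the Hadoop Streaming entry in this list: Hadoop Streaming runs ordinary executables as mapper and reducer, exchanging tab-separated key/value lines over stdin and stdout. The sketch below writes a word-count pair as plain functions. The function names and the in-process `run_local` driver are made up for this note; `sorted()` stands in for the shuffle/sort step that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (ln.split("\t") for ln in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"


def run_local(lines):
    """Hypothetical local driver: sorted() replaces Hadoop's shuffle/sort."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["spark pig spark", "pig hadoop spark"])
```

In an actual job the two functions would live in separate scripts that read `sys.stdin` and print to stdout, passed to the streaming jar through its `-mapper` and `-reducer` options.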


“At my company X, we have peta/terabytes of data, just lying around, waiting for someone to explore it.”

- someone at PyTexas

Let’s make it easier for users to explore and extract useful insights out of data.

• Anaconda: Free enterprise-ready Python distribution
• Conda: Package manager
• Blaze: Scale
• Bokeh: Interactive data visualizations
• Numba: Power to speed up
• Wakari: Share and deploy


Blaze

Data Pain

• Dealing with data applications has numerous pain points
- Hundreds of data formats
- Basic programs expect all data to fit in memory
- Data analysis pipelines constantly changing from one form to another
- Sharing analysis carries significant overhead to configure systems
- Parallelizing analysis requires an expert in a particular distributed computing stack


Blaze

Source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/


[Figure: Blaze shown among four areas: Distributed Systems, Scientific Computing, BI / DB, DM / Stats / ML. Other labels in the figure: bcolz, hdf5.]

• Connecting technologies to users
• Connecting technologies to each other

[Figure: Blaze diagram with the headings Data Storage, Abstract expressions, and Computational backend. Labels in the figure: csv, HDF5, bcolz, DataFrame, HDFS, json; selection, filter, group by, join, column wise; Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy.]

Blaze Architecture

[Figure: components labeled API, Deferred Expr, Compilers, Interpreters, Data, Compute]

• Flexible architecture to accommodate exploration

• Use compilation of deferred expressions to optimize data interactions

Blaze Expr

[Figure: deferred expression graph with inputs temps.hdf5, nasdaq.sql, tweets.json and steps Join by date, Select NYC, Find Tech Selloff, Plot]

• Lazy computation to minimize data movement
• Simple DAG for compilation to
  • parallel application
  • distributed memory
  • static optimizations
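The idea of deferring computation can be sketched in a few lines. The classes below are a toy and not Blaze's expression system: `Symbol`, `Filter`, `Count`, and this `compute` function are hypothetical names chosen for the illustration. Building the expression touches no data; work happens only when `compute` walks the tree against a concrete source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Symbol:
    """Placeholder for a dataset that is bound later."""
    name: str


@dataclass(frozen=True)
class Filter:
    """Keep only the rows of `child` for which `predicate` is true."""
    child: object
    predicate: Callable


@dataclass(frozen=True)
class Count:
    """Number of rows produced by `child`."""
    child: object


def compute(expr, bindings):
    """Walk the expression tree, pulling rows lazily from the bound data."""
    if isinstance(expr, Symbol):
        return iter(bindings[expr.name])
    if isinstance(expr, Filter):
        return (row for row in compute(expr.child, bindings) if expr.predicate(row))
    if isinstance(expr, Count):
        return sum(1 for _ in compute(expr.child, bindings))
    raise TypeError(f"unknown expression node: {expr!r}")


# Describe the computation first; no data is involved yet.
tweets = Symbol("tweets")
expr = Count(Filter(tweets, lambda row: row["city"] == "NYC"))

# Bind invented sample data and evaluate.
data = [{"city": "NYC"}, {"city": "Austin"}, {"city": "NYC"}]
nyc_count = compute(expr, {"tweets": data})
```

The same `expr` can be evaluated again with a different binding, which is the property that lets a real system pick where and how to run an expression.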

[Figure: the same Blaze diagram (Data Storage, Abstract expressions, Computational backend), labeled Blaze.expressions]

Blaze Data

• Single interface for data layers

• Composition of different formats

• Simple API to add custom data formats

[Figure: Data layer connected to SQL, CSV, HDFS, JSON, Mem, HDF5, Custom]

[Figure: the same Blaze diagram (Data Storage, Abstract expressions, Computational backend), labeled Blaze.data]

Blaze Compute

[Figure: Compute layer connected to DyND, Pandas, PyTables, Spark]

• Computation abstraction over numerous data libraries

• Simple multi-dispatched visitors to implement new backends

• Allows plumbing between stacks to be seamless to user
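One way to picture "multi-dispatched visitors" is a table of implementations keyed by (expression type, data type), so that supporting a new backend means registering new functions rather than editing existing ones. The registry below is a hand-rolled sketch with hypothetical names (`register`, `compute`, `Sum`); Blaze's own dispatch machinery is not shown on the slides and may differ.

```python
_REGISTRY = {}


def register(expr_type, data_type):
    """Decorator: record an implementation for one (expression, backend) pair."""
    def wrap(fn):
        _REGISTRY[(expr_type, data_type)] = fn
        return fn
    return wrap


class Sum:
    """Toy expression node: total of one named column."""
    def __init__(self, column):
        self.column = column


def compute(expr, data):
    """Pick the implementation that matches both argument types."""
    for (etype, dtype), fn in _REGISTRY.items():
        if isinstance(expr, etype) and isinstance(data, dtype):
            return fn(expr, data)
    raise NotImplementedError((type(expr).__name__, type(data).__name__))


@register(Sum, list)   # "backend" 1: a list of row dicts
def _sum_rows(expr, rows):
    return sum(row[expr.column] for row in rows)


@register(Sum, dict)   # "backend" 2: a column-oriented dict of lists
def _sum_columns(expr, columns):
    return sum(columns[expr.column])


total_rows = compute(Sum("amount"), [{"amount": 1}, {"amount": 2}])
total_cols = compute(Sum("amount"), {"amount": [1, 2]})
```

The caller writes `compute(expr, data)` in both cases; which implementation runs depends only on the type of `data`.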

[Figure: the same Blaze diagram (Data Storage, Abstract expressions, Computational backend), labeled Blaze.compute]

Blaze Example - Counting Weblinks

Common Blaze Code

    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())

    # Data Load
    idx, arc = load_data()

    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})
    in_deg = dict(ans)
    in_deg[u'blogspot.com']
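For readers unfamiliar with the expression above, here is what a join on `node_id` followed by a count grouped by `name` works out to on in-memory data. This is plain Python rather than Blaze, and the sample rows are invented for illustration; only the column names come from the slide.

```python
from collections import Counter

# Invented sample data using the same columns as the slide's schemas.
idx = [("blogspot.com", 1), ("python.org", 2)]   # (name, node_id)
arc = [(2, 1), (3, 1), (1, 2)]                   # (node_out, node_id)

# Join arc to idx on node_id, then count rows per name (the in-degree).
name_by_id = {node_id: name for name, node_id in idx}
in_deg = Counter(
    name_by_id[node_id] for _, node_id in arc if node_id in name_by_id
)
```

Looking up `in_deg["blogspot.com"]` then gives the number of links pointing at that site in the sample data.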

Blaze Example - Counting Weblinks

load_data

Using Spark + HDFS

    from pyspark import SparkContext

    def load_data():
        sc = SparkContext("local", "Simple App")
        # Each line is tab-separated: name<TAB>node_id
        idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
        idx = idx.map(lambda x: x.split('\t')) \
                 .map(lambda x: [x[0], int(x[1])])
        # Each line is tab-separated: node_out<TAB>node_id
        arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
        arc = arc.map(lambda x: x.split('\t')) \
                 .map(lambda x: [int(x[0]), int(x[1])])
        return idx, arc

Using Pandas + Local Disc

    from pandas import DataFrame

    def load_data():
        with open("example_index.txt") as f:
            idx = [ln.strip().split('\t') for ln in f]
        idx = DataFrame(idx, columns=['name', 'node_id'])
        # Cast ids to int so they match the int32 schema and the Spark loader
        idx['node_id'] = idx['node_id'].astype(int)

        with open("example_arcs.txt") as f:
            arc = [ln.strip().split('\t') for ln in f]
        arc = DataFrame(arc, columns=['node_out', 'node_id']).astype(int)
        return idx, arc


Blaze.API

Table

Using the interactive Table object we can interact with a variety of computational backends with the familiarity of a local DataFrame


Blaze.API

Table


Blaze.API

Migrations - into

The into function makes it easy to move data from one container type to another
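The shape of such a migration function can be sketched as a lookup on (target type, source type). The `into_sketch` helper below is a stand-in written for this note and handles only a couple of built-in containers; it is not the Blaze function, whose supported containers and exact signature should be checked in the Blaze documentation.

```python
import csv
import io

_CONVERTERS = {}


def converts(target, source):
    """Decorator: record how to turn a `source` container into a `target`."""
    def wrap(fn):
        _CONVERTERS[(target, source)] = fn
        return fn
    return wrap


def into_sketch(target, data):
    """Move `data` into a container of type `target` (hypothetical helper)."""
    try:
        fn = _CONVERTERS[(target, type(data))]
    except KeyError:
        raise NotImplementedError(
            f"{type(data).__name__} -> {target.__name__}"
        ) from None
    return fn(data)


@converts(list, str)
def _csv_text_to_rows(text):
    # Treat the string as CSV text and return a list of row tuples.
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


@converts(dict, list)
def _rows_to_columns(rows):
    # First row is the header; pivot the rest into column lists.
    header, *body = rows
    return {name: [r[i] for r in body] for i, name in enumerate(header)}


rows = into_sketch(list, "name,node_id\nblogspot.com,1\n")
columns = into_sketch(dict, rows)
```

The caller names only the destination type; the source type is inspected to find a conversion route.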


Blaze notebooks


Why I like using Blaze

- Syntax is very similar to Pandas
- Easy to scale
- Easy to find the best computational backend for a particular dataset
- Easy to adapt my code if someone hands me a dataset in a different format/backend


Want to learn more about Blaze?

Free Webinar:
http://www.continuum.io/webinars/getting-started-with-blaze

Blogpost:
http://continuum.io/blog/blaze-expressions
http://continuum.io/blog/blaze-migrations
http://continuum.io/blog/blaze-hmda

Docs and source code:
http://blaze.pydata.org/
https://github.com/ContinuumIO/blaze


Data visualization - An Overview

Results presentation    Visual analytics
Static                  Interactive
Small datasets          Large datasets
Traditional plots       Novel graphics


Bokeh

• Interactive visualization

• Novel graphics

• Streaming, dynamic, large data

• For the browser, with or without a server

• Matplotlib compatibility

• No need to write JavaScript

http://bokeh.pydata.org/
https://github.com/ContinuumIO/bokeh


Bokeh - Interactive, Visual analytics

• Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)


Bokeh - Interactive, Visual analytics

• Widgets and dashboards


Bokeh - Interactive, Visual analytics


• Crossfilter


Bokeh - Large datasets

Server-side downsampling and abstract rendering
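To make "server-side downsampling" concrete: the server reduces a long series to roughly as many points as the plot can show before sending it to the browser. The min/max bucketing below is one common decimation scheme, used here as an assumption; the slides do not say which method Bokeh applies, and `downsample_minmax` is a name made up for this note.

```python
def downsample_minmax(values, n_buckets):
    """Keep the min and max of each bucket so spikes survive the reduction."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    size = max(1, -(-len(values) // n_buckets))  # ceiling division
    out = []
    for start in range(0, len(values), size):
        bucket = values[start:start + size]
        lo, hi = min(bucket), max(bucket)
        # Preserve the order in which the extremes occur within the bucket.
        if bucket.index(lo) <= bucket.index(hi):
            out.extend([lo, hi])
        else:
            out.extend([hi, lo])
    return out


# Invented series with one positive and one negative spike.
series = [0, 1, 0, 9, 0, 1, 0, 1, -5, 0, 1, 0]
reduced = downsample_minmax(series, 3)
```

Twelve points become six, and both spikes are still present in the reduced series, which a plain every-nth-point sampler would not guarantee.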


Bokeh - No JavaScript


Thank you! :)
