Sparkling Pandas
Scaling Pandas beyond a single machine (or letting Pandas roam)
With special thanks to Juliet Hougland :)
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
○ Previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What is Pandas?
user_id | panda_type
--------|-----------
01234   | giant
12345   | red
23456   | giant
34567   | giant
45678   | red
56789   | giant
● DataFrames: indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
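A quick pandas sketch of the table above, showing the indexing and filtering features just listed (the data values are copied from the slide):

```python
import pandas as pd

# The user_id/panda_type table from the slide, with user_id as the index.
df = pd.DataFrame(
    {"panda_type": ["giant", "red", "giant", "giant", "red", "giant"]},
    index=["01234", "12345", "23456", "34567", "45678", "56789"])
df.index.name = "user_id"

# Easy subsetting/filtering and simple aggregation.
red_pandas = df[df["panda_type"] == "red"]
counts = df["panda_type"].value_counts()
```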
What is Spark?
Fast, general engine for in-memory data processing.
tl;dr - 100x faster than Hadoop MapReduce*
The different pieces of Spark
Apache Spark core, plus:
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Machine learning: Spark ML & MLlib
● Community packages
Some Spark terms
● Spark Context (aka sc): the window to the world of Spark
● sqlContext: the window to the world of DataFrames
● Transformation: takes an RDD (or DataFrame) and returns a new RDD or DataFrame
● Action: causes an RDD to be evaluated (often storing the result)
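A rough analogy for the transformation/action split, using plain Python generators (this is only an illustration of laziness, not how Spark is implemented): transformations build up a pipeline without computing anything, and only the final "action" forces evaluation.

```python
# "Transformations": build a lazy pipeline; nothing is computed yet.
numbers = range(10)
doubled = (x * 2 for x in numbers)    # analogous to rdd.map(...)
big = (x for x in doubled if x > 10)  # analogous to rdd.filter(...)

# "Action": forces the whole pipeline to run, like rdd.collect().
result = list(big)
```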
DataFrames in Spark & Pandas
Spark:
● Fast
● Distributed
● Limited API
● Some ML
● I/O options
● Not indexed

Pandas:
● Fast
● Single machine
● Full-featured API
● Integration with ML
● Different I/O options
● Indexed
● Easy to visualize
Panda IMG by Peter Beardsley
Simple Spark SQL Example
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
topTweets = sqlContext.sql(
    "SELECT text, retweetCount " +
    "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()
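The same query shape can be tried locally with Python's built-in sqlite3 module as a small stand-in for Spark SQL (the tweets table and its rows here are made up for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for the Spark SQL context.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, retweetCount INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("pandas are great", 50), ("spark is fast", 120), ("hello", 3)])

# Same SELECT ... ORDER BY ... LIMIT shape as the Spark SQL example above.
top_tweets = conn.execute(
    "SELECT text, retweetCount "
    "FROM tweets ORDER BY retweetCount LIMIT 10").fetchall()
```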
Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
return transformedDdf.toPandas()
Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)
Let’s combine the two
● Spark DataFrames already provide some of what we need
○ Add UDFs / UDAFs
○ Use bits of Pandas code
● http://spark-packages.org - an excellent place to get libraries
So where does the PB&J go?
[Architecture diagram: the Sparkling Pandas API sits on top of Spark DataFrames, custom UDFs, and Pandas code; underneath, Sparkling Pandas is built from Scala code, PySpark RDDs, Pandas code, and internal state.]
Extending Spark - adding index support
self._index_names

def collect(self):
    """Collect the elements in a DataFrame and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df
Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
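A minimal sketch of what this helper does, in plain pandas: promote the stored index columns back to a real pandas index, then strip the placeholder names off indexes that were originally unnamed (the `_denormalize_names` step is approximated here by setting the names back to None; the example data is made up):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["01234", "12345"],
                   "panda_type": ["giant", "red"]})

# Promote a column to the index, as _update_index_on_df does.
indexed = df.set_index(["user_id"])

# An unnamed index would have been given a placeholder name during
# normalization; setting the names back to None removes it again.
indexed.index.names = [None]
```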
Adding a UDF in Python
sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column = new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
Extending Spark SQL w/Scala for fun & profit
def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        f = getattr(sc._jvm.com.sparklingpandas.functions, name)
        jc = f(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt

plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
Not yet merged in
Why is SparklingPandas fast*?
● Keep stuff in the JVM as much as possible
● Lazy operations
● Distributed

*For really flexible versions of the word fast
Coffee image by eltpics
Panda image by Stéfan
Panda image by cactusroot
Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
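The GroupBy operations above mirror pandas' own API; in plain pandas the same operations look like this (the weights and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"panda_type": ["giant", "red", "giant"],
                   "weight": [100.0, 5.0, 110.0]})

grouped = df.groupby("panda_type")

# A few of the operations listed above.
means = grouped["weight"].mean()    # per-group mean
totals = grouped["weight"].sum()    # per-group sum
first = grouped["weight"].first()   # first row of each group
```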
Always onwards and upwards
[Chart: work done over time, rising from "Now" toward a "Hypothetical, Wonderful Future".]
Related Works
Blaze
● http://continuum.io/blog/blaze
AdaTao's Distributed DataFrame
● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
Numba
● http://numba.pydata.org/
Using Sparkling Pandas
You can get Sparkling Pandas from:
● Website: http://www.sparklingpandas.com
● Code: https://github.com/sparklingpandas/sparklingpandas
● Mailing list: https://groups.google.com/d/forum/sparklingpandas
Getting Sparkling Pandas friends
The examples from this talk will get merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/
many pandas by David Goehring
Any questions?