Sparkling Pandas
Scaling Pandas beyond a single machine (or letting Pandas roam)
With special thanks to Juliet Hougland :)
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
○ Previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What is Pandas?
user_id | panda_type
--------|-----------
01234   | giant
12345   | red
23456   | giant
34567   | giant
45678   | red
56789   | giant
● DataFrames: indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
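A quick pandas sketch of the table above, showing the indexing and filtering features just listed (the data values are copied from the slide):

```python
import pandas as pd

# The user_id/panda_type table from the slide, with user_id as the index.
df = pd.DataFrame(
    {"panda_type": ["giant", "red", "giant", "giant", "red", "giant"]},
    index=["01234", "12345", "23456", "34567", "45678", "56789"])
df.index.name = "user_id"

# Easy subsetting/filtering and simple aggregation.
red_pandas = df[df["panda_type"] == "red"]
counts = df["panda_type"].value_counts()
```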
What is Spark?
Fast, general engine for in-memory data processing.
tl;dr - 100x faster than Hadoop MapReduce*
The different pieces of Spark
Apache Spark core, plus:
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Machine learning: Spark ML & MLlib
● Community packages
Some Spark terms
● Spark Context (aka sc): the window to the world of Spark
● sqlContext: the window to the world of DataFrames
● Transformation: takes an RDD (or DataFrame) and returns a new RDD or DataFrame
● Action: causes an RDD to be evaluated (often storing the result)
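A rough analogy for the transformation/action split, using plain Python generators (this is only an illustration of laziness, not how Spark is implemented): transformations build up a pipeline without computing anything, and only the final "action" forces evaluation.

```python
# "Transformations": build a lazy pipeline; nothing is computed yet.
numbers = range(10)
doubled = (x * 2 for x in numbers)    # analogous to rdd.map(...)
big = (x for x in doubled if x > 10)  # analogous to rdd.filter(...)

# "Action": forces the whole pipeline to run, like rdd.collect().
result = list(big)
```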
DataFrames in Spark & Pandas
Spark:
● Fast
● Distributed
● Limited API
● Some ML
● I/O options
● Not indexed

Pandas:
● Fast
● Single machine
● Full-featured API
● Integration with ML
● Different I/O options
● Indexed
● Easy to visualize
Panda IMG by Peter Beardsley
Simple Spark SQL Example
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
topTweets = sqlContext.sql(
    "SELECT text, retweetCount " +
    "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()
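The same query shape can be tried locally with Python's built-in sqlite3 module as a small stand-in for Spark SQL (the tweets table and its rows here are made up for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for the Spark SQL context.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, retweetCount INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("pandas are great", 50), ("spark is fast", 120), ("hello", 3)])

# Same SELECT ... ORDER BY ... LIMIT shape as the Spark SQL example above.
top_tweets = conn.execute(
    "SELECT text, retweetCount "
    "FROM tweets ORDER BY retweetCount LIMIT 10").fetchall()
```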
Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
return transformedDdf.toPandas()
Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)
Let’s combine the two
● Spark DataFrames already provide some of what we need
○ Add UDFs / UDAFs
○ Use bits of Pandas code
● http://spark-packages.org - an excellent place to get libraries
So where does the PB&J go?
[Architecture diagram: the Sparkling Pandas API sits on top of Spark DataFrames, custom UDFs, and Pandas code; underneath, Sparkling Pandas is built from Scala code, PySpark RDDs, Pandas code, and internal state.]
Extending Spark - adding index support
self._index_names

def collect(self):
    """Collect the elements in a DataFrame and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df
Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
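A minimal sketch of what this helper does, in plain pandas: promote the stored index columns back to a real pandas index, then strip the placeholder names off indexes that were originally unnamed (the `_denormalize_names` step is approximated here by setting the names back to None; the example data is made up):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["01234", "12345"],
                   "panda_type": ["giant", "red"]})

# Promote a column to the index, as _update_index_on_df does.
indexed = df.set_index(["user_id"])

# An unnamed index would have been given a placeholder name during
# normalization; setting the names back to None removes it again.
indexed.index.names = [None]
```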
Adding a UDF in Python
sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column = new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
Extending Spark SQL w/Scala for fun & profit
def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        f = getattr(sc._jvm.com.sparklingpandas.functions, name)
        jc = f(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt

plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
Not yet merged in
Why is SparklingPandas fast*?
● Keep stuff in the JVM as much as possible
● Lazy operations
● Distributed

*For really flexible versions of the word fast
Coffee image by eltpics
Panda image by Stéfan
Panda image by cactusroot
Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
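The GroupBy operations above mirror pandas' own API; in plain pandas the same operations look like this (the weights and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"panda_type": ["giant", "red", "giant"],
                   "weight": [100.0, 5.0, 110.0]})

grouped = df.groupby("panda_type")

# A few of the operations listed above.
means = grouped["weight"].mean()    # per-group mean
totals = grouped["weight"].sum()    # per-group sum
first = grouped["weight"].first()   # first row of each group
```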
Always onwards and upwards
[Chart: work done over time, rising from "Now" toward a "Hypothetical, Wonderful Future".]
Related Works
Blaze
● http://continuum.io/blog/blaze
AdaTao's Distributed DataFrame
● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
Numba
● http://numba.pydata.org/
Using Sparkling Pandas
You can get Sparkling Pandas from:
● Website: http://www.sparklingpandas.com
● Code: https://github.com/sparklingpandas/sparklingpandas
● Mailing list: https://groups.google.com/d/forum/sparklingpandas
Getting Sparkling Pandas friends
The examples from this talk will get merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/
many pandas by David Goehring
Any questions?