pydata: the next generation

© Cloudera, Inc. All rights reserved.

PyData: The Next Generation
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15


PyData: Everything’s awesome…or is it?
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15


Me

• Data systems, tools, Python guru at Cloudera
• Formerly Founder/CEO of DataPad (visual analytics startup)
• Created pandas in 2008, lead developer until 2013
• Python for Data Analysis, published 10/2012
  • O’Reilly’s best-selling data book of 2014
• Pythonista since 2007


What’s this about?

• Hopes and fears for the community and ecosystem
• Why do I care?
  • Python is fun!
  • Leverage
  • Accessibility for newbies
  • Community: smart, nice, humble people


Python at Cloudera

• Want Cloudera platform users to be successful with Python
• Spark/PySpark part of the Enterprise Data Hub / CDH
• Actively investing in Python tooling
  • (p.s. we’re hiring?)
  • (p.p.s. we have an Austin office now!)


Historical perspective and background

• 20 years of fast numerical computing in Python (Numeric 1995)
• 10 years of NumPy
• PyData becomes a thing in 2012
• Python as a data language goes mainstream
  • Job descriptions tell all
• Shift in larger Python community from web towards data
  • PyCon 2015 committee reported substantial growth in data-related submissions!


How’d this happen?

• Data, data everywhere
• Science! scikit-learn, statsmodels, and friends
• Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
• IPython Notebook
• Learning resources (books, conferences, blogs, etc.)
• Python environment/library management that “just works”


Put a Python (interface) on it! Something no one got fired for, ever.


Meanwhile…

• Hadoop and Big Data go mainstream in 2009 onward
  • First Hadoop World: Fall 2009
  • First Strata conference: Spring 2011
• Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
• Solutions built, frameworks developed, companies founded
• Python was generally not a central part of those solutions
• A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)


We’re lucky to have lots of nice things

• What a language!
• IPython: interactive computing and collaboration
• Libraries to solve nearly any (non-big data) problem
• Trustworthy (medium) data wrangling, statistics, machine learning
• HPC / GPU / parallel computing frameworks
• FFI tools
• … and much more


“If this isn’t nice, what is?” —Kurt Vonnegut


So, what kind of big data?

• Big multidimensional arrays / linear algebra

• Big tables (structured data)

• Big text data (unstructured data)

• Empirically I personally am mostly interested in big tables


What kind of big data problems?

• ETL / Data Wrangling
  • Python has been used here for years with Hadoop Streaming

• BI / Analytics (“things you can do in SQL”)

• Advanced Analytics / Machine Learning


Some ways we are #winning

• Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing

• Huge uptake in the financial sector

• Many current and upcoming generations of data scientists learning Python as a first language

• Python in HPC / scientific computing


Some ways we are not #winning

• Python still doesn’t have a great “big data story”

• Little venture capital trickling down to Python projects

• Data structures and programming APIs lagging modern realities
  • Weak support for emerging data formats

• Many companies with Python big data successes have not open-sourced their work


Python in big data workflows in practice

[Diagram. Axis labels: "Big Data, Many machines" vs. "Small/Medium Data, One Machine"; "More counting / ETL" vs. "More insights / reporting". Components shown: HDFS, Hadoop-MR, Spark, SQL, DSLs, pandas, Viz tools, ML / Stats.]


Big data storage formats

• JSON and CSV are not a good way to warehouse data
• Apache Avro
  • Compact binary data serialization format
  • RPC framework

• Apache Parquet
  • Efficient columnar data format optimized for HDFS
  • Supports nested and repeated fields, compression, encoding schemes
  • Co-developed by Twitter and Cloudera
  • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
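
To make the row-versus-columnar distinction concrete, here is a minimal pure-Python sketch. It is not Avro or Parquet code: the records, the field names, and the helper `to_columns` are all invented for illustration, and real formats add schemas, encodings, and compression on top of this idea.

```python
# Hypothetical records; the field names are invented for this sketch.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "DE", "spend": 4.5},
    {"user": "c", "country": "US", "spend": 7.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented scan: every whole record is visited to read one field.
total_from_rows = sum(record["spend"] for record in rows)

# Column-oriented scan: only the single column that the query needs is read.
total_from_columns = sum(columns["spend"])

print(total_from_rows, total_from_columns)
```

An aggregate over one field only has to touch one contiguous list in the columnar layout, which is the property that makes columnar files attractive for analytic scans.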


We’re living in a JVM world

• Scala rapidly taking over big data analytics
  • Functional, concise, good for building high level DSLs
  • Build nice Scala APIs to clunkier Java frameworks

• JVM legitimately good for concurrent, distributed systems

• Binary interface with Python a major issue


Dremel, baby, Dremel…

• VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
  • Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
  • Peta-scale analytics directly on nested data

• Google BigQuery said to be an IaaS-ification of Dremel
  • Supports SQL variant + new user-defined functions with JavaScript + V8

SELECT COUNT(c1 > c2) FROM
  (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
          SUM(a.b.p.q.r) WITHIN RECORD AS c2
   FROM T3)
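
As a rough illustration of what the query above asks for, the sketch below evaluates it in plain Python over two made-up nested records. The record contents, the choice of which fields are repeated, and the helpers `iter_path` and `sum_within_record` are assumptions. It also reads `COUNT(c1 > c2)` as "number of records where the comparison holds", which may not match Dremel's exact semantics.

```python
def iter_path(node, path):
    """Yield every leaf value along a field path, descending into lists
    (which stand in for repeated fields here)."""
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from iter_path(node[path[0]], path[1:])
    # A missing optional field simply contributes no values.


def sum_within_record(record, dotted_path):
    """Aggregate one nested path inside a single record (WITHIN RECORD)."""
    return sum(iter_path(record, dotted_path.split(".")))


# Hypothetical stand-in for table T3.
records = [
    {"a": {"b": [{"c": [{"d": 5}, {"d": 7}], "p": {"q": [{"r": 3}]}}]}},
    {"a": {"b": [{"c": [{"d": 1}], "p": {"q": [{"r": 4}, {"r": 6}]}},
                 {"c": [{"d": 2}]}]}},
]

matches = 0
for record in records:
    c1 = sum_within_record(record, "a.b.c.d")
    c2 = sum_within_record(record, "a.b.p.q.r")
    if c1 > c2:
        matches += 1

print(matches)
```

The point of the sketch is that each sum is scoped to one record, so no flattening or join of the nested fields is needed before aggregating.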


Cloudera Impala

• Open-source interactive SQL for Hadoop

• Analytical query processor written in C++ with LLVM code generation
• Optimized to scan tables (best as Parquet format) in HDFS
• SQL front-end and query optimizer / planner
• User-defined function API (C++)
• impyla enables Python UDFs to be compiled with Numba to LLVM IR


Cloudera Impala (cont’d)

• For high performance big data analytics, Impala could be Python’s best friend

• C++/LLVM backend is lower-‐level than SQL

• Nested data support is coming


Some interesting things in recent times


Set point: Hadley Wickham

• R has upped its game with dplyr, tidyr, and other new projects
• New standard for a uniform interface to either in-memory or in-database data processing
• Composable table primitive operations
• Multiple major versions shipped, getting adopted

80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

tbl %>%
  filter(c == 'bar') %>%
  group_by(a, b) %>%
  summarise(metric = mean(d - f)) %>%
  arrange(desc(metric))
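
For readers who do not use R, the same four steps (filter, group, summarise, sort) can be spelled out in standard-library Python. This is only a sketch of the semantics over an invented table; it is not how dplyr or pandas implement them, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table with the column names used in the dplyr snippet.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 2.0},
    {"a": "y", "b": 2, "c": "bar", "d": 20.0, "f": 5.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 0.0},
]

# filter(c == 'bar')
filtered = [row for row in tbl if row["c"] == "bar"]

# group_by(a, b)
groups = defaultdict(list)
for row in filtered:
    groups[(row["a"], row["b"])].append(row["d"] - row["f"])

# summarise(metric = mean(d - f))
summary = [
    {"a": a, "b": b, "metric": mean(values)}
    for (a, b), values in groups.items()
]

# arrange(desc(metric))
result = sorted(summary, key=lambda row: row["metric"], reverse=True)

for row in result:
    print(row)
```

Each step takes a table and returns a table, which is what makes the primitives composable.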


Blaze

• Shares some semantics with dplyr
• Uses a generalized datashape protocol

• Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
• Deferred expression API
• Support for piping data between storage systems
• Multiple backends (pandas, SQL, MongoDB, PySpark, …)
• Growing support for out-of-core analytics
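
The idea of a deferred expression API with multiple backends can be sketched in a few lines. The classes and functions below (`Expr`, `Column`, `Literal`, `BinOp`, `evaluate`, `to_sql`) are hypothetical names for this illustration and are not Blaze's actual API. The sketch only shows that building an expression computes nothing, and that the same tree can later be interpreted in memory or rendered as SQL.

```python
import operator


class Expr:
    """Node in a deferred expression tree; building one computes nothing."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))

    def __gt__(self, other):
        return BinOp(operator.gt, self, wrap(other))


class Column(Expr):
    def __init__(self, name):
        self.name = name


class Literal(Expr):
    def __init__(self, value):
        self.value = value


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


def wrap(value):
    """Turn plain Python values into Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


def evaluate(expr, row):
    """In-memory backend: interpret the tree against a dict-shaped row."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    return expr.op(evaluate(expr.left, row), evaluate(expr.right, row))


SQL_SYMBOLS = {operator.add: "+", operator.mul: "*", operator.gt: ">"}


def to_sql(expr):
    """SQL backend: render the same tree as a SQL fragment."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Literal):
        return repr(expr.value)
    symbol = SQL_SYMBOLS[expr.op]
    return f"({to_sql(expr.left)} {symbol} {to_sql(expr.right)})"


# Building the expression does no work; the backends decide how to run it.
expr = Column("price") * Column("qty") > 100
print(evaluate(expr, {"price": 30, "qty": 4}))
print(to_sql(expr))
```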


libdynd

• Led by Mark Wiebe at Continuum Analytics
• Pure C++11 modern reimagining of NumPy
• Python bindings
• Supports variadic data cells and nested types (datashape protocol)

• Development has focused on the data container design over analytics


PySpark

• Popularity may exceed official Scala API
• Spark was not exactly designed to be an ideal companion to Python
• General architecture
  • Users build Spark deferred expression graphs in Python
  • User-supplied functions are serialized and broadcast around the cluster
  • Spark plans job and breaks work into tasks executed by Python worker jobs
  • Data is managed / shuffled by the Spark Scala master process
  • Python used largely as a black box to transform input to output


PySpark: Some more gory details

• Spark master controlled using py4j
  • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”

• Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)

• Does not natively interface with NumPy (yet)
• But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides

# pass large object by py4j is very slow and need much memory
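
The "Python as a black box" pattern, with data marshalled as pickled batches, can be approximated with the standard library alone. The framing used here (a 4-byte length prefix per batch and `-1` as an end marker) and the function names are assumptions made for this sketch and are not claimed to be PySpark's wire format. In-memory `BytesIO` buffers stand in for the files or sockets between the JVM and the worker.

```python
import io
import pickle
import struct


def write_batches(stream, rows, batch_size):
    """Write rows as length-prefixed pickled batches, then an end marker."""
    for start in range(0, len(rows), batch_size):
        payload = pickle.dumps(rows[start:start + batch_size], protocol=2)
        stream.write(struct.pack(">i", len(payload)))
        stream.write(payload)
    stream.write(struct.pack(">i", -1))  # assumed end-of-stream marker


def read_batches(stream):
    """Yield rows one at a time from a stream written by write_batches."""
    while True:
        (length,) = struct.unpack(">i", stream.read(4))
        if length < 0:
            return
        yield from pickle.loads(stream.read(length))


def worker(in_stream, out_stream, func):
    """Black-box worker: deserialize input, apply func, serialize output."""
    results = [func(row) for row in read_batches(in_stream)]
    write_batches(out_stream, results, batch_size=2)


# Simulate one partition being shipped to a worker and back.
to_worker = io.BytesIO()
write_batches(to_worker, [1, 2, 3, 4, 5], batch_size=2)
to_worker.seek(0)

from_worker = io.BytesIO()
worker(to_worker, from_worker, lambda x: x * x)
from_worker.seek(0)

squared = list(read_batches(from_worker))
print(squared)
```

Every row crosses the boundary twice as a pickled Python object, which gives a feel for where the serialization overhead mentioned on this slide comes from.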


Spartan

• http://github.com/spartan-array/spartan
• Python distributed array expression evaluator (“distributed NumPy”)
• Developed by Russell Power & others at NYU
• Uses ZeroMQ and custom RPC implementation


Things I think we should do

• Create high fidelity data structures for Dremel-style data

• Get serious about Avro, Parquet, and other new data format standards

• Invest in the Python-Impala-LLVM relationship

• Efficient binary protocols to receive and emit data from Python processes
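
One way to picture an "efficient binary protocol" is a fixed-width column sent as raw bytes with a tiny header, rather than as pickled Python objects. The layout below (a one-byte type code plus an element count, followed by the packed values) is invented for this sketch and is not an existing standard. It also assumes both ends share the same native byte order.

```python
import struct
from array import array

# Assumed header layout: 1-byte array type code + 8-byte element count.
HEADER = struct.Struct("<cQ")


def encode_column(values, typecode="d"):
    """Pack a homogeneous numeric column into header + raw native bytes."""
    data = array(typecode, values)
    return HEADER.pack(typecode.encode("ascii"), len(data)) + data.tobytes()


def decode_column(buffer):
    """Rebuild the typed array from a message made by encode_column."""
    typecode, count = HEADER.unpack_from(buffer, 0)
    data = array(typecode.decode("ascii"))
    end = HEADER.size + count * data.itemsize
    data.frombytes(buffer[HEADER.size:end])
    return data


message = encode_column([1.5, 2.5, 4.0])
decoded = decode_column(message)
print(len(message), decoded.tolist())
```

Because the payload is a contiguous block of fixed-width values, the receiver can load it into a typed array in one step instead of constructing one Python object per value.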


Conclusions

• Python + PyData stack is as strong as ever, and still gaining momentum

• The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.

• Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding


Thank you Wes McKinney @wesmckinn [email protected]
