My Data Journey with Python (SciPy 2015 Keynote)
TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
My Data Journey with Python
Wes McKinney (@wesmckinn)
SciPy 2015 Keynote, 2015-07-09
This talk
• 2007-present, from my perspective
• Celebrating our successes
• Challenges and opportunities for the future
My pre-2007 existence
• I was a mathematician!
• No exposure to Python, SQL, R (or any analytics for that matter)
• Rude awakening ahead
My first job: AQR (quant hedge fund)
• A quant finance operation that lived and breathed SQL and Excel
• Production systems in C++, Java, Visual Basic, and C# .NET
• Some PhD-level researchers used MATLAB for research (as was common in finance / economics departments)
Productivity frustrations
• First year: several analytics and statistical data analysis projects
  • A huge amount of SQL
  • Some Java
  • A little bit of R
  • … and TONS of Excel
• Projects felt like 5% conceptualization, 95% tedium
Python in early 2008: different times
• A bleeding edge stack
  • NumPy 1.0.4
  • SciPy 0.6.0
  • matplotlib 0.91.2
  • IPython 0.8.4, SVN history begins 2/2008
  • Cython 0.9.8
• The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
2008: Things SciPythonistas didn’t care too much about
• Relational data or SQL
• Missing data handling (outside numpy.ma)
• Statistics and econometrics (first statsmodels release: 2011)
• Statistical graphics
• Machine learning (scikit-learn 0.1: 2/2010)
• Analytics and business intelligence
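For readers who have not met numpy.ma, the one missing-data tool the list above mentions, here is a minimal sketch of how a masked array skips missing values. The return values and the -999.0 sentinel are made up for illustration and do not come from the talk.

```python
import numpy as np

# Hypothetical daily returns; -999.0 is an assumed sentinel for "missing".
raw = np.array([0.01, -999.0, 0.03, -0.02, -999.0])

# Mask every entry equal to the sentinel so reductions ignore it.
returns = np.ma.masked_values(raw, -999.0)

n_valid = returns.count()     # number of unmasked (observed) entries
mean_valid = returns.mean()   # mean over the observed entries only
filled = returns.filled(0.0)  # plain ndarray with missing entries set to 0.0
```

The mask lives alongside the values, so the missing entries never leak into the mean. The cost is that every consumer has to understand masked arrays, which is part of why missing data felt like an unsolved problem for tabular work at the time.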
Taking a gamble
• Decided to give Python a shot for AQR projects after seeing part of the MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
• proto-pandas first version built in April 2008
• Focused on porting an R project to Python
• May '08: Embedded Python interpreter in a legacy C++ system
• 5/2008 – 12/2008: Skunkworks Python ports and evangelism across company
Why did Python work out?
• Batteries included
• Interoperability with C++
  • Embedding Python interpreter
  • Wrapping C++ in Python C extensions
• Productive user interface
  • Python language
  • IPython + matplotlib
Some other cool things we built
• A global macro risk modeling system (using pandas + NumPy + PyTables)
• A heterogeneous market data loading and cleaning system
• A task-based cluster computing system (similar to Celery)
• Tick data storage and analytics
• Various GUIs with wxPython + matplotlib
End 2009: pandas!
• AQR lets me open source pandas 0.1 on Christmas, 2009.
~/Downloads/pandas-0.1 $ cloc --exclude-ext pandas
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          41           3124           2933           8225
Cython                           7            418             93           1247
C/C++ Header                     1              0              0              1
-------------------------------------------------------------------------------
SUM:                            49           3542           3026           9473
-------------------------------------------------------------------------------
2010 – 2011: Python’s data growing pains
• pandas did not evolve much after its initial release
• No consensus or momentum behind any project for analytics / data wrangling
• AQR -> Duke Statistical Science
• AQR sponsors bug fixes and new features in pandas
May 2011: Getting inspired
• 2011-05-13: Enthought Datarray Summit
  • Discuss how to enable Python to become more useful for statistical computing
  • Me: “Library fragmentation is destructive; integration is better”
  • Data structures, missing data, and data wrangling tools
• 2011-05-23 – 2011-06-03: Python finance consulting engagement
  • Realized that Python data tools were sorely needed in industry
  • But not nearly mature enough yet
Making pandas a better tool
• Consulting at AppNexus (NYC ad tech company) opened eyes to new problems
• June 2011 – December 2012
  • Fix some pandas design issues
  • Build out data wrangling capabilities (hierarchical indexes, etc.)
  • Create “killer apps” (time series capabilities)
  • Evangelize and collaborate with other projects
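As a rough illustration of the two capabilities named above, hierarchical indexes and time series tools, here is a small sketch. It uses the present-day pandas API, which may differ from the 2011–2012 releases; the tickers "AAA" and "BBB" and all prices are hypothetical.

```python
import pandas as pd

# Hypothetical daily prices for two made-up tickers over four days.
dates = pd.date_range("2012-01-02", periods=4, freq="D")
index = pd.MultiIndex.from_product(
    [["AAA", "BBB"], dates], names=["ticker", "date"]
)
prices = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0], index=index
)

# Hierarchical indexing: pick one ticker, leaving a date-indexed Series.
aaa = prices.loc["AAA"]

# Reshaping: pivot the "ticker" level into columns (one column per ticker).
wide = prices.unstack("ticker")

# Time series: downsample the daily data to two-day means.
two_day = wide.resample("2D").mean()
```

The same Series can be viewed per ticker, as a wide table, or at a coarser frequency without any manual bookkeeping of keys and dates.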
Making a book happen
• A chicken-and-egg problem
• Fernando Pérez, Brian Granger, and John Hunter had been toying with the idea of a “SciPy Book” for a couple years
• Decided to forge my own path in Nov 2011
• Writing took about 9 months
• Helped motivate me to “finish” parts of pandas
• ~ 50,000 copies in circulation
Clarity and software engineering
• Progress in software not just about hard work
• Solving the right problems
• … in the right order
• … while wasting little time/energy on non-impactful issues
• … while being faced with real world concerns (80/20 rule)
• Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
It took a village
• Fernando Pérez & Brian Granger (IPython)
• Skipper Seabold & Josef Perktold (statsmodels)
• Eric Jones (Enthought)
• Travis Oliphant & Peter Wang (Enthought & Continuum)
• John Hunter (matplotlib)
• … and many others
Seatmate: “Are you a programmer?” (he saw my Emacs buffers)
Wes: “Cool, well, there’s this awesome new thing called the IPython notebook”
My seatmate was computational bio professor and 5-year PSF member Titus Brown
And he would later assist the IPython team in their Sloan Foundation $1mm grant in 2012
Business ventures 2012 – 2014
• 2012: Lambda Foundry
  • Support and develop pandas
  • Explored creating a commercial Python financial toolkit
• 2013 – 2014: DataPad
  • “Google Drive for Analytics / BI”
  • With Chang She (MIT -> AQR -> pandas)
  • Silicon Valley VC-backed
  • Acquired by Cloudera in September 2014
Cloudera
• Sort of “the Red Hat of Big Data”
• The leading open source Hadoop platform
• Supporting and developing a little over 20 Apache-licensed open source projects
• A dream job
  • Full time open source development
  • Solving hard data problems faced by the world’s largest companies
• P.S. we’re hiring engineers in Austin + Bay Area
What I’m interested in right now
• Ways to enable collaboration on data tools across programming languages
• Domain specific language design and compilation
• Improving the Python-on-Hadoop experience
• LLVM + code generation
Different kinds of Big Data
• Python programmers have been dealing with big scientific data in HPC settings for years
• Big…
  • Text data
  • Homogeneous array data
  • Tabular (structured) data
  • JSON-like (semi-structured) data
The Great Data Tool Decoupling™
• Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
• In fact, you should really want this to happen
  • Share systems among languages
  • Reduce fragmentation and “lock-in”
  • Shift developer focus to usability
• Prediction: we’ll be there by 2025; sooner if we all get our act together