ibis: scaling the python data experience
TRANSCRIPT
1 © Cloudera, Inc. All rights reserved.
Ibis: Scaling the Python Data Experience
Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus
Wes McKinney
• A key person in building today’s open source Python data community
• Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
• Author of the best-selling canonical text Python for Data Analysis (2012)
• Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)
Python is popular…
• Python has become a standard language of data science
• Why is it popular?
  • Maximizes productivity for data engineers and data scientists
  • Build robust software and do interactive data analysis with 100% Python code
  • Easy to learn, and makes for happy, productive data teams
  • Large, diverse open source development community
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Main use case: a Swiss Army knife for data science and engineering on small-to-medium-sized data
…but Python does not scale today
• Python ecosystem confined to single-node analysis
  • Great for smaller data sets
  • Requires sampling or aggregations for larger data
  • Distributed tools compromise in various ways
• Extracting samples or aggregations for larger data means:
  • “Scales” by losing more fidelity
  • Additional ETL overhead to extract samples/aggregations
  • Loss of productivity with multiple languages, tools, etc.
  • Blocks certain analyses and use cases
Ibis: Same Python, now at scale
• Target user:
  • Data scientists and data engineers (“Python data users”)
• Goals:
  • Mirrors the single-node Python experience
  • Scales to any number of nodes and any data size
  • No compromise in functionality or usability
  • Interactive experience at native hardware speeds
What’s announced?
• First public release of Ibis
  • http://ibis-project.org
• Beta release to Cloudera Labs
  • Inviting usage and community development
  • Apache-licensed open source
Ibis’s Vision
• Uncompromised Python experience
  • 100% Python end-to-end user workflows
  • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
• Interactive at big data scale
  • Full-fidelity analysis without extractions
  • Scalability for big data
  • Native hardware speeds for a broad set of use cases
Advantages of our approach
• Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
• Full-fidelity data access
• Familiar Python experience and integration with existing Python data libraries
• Lets Python high-performance computing tools be used at Hadoop scale
Beta 0.3 release
• High-level Python API for describing analytics and ETL that can be executed by Impala
  • Familiar API for users of pandas
  • Comprehensive coverage of operations expressible as relational data flows
• Integrated tools for managing data in HDFS
• Simple workflows to query data files in several formats (Parquet, Avro, Text)
• pandas data interchange
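The idea of a Python API that describes analytics for a SQL engine such as Impala to execute can be sketched roughly as follows. This is a toy illustration only: `TableExpr` and its methods are hypothetical names invented here, not the Ibis API, and the `events` table and its columns are made up. The point it tries to show is that the Python object only records what to compute, and a separate compile step turns that description into a query string that an engine could run.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableExpr:
    """Hypothetical deferred query description (NOT the Ibis API).

    Each method returns a new expression; no data is touched until the
    compiled SQL is handed to an engine.
    """
    name: str
    predicates: tuple = ()
    group_keys: tuple = ()
    aggregates: tuple = ()  # (alias, sql_expression) pairs

    def filter(self, predicate):
        # Record a row filter as a raw SQL predicate string.
        return replace(self, predicates=self.predicates + (predicate,))

    def group_by(self, *keys):
        # Record the grouping columns.
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, **metrics):
        # metrics maps an output column name to a SQL aggregate expression.
        return replace(self, aggregates=self.aggregates + tuple(metrics.items()))

    def compile(self):
        # Turn the recorded description into a single SELECT statement.
        select_list = list(self.group_keys)
        select_list += [f"{expr} AS {alias}" for alias, expr in self.aggregates]
        columns = ", ".join(select_list) if select_list else "*"
        sql = f"SELECT {columns} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Build a description in Python, then compile it for the engine.
expr = (
    TableExpr("events")
    .filter("year = 2015")
    .group_by("country")
    .aggregate(total="SUM(amount)")
)
query = expr.compile()
print(query)
```

Because the expression is only a description, the same Python code could in principle target data of any size; the work happens wherever the compiled query is executed.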
Ibis/Impala Joint Roadmap
• More natural data modeling
  • Complex types support
• Integration with the full Python data ecosystem
  • Advanced analytics + machine learning
  • Enable use of high-performance computing tools
• User extensibility with native performance
  • In-memory columnar format
  • Python-to-LLVM IR compilation
• Workflow and usability tools
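As a rough illustration of what an in-memory columnar format means, the sketch below pivots row records into one contiguous typed array per column using only the standard library. The records, column names, and the `to_columns` helper are invented for this example; it does not represent the format planned for Ibis or Impala, only the general layout idea.

```python
from array import array

# Row-oriented records: each dict holds all fields of one row.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 0.5},
]


def to_columns(records):
    """Pivot row records into one contiguous typed array per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in records)),  # signed 64-bit ints
        "amount": array("d", (r["amount"] for r in records)),    # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate over one column scans a single contiguous buffer
# and never touches the other columns.
total_amount = sum(columns["amount"])
print(total_amount)
```

Keeping each column in its own homogeneous buffer is what lets compiled code scan it without per-row Python object overhead.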
Benefits of Ibis
• Maximize developer productivity
  • Mirrors single-node Python experience
  • Solve big data problems without leaving Python
  • Leverage Python skills, ecosystem, and tools
• Python as a first-class language for Hadoop
  • Full-fidelity analysis without extractions
  • Python analysis at any scale
  • Native hardware speeds for a broad set of use cases
Thank you [email protected]