Next-generation Python Big Data Tools, powered by Apache Arrow
TRANSCRIPT
![Page 1: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/1.jpg)
1 © Cloudera, Inc. All rights reserved.
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney @wesmckinn
SF Big Analytics Meetup, 2016-04-05
![Page 2: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/2.jpg)
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
![Page 3: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/3.jpg)
In process: Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
![Page 4: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/4.jpg)
Python + Big Data: The State of things
• See “Python and Apache Hadoop: A State of the Union” from February 17
• Areas where much more work needed
  • Binary file format read/write support (e.g. Parquet files)
  • File system libraries (HDFS, S3, etc.)
  • Client drivers (Spark, Hive, Impala, Kudu)
  • Compute system integration (Spark, Impala, etc.)
![Page 5: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/5.jpg)
Apache Arrow
Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
![Page 6: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/6.jpg)
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
![Page 7: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/7.jpg)
Apache Arrow: What is it?
• http://arrow.apache.org
• Not a piece of software, exactly!
• A standardized in-memory representation for columnar data
• Enables
  • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
  • Cheap data interchange amongst systems, little or no serialization
  • Flexible support for complex JSON-like data
• Targets: Impala, Kudu, Parquet, Spark
![Page 8: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/8.jpg)
Focus on CPU Efficiency
Figure: the same four-row table stored in a Traditional Memory Buffer (row by row) and in an Arrow Memory Buffer (column by column).

| | session_id | timestamp | source_ip |
|---|---|---|---|
| Row 1 | 1331246660 | 3/8/2012 2:44PM | 99.155.155.225 |
| Row 2 | 1331246351 | 3/8/2012 2:38PM | 65.87.165.114 |
| Row 3 | 1331244570 | 3/8/2012 2:09PM | 71.10.106.181 |
| Row 4 | 1331261196 | 3/8/2012 6:46PM | 76.102.156.138 |

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
• With minimal structure overhead
• Operate directly on columnar compressed data
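
To make the row-versus-column contrast concrete, here is a minimal sketch using only the Python standard library. It is not Arrow itself: `array.array` merely stands in for a contiguous, fixed-width column buffer, and the variable names are illustrative assumptions.

```python
from array import array

# Sample rows from the slide: (session_id, timestamp, source_ip)
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented layout: values from different columns are interleaved,
# so scanning one column has to step through every whole record.
row_max = max(record[0] for record in rows)

# Column-oriented layout: each column lives in its own sequence.
# session_id becomes one contiguous buffer of 64-bit integers.
session_id = array("q", (record[0] for record in rows))
timestamp = [record[1] for record in rows]
source_ip = [record[2] for record in rows]

# Scanning a column now touches only that column's memory.
col_max = max(session_id)
buffer_bytes = session_id.itemsize * len(session_id)
```

Both scans return the same answer; the difference is that the columnar scan reads one dense buffer, which is the property behind the cache-locality and vectorization bullets.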
![Page 9: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/9.jpg)
High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
Figure (With Arrow): Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark all connect to a shared Arrow Memory.
Figure (Today): Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data pairwise, with a Copy & Convert step on each link.
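
As a rough analogy for the two pictures, the sketch below contrasts a serialize-and-copy handoff with a handoff where the consumer views the producer's bytes directly. It uses only the standard library (`pickle`, `memoryview`); neither is part of Arrow, and the example only assumes that both sides agree on the buffer layout.

```python
import pickle
from array import array

column = array("q", [1331246660, 1331246351, 1331244570, 1331261196])

# "Today": the producer serializes, the consumer deserializes.
# The consumer ends up with an independent copy of the data.
payload = pickle.dumps(column)
copied = pickle.loads(payload)
copied[0] = 0  # changes the copy only

# Shared-format idea: the consumer looks at the very same bytes.
# A memoryview exposes the array's buffer without copying it.
view = memoryview(column)
view[1] = 42  # visible through `column`, because nothing was copied
shared_bytes = view.nbytes
view.release()
```

The pickle path costs CPU time proportional to the data size in both directions; the view path costs nothing beyond agreeing on the layout, which is the point of a common memory format.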
![Page 10: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/10.jpg)
Big Data Systems: Poor Python IO performance
http://wesmckinney.com/blog/pandas-and-apache-arrow/
![Page 11: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/11.jpg)
Real World Example: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance

Figure: a Feather file consists of Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers).
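
The layout in the figure (column arrays first, metadata at the end) can be imitated with a toy format. The sketch below is not the Feather format: the function names are hypothetical, JSON stands in for the flatbuffers metadata, and the trailing 4-byte footer length is an assumption made purely for illustration.

```python
import json
import struct
from array import array


def write_toy_file(columns):
    """Serialize {name: array} as raw column bytes followed by a metadata footer."""
    body = bytearray()
    meta = []
    for name, col in columns.items():
        # Record where this column starts, then append its raw bytes.
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": len(body), "length": len(col)})
        body += col.tobytes()
    footer = json.dumps(meta).encode("utf-8")
    # Footer goes last, followed by its own length so a reader can find it.
    return bytes(body) + footer + struct.pack("<I", len(footer))


def read_toy_column(data, name):
    """Read a single column by consulting the footer, skipping all other columns."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    meta = json.loads(data[-4 - footer_len:-4].decode("utf-8"))
    for entry in meta:
        if entry["name"] == name:
            col = array(entry["typecode"])
            nbytes = entry["length"] * col.itemsize
            col.frombytes(data[entry["offset"]:entry["offset"] + nbytes])
            return col
    raise KeyError(name)


table = {
    "session_id": array("q", [1331246660, 1331246351, 1331244570, 1331261196]),
    "score": array("d", [0.5, 1.5, 2.5, 3.5]),
}
blob = write_toy_file(table)
restored = read_toy_column(blob, "score")
```

Because the column bytes are written exactly as they sit in memory, reading a column is little more than locating an offset and copying bytes, which is why read speed can approach disk IO speed.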
![Page 12: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/12.jpg)
Real World Example: Feather File Format for Python and R
R:

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python:

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)
![Page 13: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/13.jpg)
Apache Parquet: Binary columnar storage format
• I just became a Parquet committer!
• github.com/apache/parquet-cpp
• Python users will soon be able to read Parquet files via PyArrow
• parquet-cpp <-> PyArrow <-> pandas
![Page 14: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/14.jpg)
Language Bindings
• Target Languages
  • Java (beta)
  • CPP (underway)
  • Python & Pandas (underway)
  • R
  • Julia
• Initial Focus
  • Read a structure
  • Write a structure
  • Manage Memory
![Page 15: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/15.jpg)
pandas and Arrow in context
![Page 16: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/16.jpg)
RPC & IPC: Moving Data Between Systems

RPC
• Avoid Serialization & Deserialization
• Layer TBD: Focused on supporting vectored io
  • Scatter/gather reads/writes against socket
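
For readers unfamiliar with vectored IO, here is a minimal sketch of a gather write and a scatter read over a socket, using the standard library's `socket.sendmsg` and `socket.recvmsg_into`. It assumes a Unix-like platform, and the 4-byte length header is a made-up framing for the example, not anything defined by Arrow (whose RPC layer the slide marks as TBD).

```python
import socket
import struct
from array import array

column = array("q", [1331246660, 1331246351, 1331244570, 1331261196])
header = struct.pack("<I", len(column))  # hypothetical framing: element count

sender, receiver = socket.socketpair()
try:
    # Gather write: two separate buffers go out in one call,
    # without first concatenating them into a temporary copy.
    sent = sender.sendmsg([header, column])

    # Scatter read: the incoming bytes land directly in two
    # preallocated buffers, header first and column data second.
    recv_header = bytearray(4)
    recv_column = array("q", [0] * len(column))
    nbytes, _ancdata, _flags, _addr = receiver.recvmsg_into(
        [recv_header, recv_column]
    )
finally:
    sender.close()
    receiver.close()

(count,) = struct.unpack("<I", recv_header)
```

The column bytes travel from the sender's buffer to the receiver's buffer with no intermediate serialized object on either side.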

IPC
• Alpha implementation using memory mapped files
  • Moving data between Python and Drill
• Working on shared allocation approach
  • Shared reference counting and well-defined ownership semantics
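
The memory-mapped-file approach can be sketched with the standard library's `mmap` module. This is only an illustration of the general technique, not the alpha implementation mentioned on the slide: a "producer" writes raw column bytes to a temporary file, and a "consumer" maps the file and reinterprets the bytes in place, assuming both sides agree on a 64-bit integer layout.

```python
import mmap
import os
import tempfile
from array import array

column = array("q", [1331246660, 1331246351, 1331244570, 1331261196])

# Producer: write the column's raw bytes to a file.
fd, path = tempfile.mkstemp(suffix=".bin")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(column.tobytes())

    # Consumer: map the file and view the bytes as 64-bit integers.
    # No parsing or deserialization step is involved.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            with memoryview(mapped) as raw, raw.cast("q") as view:
                values = view.tolist()
                total = sum(view)
finally:
    os.remove(path)
```

In a real two-process setup the producer and consumer would be different programs mapping the same file; the views are released before the mapping closes, which hints at why ownership semantics need to be well defined.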
![Page 17: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/17.jpg)
Executing data science languages in the compute layer
UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Python, R, Julia, …?
![Page 18: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/18.jpg)
Real World Example: Python With Spark, Drill, Impala
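Figure: a SQL Engine passes each input partition (in partition 0 … in partition n - 1) as input to user-supplied Python code, one Python function call per partition; each function's output becomes an output partition (out partition 0 … out partition n - 1) that is returned to the SQL Engine.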
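
The data flow in the figure can be sketched as a per-partition function call. Everything here is hypothetical: `run_udf` stands in for whatever the engine (Spark, Drill, Impala) would do, and `add_offset` is an arbitrary user function; the point is only that the function receives a whole column chunk at a time rather than one value per call.

```python
from array import array


def run_udf(partitions, func):
    """Hypothetical engine loop: call the user function once per input partition."""
    return [func(part) for part in partitions]


def add_offset(values):
    """User-supplied code: operates on an entire partition in one call."""
    return array("q", (v + 3600 for v in values))


in_partitions = [
    array("q", [1331246660, 1331246351]),
    array("q", [1331244570, 1331261196]),
]
out_partitions = run_udf(in_partitions, add_offset)
```

If input and output partitions share one memory format with the engine, the handoff on each side of the Python function needs no conversion step.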
![Page 19: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/19.jpg)
What’s Next
• Parquet for Python & C++
  • Using Arrow as intermediary
• Available IPC Implementation
• Spark, Drill Integration
  • Faster UDFs, Storage interfaces
![Page 20: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/20.jpg)
Apache Arrow in practice
![Page 21: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/21.jpg)
Get Involved
• Join the community
  • [email protected]
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
![Page 22: Next-generation Python Big Data Tools, powered by Apache Arrow](https://reader033.vdocuments.site/reader033/viewer/2022042907/587adcf11a28ab542b8b59a5/html5/thumbnails/22.jpg)
Thank you
Wes McKinney @wesmckinn
Views are my own