strata ny 2017 parquet arrow roadmap
TRANSCRIPT
![Page 1: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/1.jpg)
The columnar roadmap:
Apache Parquet and Apache ArrowJulien Le Dem, VP Apache Parquet, Apache Arrow PMC
![Page 2: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/2.jpg)
• Creator of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet
• Formerly Tech Lead at Twitter on Data Platforms.
Julien Le Dem@J_ Julien
![Page 3: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/3.jpg)
Agenda
• Community Driven Standard
• Benefits of Columnar representation
• Vertical integration: Parquet to Arrow
• Arrow based communication
![Page 4: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/4.jpg)
Community Driven Standard
![Page 5: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/5.jpg)
An open source standard
• Parquet: Common need for on disk columnar.• Arrow: Common need for in memory columnar.• Arrow is building on the success of Parquet.• Top-level Apache project• Standard from the start:
– Members from 13+ major open source projects involved
• Benefits:– Share the effort– Create an ecosystem
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
![Page 6: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/6.jpg)
Interoperability and EcosystemBefore With Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Functionality duplication and unnecessary conversions
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (eg:Parquet-to-Arrow reader)
![Page 7: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/7.jpg)
Benefits of Columnar representation
![Page 8: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/8.jpg)
Columnar layout
Logical table
representation
Row layout
Column layout
@EmrgencyKittens
![Page 9: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/9.jpg)
On Disk and in Memory
• Different trade offs– On disk: Storage.
• Accessed by multiple queries.
• Priority to I/O reduction (but still needs good CPU throughput).
• Mostly Streaming access.
– In memory: Transient.• Specific to one query execution.
• Priority to CPU throughput (but still needs good I/O).
• Streaming and Random access.
![Page 10: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/10.jpg)
Parquet on disk columnar format
![Page 11: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/11.jpg)
Parquet on disk columnar format
• Nested data structures
• Compact format: – type aware encodings
– better compression
• Optimized I/O:– Projection push down (column pruning)
– Predicate push down (filters based on stats)
![Page 12: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/12.jpg)
Parquet nested representation
Document
DocId Links Name
Backward Forward Language Url
Code Country
Columns:docidlinks.backwardlinks.forwardname.language.codename.language.countryname.url
Borrowed from the Google Dremel paper
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
![Page 13: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/13.jpg)
Access only the data you need
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
a b c
a1 b1 c1
a2 b2 c2
a3 b3 c3
a4 b4 c4
a5 b5 c5
+ =
Columnar StatisticsRead only the data you need!
![Page 14: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/14.jpg)
Parquet file layout
![Page 15: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/15.jpg)
Arrow in memory columnar format
![Page 16: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/16.jpg)
Arrow goals
• Well-documented and cross language compatible
• Designed to take advantage of modern CPU
• Embeddable
– in execution engines, storage layers, etc.
• Interoperable
![Page 17: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/17.jpg)
Arrow in memory columnar format
• Nested Data Structures
• Maximize CPU throughput
– Pipelining
– SIMD
– cache locality
• Scatter/gather I/O
![Page 18: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/18.jpg)
CPU pipeline
![Page 19: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/19.jpg)
Columnar data
persons = [{name: ’Joe',age: 18,phones: [
‘555-111-1111’, ‘555-222-2222’
] }, {
name: ’Jack',age: 37,phones: [ ‘555-333-3333’ ]
}]
![Page 20: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/20.jpg)
Record Batch Construction
Schema Negotiation
Dictionary Batch
Record Batch
Record Batch
Record Batch
name (offset)
name (data)
age (data)
phones (list offset)
phones (data)
data header (describes offsets into data)
name (bitmap)
age (bitmap)
phones (bitmap)
phones (offset)
{name: ’Joe',age: 18,phones: [
‘555-111-1111’, ‘555-222-2222’
] }
Each box (vector) is contiguous memory The entire record batch is contiguous on wire
![Page 21: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/21.jpg)
Vertical integration: Parquet to Arrow
![Page 22: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/22.jpg)
Representation comparison for flat schema
![Page 23: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/23.jpg)
Naïve conversion
![Page 24: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/24.jpg)
Naïve conversion Data dependent branch
![Page 25: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/25.jpg)
Peeling away abstraction layersVectorized read
(They have layers)
![Page 26: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/26.jpg)
Bit packing case
![Page 27: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/27.jpg)
Bit packing case No branch
![Page 28: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/28.jpg)
Run length encoding case
![Page 29: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/29.jpg)
Run length encoding case
Block of defined values
![Page 30: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/30.jpg)
Run length encoding case
Block of null values
![Page 31: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/31.jpg)
Predicate push down
![Page 32: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/32.jpg)
Example: filter and projection
![Page 33: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/33.jpg)
Naive filter and projection implementation
![Page 34: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/34.jpg)
Peeling away abstractions
![Page 35: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/35.jpg)
Arrow based communication
![Page 36: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/36.jpg)
Universal high performance UDFs
SQL engine
Pythonprocess
User defined function
SQLOperator
1
SQLOperator
2
readsreads
![Page 37: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/37.jpg)
Arrow RPC/REST API
• Generic way to retrieve data in Arrow format
• Generic way to serve data in Arrow format
• Simplify integrations across the ecosystem
• Arrow based pipe
![Page 38: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/38.jpg)
RPC: arrow based storage interchange
The memory representation is sent over the wire.
No serialization overhead.
Scanner
projection/predicate push down
Operator
Arrow batches
Storage
Mem Disk
SQLexecution
Scanner Operator
Scanner Operator
Storage
Mem Disk
Storage
Mem Disk
…
![Page 39: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/39.jpg)
RPC: arrow based cache
The memory representation is sent over the wire.
No serialization overhead.
projection push down
Operator
Arrow-basedCache
SQLexecution
Operator
Operator
…
![Page 40: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/40.jpg)
RPC: Single system execution
The memory representation is sent over the wire.
No serialization overhead.
Scanner
Scanner
Scanner
Parquet files
projection push downread only a and b
Partial Agg
Partial Agg
Partial Agg
Agg
Agg
Agg
ShuffleArrow batches
Result
![Page 41: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/41.jpg)
Results- PySpark Integration:
53x speedup (IBM spark work on SPARK-13534)http://s.apache.org/arrowresult1
- Streaming Arrow Performance7.75GB/s data movementhttp://s.apache.org/arrowresult2
- Arrow Parquet C++ Integration4GB/s readshttp://s.apache.org/arrowresult3
- Pandas Integration9.71GB/shttp://s.apache.org/arrowresult4
![Page 42: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/42.jpg)
Language Bindings
Parquet
• Target Languages– Java
– CPP
– Python & Pandas
• Engines integration:– Many!
Arrow
• Target Languages– Java
– CPP, Python
– R (underway)
– C, Ruby, JavaScript
• Engines integration:– Drill
– Pandas, R
– Spark (underway)
![Page 43: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/43.jpg)
Arrow Releases
237
131
76
17
62
22 34
178195
311
89
138
93
137
October 10, 2016 February 18,2017
May 5, 2017 May 22, 2017 July 23, 2017 August 14, 2017 September 17,2017
0.1.0 0.2.0 0.3.0 0.4.0 0.5.0 0.6.0 0.7.0
Days Changes
![Page 44: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/44.jpg)
Current activity:
• Spark Integration (SPARK-13534)
• Pages index in Parquet footer (PARQUET-922)
• Arrow REST API (ARROW-1077)
• Bindings:
– C, Ruby (ARROW-631)
– JavaScript (ARROW-541)
![Page 45: Strata NY 2017 Parquet Arrow roadmap](https://reader033.vdocuments.site/reader033/viewer/2022042619/5a6499b07f8b9a4c568b4e01/html5/thumbnails/45.jpg)
Get Involved
• Join the community
– dev@{arrow,parquet}.apache.org
– Slack:
• Arrow: https://apachearrowslackin.herokuapp.com/
• Parquet: https://parquet-slack-invite.herokuapp.com/
– http://{arrow,parquet}.apache.org
– Follow @Apache{Parquet,Arrow}