Parquet at Strata/Hadoop World, New York 2013

DESCRIPTION
Presentation of Parquet at Strata/Hadoop World 2013 in New York City.

TRANSCRIPT
Parquet: Columnar storage for the people

Julien Le Dem (@J_), processing tools lead, analytics infrastructure at Twitter
Nong Li ([email protected]), software engineer, Cloudera Impala
http://parquet.io
Outline

• Context from various companies
• Results in production and benchmarks
• Format deep-dive
Twitter Context

• Twitter’s data:
  • 230M+ monthly active users generating and consuming 500M+ tweets a day
  • 100TB+ a day of compressed data
  • Scale is huge: instrumentation, user graph, derived data, ...
• Analytics infrastructure:
  • Several 1K+ node Hadoop clusters
  • Log collection pipeline
  • Processing tools

(Illustration: "The Parquet Planers" by Gustave Caillebotte)
Twitter’s use case

• Logs available on HDFS
• Thrift to store logs
• Example: one schema has 87 columns, up to 7 levels of nesting.

```thrift
struct LogEvent {
  1: optional logbase.LogBase log_base
  2: optional i64 event_value
  3: optional string context
  4: optional string referring_event
  ...
  18: optional EventNamespace event_namespace
  19: optional list<Item> items
  20: optional map<AssociationType, Association> associations
  21: optional MobileDetails mobile_details
  22: optional WidgetDetails widget_details
  23: optional map<ExternalService, string> external_ids
}

struct LogBase {
  1: string transaction_id,
  2: string ip_address,
  ...
  15: optional string country,
  16: optional string pid,
}
```
Goal

To have state-of-the-art columnar storage available across the Hadoop platform.

• Hadoop is very reliable for big, long-running queries, but it is also IO-heavy.
• Incrementally take advantage of column-based storage in existing frameworks.
• Not tied to any particular framework.
Columnar Storage

• Limits IO to the data actually needed: loads only the columns that need to be accessed.
• Saves space:
  • Columnar layout compresses better.
  • Type-specific encodings.
• Enables vectorized execution engines.

(Image credit: @EmrgencyKittens)
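The IO benefit in the first bullet can be made concrete with a small back-of-the-envelope sketch (my own illustration, not from the talk): model a table of fixed-size values and count the bytes a scan must read under each layout.

```python
# Hypothetical model: why a columnar layout limits IO for projections.

VALUE_SIZE = 8  # assume every encoded value takes 8 bytes

def bytes_read_row_layout(n_rows, n_cols, n_projected):
    # Row-oriented storage interleaves all columns, so a scan reads
    # every column of every row even if the query projects only a few.
    return n_rows * n_cols * VALUE_SIZE

def bytes_read_columnar_layout(n_rows, n_cols, n_projected):
    # Columnar storage lays each column out contiguously, so a scan
    # reads only the projected columns.
    return n_rows * n_projected * VALUE_SIZE

# A 30-column table scanned for a single column, the shape of the
# Twitter access-log example later in the deck:
rows, cols = 1_000_000, 30
row_bytes = bytes_read_row_layout(rows, cols, 1)      # 240,000,000
col_bytes = bytes_read_columnar_layout(rows, cols, 1) # 8,000,000
```

Under this model a single-column scan over 30 columns touches 1/30th of the data; real savings depend on value sizes, encodings and compression.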
Collaboration between Twitter and Cloudera:

• Common file format definition:
  • Language independent
  • Formally specified
• Implementation in Java for MapReduce: https://github.com/Parquet/parquet-mr
• C++ and code generation in Cloudera Impala: https://github.com/cloudera/impala
Results in Impala

(Chart: size in GB of the TPC-H lineitem table at 1TB scale factor, by storage format)
Impala query times on TPC-DS

(Chart: wall-clock seconds, 0 to 500, for queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79 and Q96; series: Text, Seq w/ Snappy, RC w/ Snappy, Parquet w/ Snappy)
Criteo: The Context

• Billions of new events per day
• ~60 columns per log
• Heavy analytic workload
• BI analysts using Hive and RCFile
• Frequent schema modifications
• A perfect use case for Parquet + Hive!
Parquet + Hive: Basic Requirements

• MapReduce compatibility, since Hive runs on it.
• Correctly handle evolving schemas across Parquet files.
• Read only the columns used by the query, to minimize data read.
• Interoperability with other execution engines (e.g. Pig, Impala).
Performance of Hive 0.11 with Parquet vs ORC

(Chart: total CPU seconds, 0 to 20000, for queries q7, q8, q19, q34, q42, q43, q46, q52, q55, q59, q63, q65, q68, q73, q79, q89 and q98; series: orc-snappy, parquet-snappy)

TPC-DS scale factor 100. All jobs calibrated to run ~50 mappers.
Nodes: 2 x 6 cores, 96 GB RAM, 14 x 3TB disks.

Size relative to text:
• orc-snappy: 35%
• parquet-snappy: 33%
Twitter: production results

• Space saving: 30% using the same compression algorithm.
• Scan + assembly time compared to the original:
  • One column: 10%
  • All columns: 110%

Data converted: similar to access logs, 30 columns.
Original format: Thrift binary in block-compressed files (LZO).
New format: Parquet (LZO).
(Charts: scan time by number of columns read (1 and 30) and total space used, Thrift vs Parquet, as percentages relative to Thrift)
Production savings at Twitter

• Petabytes of storage saved.
• Example jobs taking advantage of projection push down:
  • Job 1 (Pig): reading 32% less data => 20% task time saving.
  • Job 2 (Scalding): reading 14 out of 35 columns, i.e. 80% less data => 66% task time saving.
• Terabytes of scanning saved every day.
Format

• Row group: a group of rows in columnar format.
  • Max size buffered in memory while writing.
  • One (or more) per split while reading.
  • Roughly: 50 MB < row group < 1 GB
• Column chunk: the data for one column in a row group.
  • Column chunks can be read independently for efficient scans.
• Page: unit of access in a column chunk.
  • Should be big enough for compression to be efficient.
  • Minimum size to read to access a single record (when index pages are available).
  • Roughly: 8 KB < page < 1 MB
(Diagram: each row group contains one column chunk per column (a, b, c); each column chunk is divided into pages: Page 0, Page 1, Page 2, ...)
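The row group, column chunk and page hierarchy can be sketched as a small, hypothetical layout planner (my own illustration, not Parquet's actual writer, which sizes groups by bytes rather than row counts):

```python
# Hypothetical sketch: map a table onto row groups -> column chunks -> pages.

def plan_layout(table, rows_per_group, values_per_page):
    """table: dict of column name -> list of values, all the same length.
    Returns a list of row groups; each row group maps a column name to its
    column chunk, represented as a list of pages (lists of values)."""
    n_rows = len(next(iter(table.values())))
    row_groups = []
    for start in range(0, n_rows, rows_per_group):
        # One column chunk per column, holding this group's slice of rows.
        chunk = {col: vals[start:start + rows_per_group]
                 for col, vals in table.items()}
        # Split each column chunk into fixed-size pages.
        row_groups.append({
            col: [vals[i:i + values_per_page]
                  for i in range(0, len(vals), values_per_page)]
            for col, vals in chunk.items()
        })
    return row_groups

table = {"a": [1, 2, 3, 4, 5], "b": ["x", "y", "z", "u", "v"]}
layout = plan_layout(table, rows_per_group=4, values_per_page=2)
```

With 5 rows, 4 rows per group and 2 values per page this yields two row groups, the first holding two pages per column chunk, mirroring the diagram above.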
Format

• Layout: row groups in columnar format. A footer contains the column chunks' offsets and the schema.
• Language independent: well-defined format, supported by Hadoop and Cloudera Impala.
Nested record shredding/assembly

• Algorithm borrowed from the column IO in Google's Dremel.
• Each cell is encoded as a triplet: repetition level, definition level, value.
• Level values are bounded by the depth of the schema, so they can be stored in a compact form.

Schema:

```
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
}
```

Record:

```
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
```

| Column         | Max rep. level | Max def. level |
|----------------|----------------|----------------|
| DocId          | 0              | 0              |
| Links.Backward | 1              | 2              |
| Links.Forward  | 1              | 2              |

| Column         | Value | R | D |
|----------------|-------|---|---|
| DocId          | 20    | 0 | 0 |
| Links.Backward | 10    | 0 | 2 |
| Links.Backward | 30    | 1 | 2 |
| Links.Forward  | 80    | 0 | 2 |
(Diagram: the Document schema tree (DocId, and Links with children Backward and Forward), with the record's values 20, 10, 30 and 80 mapped onto the corresponding leaves)
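The triplet table above can be reproduced with a simplified shredder (my own sketch for this one schema shape, not the production code): a repeated leaf under an optional group has max repetition level 1 and max definition level 2.

```python
# Simplified sketch: shred one record's repeated field (e.g. Links.Backward)
# into (repetition level, definition level, value) triplets.

MAX_DEF = 2  # optional Links (+1) containing a repeated leaf (+1)

def shred_repeated(values):
    """values: list of leaf values; [] if Links exists but the list is
    empty; None if Links itself is missing from the record."""
    if values is None:
        return [(0, 0, None)]       # Links undefined: D = 0
    if not values:
        return [(0, 1, None)]       # Links defined, list empty: D = 1
    triplets = []
    for i, v in enumerate(values):
        r = 0 if i == 0 else 1      # R = 0 starts the record, R = 1 repeats
        triplets.append((r, MAX_DEF, v))
    return triplets
```

For the record above, `shred_repeated([10, 30])` gives the Backward rows of the table and `shred_repeated([80])` the Forward row.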
Repetition level

Schema:

```
message nestedLists {
  repeated group level1 {
    repeated string level2;
  }
}
```

Records:

```
[[a, b, c], [d, e, f, g]]
[[h], [i, j]]
```

Columns:

```
Level: 0,2,2,1,2,2,2,0,1,2
Data:  a,b,c,d,e,f,g,h,i,j
```

(Diagram: repetition level 0 marks a new record, 1 a new level1 entry, 2 a new level2 entry)

More details: https://blog.twitter.com/2013/dremel-made-simple-with-parquet
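The level column above can be computed mechanically; here is a sketch of my own for this two-level schema (the general algorithm walks the schema tree, as described in the blog post linked above):

```python
# Sketch: compute repetition levels for a two-level nested list.
# 0 starts a record, 1 starts a new level1 entry, 2 continues a level2 list.

def repetition_levels(record):
    """record: a list of level1 entries, each a list of level2 values."""
    levels, data = [], []
    for i, level1 in enumerate(record):
        for j, value in enumerate(level1):
            if i == 0 and j == 0:
                levels.append(0)    # first value of the record
            elif j == 0:
                levels.append(1)    # new level1 entry
            else:
                levels.append(2)    # next value within the same level2 list
            data.append(value)
    return levels, data

r1 = repetition_levels([["a", "b", "c"], ["d", "e", "f", "g"]])
r2 = repetition_levels([["h"], ["i", "j"]])
```

Concatenating the two records reproduces the slide's column: levels 0,2,2,1,2,2,2,0,1,2 over data a through j.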
Differences between Parquet and ORC nesting support

• Parquet:
  • Repetition/definition levels capture the structure => one column per leaf in the schema.
  • Array<int> is one column.
  • Nullity/repetition of an inner node is stored in each of its children => one column independently of nesting, with some redundancy.
• ORC:
  • An extra column for each map or list records its size => one column per node in the schema.
  • Array<int> is two columns: array size and content.
  • => An extra column per nesting level.
Reading assembled records

• Record-level API to integrate with existing row-based engines (Hive, Pig, M/R).
• Aware of dictionary encoding: enables optimizations.
• Assembles projections for any subset of the columns: only those are loaded from disk.
(Diagram: partial assemblies of the Document record for different projections: DocId only (20), Links.Backward only (10, 30), Links.Forward only (80), and Backward + Forward together (10, 30, 80))
Projection push down

• Automated in Pig and Hive: based on the query being executed, only the columns for the fields accessed will be fetched.
• Explicit in MapReduce, Scalding and Cascading, using a globbing syntax.
  Example: field1;field2/**;field4/{subfield1,subfield2}
  will return:
  • field1
  • all the columns under field2
  • subfield1 and subfield2 under field4, but not field3
Reading columns

• To implement a column-based execution engine.
• Iteration on triplets: repetition level, definition level, value.
• Repetition level = 0 indicates a new record.
• Values can be consumed encoded or decoded: computing aggregations on integers is faster than on strings.
(Diagram: a column read as a stream of (R, D, value) triplets; R = 1 continues the same row, R = 0 starts a new one, and D below the max definition level marks a null; the example assembles the values A, B, C and D, plus nulls, into rows 0 through 3)
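The triplet iteration can be sketched in a few lines (my own illustration, with made-up example data consistent with the semantics above):

```python
# Sketch: consume a column as (R, D, value) triplets. R = 0 starts a new
# row, R = 1 appends to the current one; D below the column's max
# definition level yields a null.

def assemble_rows(triplets, max_def):
    rows = []
    for r, d, v in triplets:
        if r == 0:
            rows.append([])         # repetition level 0: new record
        rows[-1].append(v if d == max_def else None)
    return rows

# A column with max definition level 1; four rows with some nulls.
stream = [(0, 1, "A"), (1, 1, "B"),    # row 0: A, B
          (0, 0, None),                # row 1: null
          (0, 1, "C"), (1, 0, None),   # row 2: C, null
          (0, 1, "D")]                 # row 3: D
rows = assemble_rows(stream, max_def=1)
```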
Integration APIs

• Schema definition and record materialization:
  • Hadoop does not have a notion of schema; however Impala, Pig, Hive, Thrift, Avro and Protocol Buffers do.
  • Event-based, SAX-style record materialization layer. No double conversion.
• Integration with existing type systems and processing frameworks:
  • Impala
  • Pig
  • Thrift and Scrooge for M/R, Cascading and Scalding
  • Cascading tuples
  • Avro
  • Hive
  • Spark
Encodings

• Bit packing:
  • Small integers encoded in the minimum number of bits required.
  • Useful for repetition levels, definition levels and dictionary keys.
• Run-length encoding:
  • Used in combination with bit packing.
  • Cheap compression.
  • Works well for the definition levels of sparse columns.
• Dictionary encoding:
  • Useful for columns with few (< 50,000) distinct values.
  • When applicable, compresses better and faster than heavyweight algorithms (gzip, lzo, snappy).
• Extensible: defining new encodings is supported by the format.
Example: bit packing stores the values 1 3 2 0 0 2 2 0 in two bits each: 01|11|10|00 00|10|10|00. Run-length encoding stores eight consecutive 1s as the pair (8, 1).
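Both encodings are simple enough to sketch directly (an illustration of my own, not Parquet's actual hybrid RLE/bit-packing codec, which interleaves the two in one byte stream):

```python
# Sketch: bit packing and run-length encoding of small integers.

def bit_pack(values, width):
    """Pack each value into `width` bits, MSB first, padded to whole bytes."""
    bits = "".join(format(v, "0{}b".format(width)) for v in values)
    bits += "0" * (-len(bits) % 8)  # pad the last byte with zero bits
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def bit_unpack(data, width, count):
    bits = "".join(format(b, "08b") for b in data)
    return [int(bits[i * width:(i + 1) * width], 2) for i in range(count)]

def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1][0] += 1        # extend the current run
        else:
            runs.append([1, v])     # start a new run
    return [(count, value) for count, value in runs]

packed = bit_pack([1, 3, 2, 0, 0, 2, 2, 0], width=2)  # the slide's example
```

Packing the slide's eight values at two bits each yields exactly two bytes, and eight consecutive 1s collapse to the single pair (8, 1).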
Parquet 2.0

• More encodings: compact storage without heavyweight compression.
  • Delta encodings for integers, strings and sorted dictionaries.
  • Improved encodings for strings and booleans.
• Statistics: to be used by query planners and predicate push down.
• New page format: to facilitate skipping ahead at a more granular level.
Main contributors

• Julien Le Dem (Twitter): format, core, Pig, Thrift integration, encodings
• Nong Li, Marcel Kornacker, Todd Lipcon (Cloudera): format, Impala
• Jonathan Coveney, Alex Levenson, Aniket Mokashi, Tianshuo Deng (Twitter): encodings, projection push down
• Mickaël Lacour, Rémy Pecqueur (Criteo): Hive integration
• Dmitriy Ryaboy (Twitter): format, Thrift and Scrooge Cascading integration
• Tom White (Cloudera): Avro integration
• Avi Bryant, Colin Marc (Stripe): Cascading tuples integration
• Matt Massie (Berkeley AMP Lab): predicate and projection push down
• David Chen (LinkedIn): Avro integration improvements
How to contribute

Questions? Ideas? Want to contribute?

Contribute at: github.com/Parquet

Come talk to us: Cloudera, Criteo, Twitter.