orc 2015: faster, better, smaller

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

ORC 2015: Faster, Better, Smaller

Prasanth Jayachandran, Apache Hive Team, Hortonworks, @prasanth_j


Apache ORC – Optimized Row-Columnar File

§  Apache TLP – orc.apache.org

§  Type Specific Encodings

§  Came out of Apache Hive

§  Vectorized Readers (Java, C++)

§  Projection and Predicate Pushdown

§  Columnar Storage

§  Block Compression

§  Hive ACID transactions

§  Single SerDe Format

§  Protobuf Metadata Storage


ORC: Format Specification

How does ORC store data?


ORC File Layout

§  File Footer and Postscript

§  Stripes

§  Indexes (Row group indexes and Bloom Filter interleaved)

§  Min/Max stats, Positions for every 10K rows

§  Data

§  Multiple streams per column, encoded and compressed independently

§  Stripe Footer

§  Locations to streams, type of encoding

§  Full specification at (1)


ORC Writer

Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>

§  One tree writer per flattened column

§  Multiple streams per column

§  PRESENT

§  DATA

§  LENGTH

§  DICTIONARY_DATA

§  SECONDARY

§  ROW_INDEX

§  BLOOM_FILTER
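The "one tree writer per flattened column" idea can be sketched as a pre-order walk over the type tree. This is a minimal illustration, not the Hive writer: `TypeNode` and `flatten` are hypothetical names, and the schema is assumed to nest as struct<i, m:map<k, v:struct<s, d>>, t>.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """Hypothetical node of the schema type tree."""
    name: str
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def flatten(root: TypeNode) -> List[Tuple[int, str, str]]:
    """Assign column ids in pre-order; each id would get its own tree writer."""
    out: List[Tuple[int, str, str]] = []

    def visit(node: TypeNode, path: str) -> None:
        out.append((len(out), path or "<root>", node.kind))
        for child in node.children:
            visit(child, f"{path}.{child.name}" if path else child.name)

    visit(root, "")
    return out


# Assumed nesting of the schema shown on the slide.
schema = TypeNode("", "struct", [
    TypeNode("i", "int"),
    TypeNode("m", "map", [
        TypeNode("k", "string"),
        TypeNode("v", "struct", [TypeNode("s", "string"), TypeNode("d", "double")]),
    ]),
    TypeNode("t", "time"),
])
columns = flatten(schema)
for column in columns:
    print(column)
```

Under that assumption the schema flattens to eight columns, with the root struct as column 0.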


ORC Data Streams

Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>

§  Streams can be suppressed.

§  Example: PRESENT stream is suppressed when all values in a stripe are non-null.

[Figure: per-column streams (IS_PRESENT, DATA, DICTIONARY, LENGTH, SECONDARY) feeding compression buffers]


ORC: Features Timeline

How has ORC improved over time?


Timeline

February 2013

§  Stinger Initiative Announcement*

§  Roadmap to improve Apache Hive’s performance by 100x

§  Delivered in 100% Apache Open Source

* http://hortonworks.com/blog/100x-faster-hive/

| 2013 | 2014 | 2015

Vectorized SQL Engine (SQL engine) + ORC (columnar storage) + Apache Tez (distributed execution) = 100x


Timeline

March 2013

Optimized Row Columnar (ORC) file format committed to Hive

§  Hive version: 0.11

§  Native data format in Hive


Timeline

March 2013

Predicate Pushdown

§  SARG interface

§  Prune stripes and row groups based on min/max statistics

Improved Run Length Encoding

§  Tighter bit packing

§  Longer runs

§  DELTA, SHORT_REPEATS, DIRECT, PATCHED_BASE


Run Length Encoding Improvements

Compression ratio, encoding time and decoding time (in ms): RLE (hive 0.11) vs RLE (hive >= 0.12)

Dataset                                          | hive 0.11 ratio | enc (ms) | dec (ms) | hive >= 0.12 ratio | enc (ms) | dec (ms)
Twitter Census API ID (24,556,361 records)       | 2.32            | 1770     | 1263     | 6.97               | 1558     | 864
HTTP Archive (bytes.json)                        | 79.4            | 198      | 191      | 200.82             | 263      | 125
Github Archive (root.payload.name.txt.dict-len)  | 114.05          | 21       | 15       | 260.73             | 23       | 15
AOL Querylog Epoch (36,389,577 records)          | 2.51            | 553      | 364      | 3.7                | 652      | 246

Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx
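To give a feel for how a run of integers might be routed to one of the sub-encodings named in this deck, here is a simplified chooser. It is a sketch under assumptions, not the ORC wire format: the 3 to 10 length bound for short repeats is assumed, PATCHED_BASE is left out, and only non-negative values are handled.

```python
from typing import Dict, List, Tuple


def choose_encoding(values: List[int]) -> Tuple[str, Dict[str, int]]:
    """Pick a sub-encoding for one run (simplified, non-negative ints only)."""
    if not values or min(values) < 0:
        raise ValueError("expects a non-empty run of non-negative integers")
    # Short run of one repeated value (length bound is an assumption).
    if len(set(values)) == 1 and 3 <= len(values) <= 10:
        return "SHORT_REPEATS", {"value": values[0], "count": len(values)}
    # Fixed step between neighbours: store base and delta only.
    deltas = [b - a for a, b in zip(values, values[1:])]
    if deltas and len(set(deltas)) == 1:
        return "DELTA", {"base": values[0], "delta": deltas[0], "count": len(values)}
    # Otherwise bit-pack every value at the width of the largest one.
    width = max(max(values).bit_length(), 1)
    return "DIRECT", {"bit_width": width, "count": len(values)}


print(choose_encoding([7, 7, 7, 7, 7]))
print(choose_encoding([10, 20, 30, 40]))
print(choose_encoding([3, 1, 4, 1, 5]))
```

Monotonic ids and timestamps tend to fall into the delta case, which is one plausible reason the id and epoch datasets compress so much better in the table above.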


Timeline

April 2013

Vectorized ORC readers

§  Read and process columns in batches of 1024 rows

June 2013

Null stream suppression

§  Suppress PRESENT stream if no nulls in a stripe

§  Enables fast path in vectorization
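Null stream suppression can be sketched like this. The function name `build_streams` and the list-based streams are illustrative assumptions; real streams are encoded and compressed bytes.

```python
from typing import Dict, List, Optional


def build_streams(values: List[Optional[int]]) -> Dict[str, list]:
    """Split a column into PRESENT and DATA, dropping PRESENT if no nulls."""
    present = [v is not None for v in values]
    streams: Dict[str, list] = {"DATA": [v for v in values if v is not None]}
    if not all(present):
        # Only written when at least one value is null.
        streams["PRESENT"] = present
    return streams


print(build_streams([1, 2, 3]))
print(build_streams([1, None, 3]))
```

A reader that sees no PRESENT stream can skip null checks for the whole stripe, which is the fast path the slide refers to.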


Timeline

October 2013

Statistics Interface

§  Writer – Update statistics during load time

§  Reader – ANALYZE TABLE .. NOSCAN

November 2013

Split Elimination

§  Stripe level column statistics

§  Eliminate stripes that do not satisfy predicate conditions


Timeline

February 2014

Zero copy read path

§  HDFS caching APIs to read directly into memory without extra data copies

June 2014

Serialization Improvements

§  Bit width alignment (trade off space for speed)

§  Unrolled bit packing and unpacking

§  Buffered double reader and writer
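Bit width alignment can be sketched as rounding the required width up to a small set of supported widths. The set below is taken from the bit widths reported in this deck's integer read benchmark (1, 2, 4, 8, 16, 24, ... 64); whether the writer uses exactly this set is an assumption, and the function names are hypothetical.

```python
ALIGNED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def required_bits(max_value: int) -> int:
    """Smallest number of bits that can hold a non-negative value."""
    return max(max_value.bit_length(), 1)


def aligned_bits(bits: int) -> int:
    """Round a bit width up to the nearest aligned width."""
    for width in ALIGNED_WIDTHS:
        if bits <= width:
            return width
    raise ValueError("values wider than 64 bits are not supported")


for largest in (5, 1000, 100_000):
    needed = required_bits(largest)
    print(largest, needed, aligned_bits(needed))
```

The aligned width wastes a few bits per value, but a handful of fixed widths lets the unpacking loops be unrolled, which is the space-for-speed trade the slide mentions.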


Serialization Improvements

[Chart: ORC Read Integer Performance (smaller is better). X axis: bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64). Y axis: mean time (ms). Series: hive 0.13 unpacking, hive-1.0 unpacking (new).]


Serialization Improvements

ORC Read Double Performance: mean time (ms) by double read mode (smaller is better)

§  hive <= 0.13: 241.679

§  buffered + BE: 171.045

§  buffered + LE: 174.163

~1.4x improvement


Timeline

June 2014

Adaptive compression buffer size

§  For >1000 columns, adjust compression buffer size based on available memory

§  Avoids wide table OOMs

July 2014

Fast stripe level file merging

§  Many small files to few large files

§  No Decompression, No Decoding

§  ALTER TABLE … CONCATENATE


Fast File Merging

ETL With File Merging – TPC-H 1000 Scale Lineitem: total time in seconds for file formats supporting CONCAT (smaller is better)

File format | Load time (s) | Merge time (s) | Total (s)
ORC         | 1091          | 245            | 1336
RCFile      | 651           | 816            | 1467

~3.33x improvement in merge time


Timeline

July 2014

ORC Padding Improvements

§  Pad bytes to avoid remote HDFS reads

§  Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)

Decouple stripe size vs block size

§  Smaller stripes (64MB)

§  More stripes per block (4 per block)

§  Better parallelism & split elimination
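The padding decision can be sketched as below. This is a simplified model under assumptions: `plan_next_stripe` is a hypothetical name, the 5% tolerance is read off the "worst case: 5% wastage" bullet, and the 256MB block size follows from 4 stripes of 64MB per block.

```python
from typing import Tuple

MB = 1024 * 1024


def plan_next_stripe(block_size: int, offset: int, stripe_size: int,
                     tolerance: float = 0.05) -> Tuple[str, int, int]:
    """Return (action, next stripe size, pad bytes) for the next stripe."""
    available = block_size - (offset % block_size)
    ratio = available / stripe_size
    if ratio >= 1.0:
        return ("write", stripe_size, 0)      # a full stripe still fits
    if ratio > tolerance:
        return ("shrink", available, 0)       # fit the stripe to the block end
    return ("pad", stripe_size, available)    # waste at most tolerance * stripe


print(plan_next_stripe(256 * MB, 0, 64 * MB))
print(plan_next_stripe(256 * MB, 200 * MB, 64 * MB))
print(plan_next_stripe(256 * MB, 254 * MB, 64 * MB))
```

Either way no stripe straddles a block boundary, so a task reading a stripe never has to fetch part of it from a remote block.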


Timeline

September 2014

String Dictionary Improvements

§  Row group level checking

§  Remember decision across stripes

§  Avoids expensive RBTree insertions
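The row-group-level check and the remembered decision can be sketched as follows. The class name, the use of the first row group only, and the 0.8 distinct-ratio threshold are assumptions made for illustration.

```python
from typing import List, Optional


def use_dictionary(sample: List[str], threshold: float = 0.8) -> bool:
    """Dictionary encoding pays off only if enough values repeat."""
    if not sample:
        return True
    return len(set(sample)) / len(sample) <= threshold


class StringColumnWriter:
    """Hypothetical writer that decides once and reuses the decision."""

    def __init__(self) -> None:
        self.decision: Optional[bool] = None

    def start_stripe(self, first_row_group: List[str]) -> str:
        if self.decision is None:
            # Checked on one row group instead of the whole stripe.
            self.decision = use_dictionary(first_row_group)
        return "DICTIONARY" if self.decision else "DIRECT"


writer = StringColumnWriter()
print(writer.start_stripe(["a", "b", "c", "d"]))
print(writer.start_stripe(["a", "a", "a", "a"]))
```

Once a column is known to be mostly unique, later stripes skip the dictionary (and its tree insertions) entirely.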


String Dictionary Improvements

String Dictionary Improvements - TPC-H 1000 Scale Lineitem: load time in seconds by Hive version (smaller is better)

§  hive <= 0.13: 767

§  hive > 0.13: 540

~1.4x improvement


Timeline

September 2014

Improved ZLIB compression

§  Different streams compressed with different zlib strategies/levels

§  Compress integers and doubles differently

§  Data and Dictionary stream - Looks for smaller byte patterns

§  All other streams - Less LZ77, More Huffman
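Per-stream zlib tuning can be sketched with the standard `zlib` module. The mapping from stream kind to level and strategy below is an assumption chosen to mirror the bullets (default strategy for data and dictionary streams, `Z_FILTERED` for the rest); it is not taken from the Hive source.

```python
import zlib

# Assumed settings per stream kind: (compression level, zlib strategy).
STREAM_SETTINGS = {
    "DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "DICTIONARY_DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
}
# Everything else: favour Huffman coding over LZ77 matching, at a fast level.
OTHER_STREAMS = (zlib.Z_BEST_SPEED, zlib.Z_FILTERED)


def compress_stream(kind: str, payload: bytes) -> bytes:
    """Compress one stream with the settings assumed for its kind."""
    level, strategy = STREAM_SETTINGS.get(kind, OTHER_STREAMS)
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                  zlib.DEF_MEM_LEVEL, strategy)
    return compressor.compress(payload) + compressor.flush()


text = b"shipped,pending,shipped,returned," * 200
lengths = bytes(range(256)) * 20
print(len(compress_stream("DATA", text)), len(compress_stream("LENGTH", lengths)))
```

Both outputs are ordinary zlib streams, so a reader needs no knowledge of which strategy the writer picked.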


ZLIB Improvements

Data Size Improvements - TPC-H 1000 Scale Lineitem: data size in GB by file format + compression codec (smaller is better)

§  ORC + (old ZLIB): 178.5

§  ORC + (new ZLIB): 172.2

§  ORC + SNAPPY: 225.1

~4% improvement over old ZLIB, ~1.3x smaller than SNAPPY


ZLIB Improvements

Load Time Improvements - TPC-H 1000 Scale Lineitem: load time by file format + compression codec (smaller is better)

§  ORC + (old ZLIB): 674

§  ORC + (new ZLIB): 433

§  ORC + SNAPPY: 389

~1.6x improvement over old ZLIB, only ~10% slower than SNAPPY


Timeline

September 2014

ACID transactions

§  Order of millions of rows

§  Not designed for OLTP requirements

§  Streaming Ingest via Flume or Storm

§  Atomically add base and delta directories

§  Minor compaction – Merge many delta files

§  Major compaction – Re-write base files to incorporate delta file changes

Broken pattern: Add Partitions for Atomicity


Timeline

January 2015

hasNull flag in ORC internal index

§  Better pruning of row groups

§  Improves the performance of SELECT .. WHERE column IS NULL;
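How a hasNull flag helps an IS NULL predicate can be sketched as below. `RowGroupIndexEntry` and `row_groups_for_is_null` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupIndexEntry:
    """Hypothetical index entry: min/max plus a hasNull flag."""
    minimum: int
    maximum: int
    has_null: bool


def row_groups_for_is_null(index: List[RowGroupIndexEntry]) -> List[int]:
    """Only row groups flagged as containing a null need to be read."""
    return [i for i, entry in enumerate(index) if entry.has_null]


index = [
    RowGroupIndexEntry(1, 100, False),
    RowGroupIndexEntry(101, 200, False),
    RowGroupIndexEntry(201, 300, True),
    RowGroupIndexEntry(301, 400, False),
]
to_read = row_groups_for_is_null(index)
print(to_read)
```

Without the flag, min/max alone say nothing about nulls, so every row group has to be scanned for such a query.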


hasNull in Index Improvement

select * from lineitem where l_shipdate is null: execution time in seconds by Hive version (smaller is better)

§  hive < 1.1.0: 66.73

§  hive >= 1.1.0: 7.87

Bytes Read: 208.77 GB vs 539 MB

Execution Time: ~8.5x improvement


Timeline

February 2015

Bloom Filter Index

§  Much better row group pruning when compared to min/max

§  Bloom filter evaluated after the fast Min/Max based elimination
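The two-step evaluation (min/max first, Bloom filter second) can be sketched with a toy Bloom filter. The sizing, the SHA-256 based hashing and all names here are assumptions for illustration and do not reflect the ORC implementation.

```python
import hashlib
from typing import List


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: int) -> List[int]:
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_group_may_match(minimum: int, maximum: int,
                        bloom: BloomFilter, value: int) -> bool:
    """Cheap min/max check first, Bloom filter only if that passes."""
    if not (minimum <= value <= maximum):
        return False
    return bloom.might_contain(value)


bloom = BloomFilter()
for key in (10, 20, 30):
    bloom.add(key)
print(row_group_may_match(10, 30, bloom, 20))
print(row_group_may_match(10, 30, bloom, 40))
```

Min/max cannot exclude a key that falls inside a wide range, while the Bloom filter usually can, which is why it prunes far more row groups for point lookups.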


Bloom Filter Indexes Improvements

select * from tpch_1000.lineitem where l_orderkey = 1212000001; rows read (log scale – smaller is better)

§  No Indexes: 5,999,989,709

§  Min-Max Indexes: 540,000

§  Bloomfilter Indexes: 10,000


Bloom Filter Indexes Improvements

select * from tpch_1000.lineitem where l_orderkey=1212000001; time taken in seconds (smaller is better)

§  No Indexes: 74

§  Min-Max Indexes: 4.5 (~16x improvement)

§  Bloomfilter Indexes: 1.34 (~3.3x improvement)


Timeline

April 2015

Split Strategies

§  BI – Skip reading file footer

§  ETL – Read and cache file footer

§  HYBRID – Default. Chooses BI/ETL based on number of files and average file size

§  Group splits based on columnar projection size instead of file size
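A HYBRID-style choice can be sketched as a small heuristic. The slide only says the choice depends on the number of files and the average file size; the direction of the comparisons, the parameter names `max_split_size` and `min_splits`, and the function itself are assumptions.

```python
from typing import List


def choose_split_strategy(file_sizes: List[int], max_split_size: int,
                          min_splits: int, configured: str = "HYBRID") -> str:
    """Return "BI" or "ETL" for a directory of ORC files (assumed heuristic)."""
    if configured in ("BI", "ETL"):
        return configured
    if not file_sizes:
        return "BI"  # nothing to read footers from
    average = sum(file_sizes) / len(file_sizes)
    # Large or few files: reading footers is cheap relative to the work saved.
    if average > max_split_size or len(file_sizes) <= min_splits:
        return "ETL"
    # Many small files: skip the footers and generate splits quickly.
    return "BI"


MB = 1024 * 1024
print(choose_split_strategy([10 * MB] * 1000, 256 * MB, 10))
print(choose_split_strategy([1024 * MB] * 1000, 256 * MB, 10))
```

The trade is between split generation latency (BI) and the pruning that footer statistics enable (ETL).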


Timeline

April 2015

ORC became Apache Top Level Project

§  C++ reader with contributions from Hortonworks, HP and Microsoft

§  Column encryption to encrypt sensitive columns

http://orc.apache.org/


ORC: In Production


ORC at Facebook

§  Saved more than 1,400 servers worth of storage. (2)

§  Compression ratio increased from 5x to 8x globally. (2)


ORC at Spotify

§  IO: 16x less HDFS read when using ORC versus Avro. (3)

§  CPU: 32x less CPU when using ORC versus Avro. (3)


ORC at Yahoo!

§  6-50x speedup when using ORC versus Text File. (4)

§  1.6-30x speedup when using ORC versus RCFile. (4)


ORC: LLAP and Sub-second

ORC – Pushing for Sub-second


ORC: LLAP

§  JIT Performance for short queries

§  Row-group level caching

§  Asynchronous IO Elevator

§  Multi-threaded Column Vector processing


ORC: Vectorization + SIMD

§  Allocation-free tight inner loops enable the JDK’s auto-vectorization

§  Vectors can be filtered early in ORC

§  String dictionary can be used to binary-search

§  Vectorized SIMD Join

§  Improves performance for single key joins

Example:

Query: select ss_ext_tax + 1.0 from store_sales_orc;

JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"

Note: Make sure to have the hotspot disassembler in $JAVA_HOME/jre/lib

Generated Assembly:

    0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
    0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
    0x00007f13d2e6afba: movslq %eax,%r10
    0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

AVX - Vector Addition Packed Double: 4 doubles loaded into 256-bit registers


ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)

select * from tpch_1000.lineitem where l_orderkey=1212000001;


Questions

?

Interested? Stop by the Hortonworks booth to learn more


Endnotes

(1)  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification

(2)  https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/

(3)  http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014

(4)  http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3
