ORC 2015: Faster, Better, Smaller

Prasanth Jayachandran, Apache Hive Team, Hortonworks, @prasanth_j
© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Upload: the-apache-software-foundation

Post on 28-Jul-2015


Page 1: ORC 2015: Faster, Better, Smaller

ORC 2015: Faster, Better, Smaller

Prasanth Jayachandran, Apache Hive Team, Hortonworks, @prasanth_j

Page 2: ORC 2015: Faster, Better, Smaller

Apache ORC – Optimized Row-Columnar File

§  Apache TLP – orc.apache.org
§  Type Specific Encodings
§  Came out of Apache Hive
§  Vectorized Readers (Java, C++)
§  Projection and Predicate Pushdown
§  Columnar Storage
§  Block Compression
§  Hive ACID transactions
§  Single SerDe Format
§  Protobuf Metadata Storage

Page 3: ORC 2015: Faster, Better, Smaller

ORC: Format Specification

How does ORC store data?

Page 4: ORC 2015: Faster, Better, Smaller

ORC File Layout

§  File Footer and Postscript
§  Stripes
    §  Indexes (row group indexes and Bloom filters, interleaved)
        §  Min/Max stats, positions for every 10K rows
    §  Data
        §  Multiple streams per column, encoded and compressed independently
    §  Stripe Footer
        §  Locations of streams, type of encoding
§  Full specification at (1)
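Because the file footer and postscript sit at the end of the file, a reader works backwards from the tail. The sketch below assumes, following the ORC specification, that the final byte of the file holds the postscript length. The function name and the synthetic bytes are hypothetical and are not part of any ORC library.

```python
def postscript_span(file_bytes: bytes) -> tuple[int, int]:
    """Return (start, end) offsets of the postscript inside file_bytes.

    Assumption: the last byte of the file stores the postscript length.
    """
    if not file_bytes:
        raise ValueError("empty file")
    ps_len = file_bytes[-1]          # length byte at the very end
    end = len(file_bytes) - 1        # postscript ends just before that byte
    start = end - ps_len
    if start < 0:
        raise ValueError("postscript length exceeds file size")
    return start, end


# Synthetic stand-in for a file: body, a 6-byte "postscript", then its length.
fake_file = b"STRIPES+FOOTER" + b"PSDATA" + bytes([6])
ps_start, ps_end = postscript_span(fake_file)
postscript = fake_file[ps_start:ps_end]
```

A real reader would then decode the postscript to find the footer length and compression settings, and from the footer locate the stripes.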

Page 5: ORC 2015: Faster, Better, Smaller

ORC Writer

Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>

§  One tree writer per flattened column
§  Multiple streams per column
    §  PRESENT
    §  DATA
    §  LENGTH
    §  DICTIONARY_DATA
    §  SECONDARY
    §  ROW_INDEX
    §  BLOOM_FILTER
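To make "one tree writer per flattened column" concrete, here is a small sketch that flattens the example schema with a pre-order walk, which is how ORC numbers columns (the root struct is column 0). The TypeNode class and flatten function are hypothetical helpers written for this illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TypeNode:
    name: str
    kind: str
    children: list["TypeNode"] = field(default_factory=list)


def flatten(root: TypeNode) -> list[tuple[int, str, str]]:
    """Assign column ids in pre-order: parent first, then children left to right."""
    columns: list[tuple[int, str, str]] = []

    def visit(node: TypeNode) -> None:
        columns.append((len(columns), node.name, node.kind))
        for child in node.children:
            visit(child)

    visit(root)
    return columns


# The schema from the slide: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
schema = TypeNode("root", "struct", [
    TypeNode("i", "int"),
    TypeNode("m", "map", [
        TypeNode("k", "string"),
        TypeNode("v", "struct", [TypeNode("s", "string"), TypeNode("d", "double")]),
    ]),
    TypeNode("t", "time"),
])
columns = flatten(schema)
```

Under this numbering the schema yields eight columns, and each one would get its own writer and its own set of streams.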

Page 6: ORC 2015: Faster, Better, Smaller

ORC Data Streams

Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>

§  Streams can be suppressed.
§  Example: the PRESENT stream is suppressed when all values in a stripe are non-null.

[Figure: column streams (IS_PRESENT, DATA, DICTIONARY, LENGTH, SECONDARY) and compression buffers]

Page 7: ORC 2015: Faster, Better, Smaller

ORC: Features Timeline

How has ORC improved over time?

Page 8: ORC 2015: Faster, Better, Smaller

Timeline

February 2013

Stinger Initiative Announcement*
§  Roadmap to improve Apache Hive's performance by 100x
§  Delivered in 100% Apache Open Source

Vectorized SQL Engine (SQL engine) + ORC (columnar storage) + Apache Tez (distributed execution) = 100x

* http://hortonworks.com/blog/100x-faster-hive/

Page 9: ORC 2015: Faster, Better, Smaller

Timeline

March 2013

Optimized Row Columnar (ORC) file format committed to Hive
§  Hive version: 0.11
§  Native data format in Hive

Page 10: ORC 2015: Faster, Better, Smaller

Timeline

March 2013

Predicate Pushdown
§  SARG interface
§  Prune stripes and row groups based on min/max statistics

Improved Run Length Encoding
§  Tighter bit packing
§  Longer runs
§  DELTA, SHORT_REPEATS, DIRECT, PATCHED_BASE
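The min/max pruning idea can be sketched in a few lines: a row group is read only if its [min, max] range can overlap the predicate's range. The RowGroupStats class, the function name, and the sample statistics are hypothetical; the actual SARG interface in Hive is considerably richer.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    minimum: int
    maximum: int


def row_groups_to_read(stats: list[RowGroupStats], lo: int, hi: int) -> list[int]:
    """Indexes of row groups whose [minimum, maximum] overlaps [lo, hi]."""
    return [
        index
        for index, s in enumerate(stats)
        if s.maximum >= lo and s.minimum <= hi
    ]


stats = [RowGroupStats(1, 100), RowGroupStats(101, 250), RowGroupStats(240, 900)]
# Equality predicate "col = 245" expressed as the range [245, 245].
selected = row_groups_to_read(stats, 245, 245)
```

Here the first row group is skipped without being read, while the two overlapping groups must still be scanned because min/max statistics cannot prove the value is absent.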

Page 11: ORC 2015: Faster, Better, Smaller

Run Length Encoding Improvements

Ratio = compression ratio; Enc / Dec = encoding / decoding time in ms.

                                                  RLE (hive 0.11)            RLE (hive >= 0.12)
Dataset                                           Ratio    Enc (ms) Dec (ms) Ratio    Enc (ms) Dec (ms)
Twitter Census API ID (24,556,361 records)        2.32     1770     1263     6.97     1558     864
HTTP Archive (bytes.json)                         79.4     198      191      200.82   263      125
Github Archive (root.payload.name.txt.dict-len)   114.05   21       15       260.73   23       15
AOL Querylog Epoch (36,389,577 records)           2.51     553      364      3.7      652      246

Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx

Page 12: ORC 2015: Faster, Better, Smaller

Timeline

April 2013 to June 2013

Vectorized ORC readers
§  Read and process columns in batches of size 1024

Null stream suppression
§  Suppress PRESENT stream if no nulls in a stripe
§  Enables fast path in vectorization
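A rough sketch of why a suppressed PRESENT stream enables a fast path: when no null flags exist, the per-batch loop needs no per-row null check. The batch size of 1024 comes from the slide; the function name and the list-based representation are assumptions for illustration and do not mirror Hive's column vector classes.

```python
BATCH_SIZE = 1024  # batch size quoted on the slide


def add_constant(values, constant, present=None):
    """Add a constant to every value, processing one batch at a time.

    present is None when the PRESENT stream was suppressed (no nulls).
    Otherwise it is a list of booleans, False meaning the row is null.
    """
    result = []
    for start in range(0, len(values), BATCH_SIZE):
        batch = values[start:start + BATCH_SIZE]
        if present is None:
            # Fast path: no null check inside the inner loop.
            result.extend(v + constant for v in batch)
        else:
            flags = present[start:start + BATCH_SIZE]
            result.extend(
                v + constant if is_present else None
                for v, is_present in zip(batch, flags)
            )
    return result


no_nulls = add_constant([1.0, 2.0], 1.0)
with_nulls = add_constant([1.0, 0.0], 1.0, present=[True, False])
```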

Page 13: ORC 2015: Faster, Better, Smaller

Timeline

October 2013 to November 2013

Statistics Interface
§  Writer – Update statistics during load time
§  Reader – ANALYZE TABLE .. NOSCAN

Split Elimination
§  Stripe level column statistics
§  Eliminate stripes that do not satisfy predicate conditions

Page 14: ORC 2015: Faster, Better, Smaller

Timeline

February 2014 to June 2014

Zero copy read path
§  HDFS caching APIs to read directly into memory without extra data copies

Serialization Improvements
§  Bit width alignment (trades space for speed)
§  Unrolled bit packing and unpacking
§  Buffered double reader and writer
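Bit width alignment can be illustrated as rounding the number of bits a value needs up to the next supported width, which wastes some space but keeps packing and unpacking loops simple. The list of widths below is taken from the bit widths shown on the integer read chart in this deck; treating it as the full set of aligned widths is an assumption, and both function names are hypothetical.

```python
# Widths from the x axis of the "ORC Read Integer Performance" chart (assumed set).
ALIGNED_WIDTHS = (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64)


def bits_needed(max_value: int) -> int:
    """Minimum number of bits to represent a non-negative integer."""
    if max_value < 0:
        raise ValueError("expected a non-negative value")
    return max(1, max_value.bit_length())


def aligned_width(bits: int) -> int:
    """Round a bit count up to the nearest aligned width."""
    for width in ALIGNED_WIDTHS:
        if width >= bits:
            return width
    raise ValueError("value does not fit in 64 bits")


small = aligned_width(bits_needed(5))      # 3 bits needed
medium = aligned_width(bits_needed(1000))  # 10 bits needed
```

In this sketch a value needing 3 bits is stored in 4, and one needing 10 bits is stored in 16, which is the space-for-speed trade named on the slide.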

Page 15: ORC 2015: Faster, Better, Smaller

Serialization Improvements

[Chart: ORC Read Integer Performance (smaller is better). X axis: bit width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64). Y axis: mean time (ms), 0 to 1800. Series: hive 0.13 unpacking; hive-1.0 unpacking (new).]

Page 16: ORC 2015: Faster, Better, Smaller

Serialization Improvements

[Chart: ORC Read Double Performance (smaller is better); mean time in ms by double read mode]

Double read mode     Mean time (ms)
hive <= 0.13         241.679
buffered + BE        171.045
buffered + LE        174.163

~1.4x improvement

Page 17: ORC 2015: Faster, Better, Smaller

Timeline

June 2014 to July 2014

Adaptive compression buffer size
§  For tables with >1000 columns, adjust compression buffer size based on available memory
§  Avoids wide table OOMs

Fast stripe level file merging
§  Many small files to few large files
§  No decompression, no decoding
§  ALTER TABLE … CONCATENATE

Page 18: ORC 2015: Faster, Better, Smaller

Fast File Merging

[Chart: ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better); total time in seconds by CONCAT-supporting file format]

File format   Load time (s)   Merge time (s)   Total (s)
ORC           1091            245              1336
RCFile        651             816              1467

~3.33x improvement in merge time

Page 19: ORC 2015: Faster, Better, Smaller

Timeline

July 2014

ORC Padding Improvements
§  Pad bytes to avoid remote HDFS reads
§  Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)

Decouple stripe size vs block size
§  Smaller stripes (64MB)
§  More stripes per block (4 per block)
§  Better parallelism & split elimination
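The padding decision can be sketched as follows, assuming 64MB stripes and a 256MB block (4 stripes per block, as on the slide) and a 5% tolerance. The function name, return shape, and exact rule are a simplified illustration and should not be read as the writer's actual implementation.

```python
MB = 1024 * 1024


def plan_next_stripe(offset, stripe_size=64 * MB, block_size=256 * MB, tolerance=0.05):
    """Decide how to place the next stripe at a given file offset.

    Returns (action, stripe_bytes, padding_bytes):
      "full"   - a whole stripe fits in the current block
      "pad"    - the leftover is small, so pad it and start in the next block
      "shrink" - write a smaller stripe that ends on the block boundary
    """
    remaining = block_size - (offset % block_size)
    if remaining >= stripe_size:
        return ("full", stripe_size, 0)
    if remaining <= tolerance * stripe_size:
        return ("pad", stripe_size, remaining)
    return ("shrink", remaining, 0)


at_start = plan_next_stripe(0)
near_end = plan_next_stripe(254 * MB)   # 2MB left in the block
mid_block = plan_next_stripe(200 * MB)  # 56MB left in the block
```

Either way no stripe straddles a block boundary, so a task reading a stripe does not have to fetch part of it from a remote block.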

Page 20: ORC 2015: Faster, Better, Smaller

Timeline

September 2014

String Dictionary Improvements
§  Row group level checking
§  Remember decision across stripes
§  Avoids expensive RBTree insertions

Page 21: ORC 2015: Faster, Better, Smaller

String Dictionary Improvements

[Chart: String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better); load time in seconds by Hive version]

Hive version     Load time (s)
hive <= 0.13     767
hive > 0.13      540

~1.4x improvement

Page 22: ORC 2015: Faster, Better, Smaller

Timeline

September 2014

Improved ZLIB compression
§  Different streams compressed with different zlib strategies/levels
§  Compress integers and doubles differently
§  Data and Dictionary stream - Looks for smaller byte patterns
§  All other streams - Less LZ77, More Huffman
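Per-stream zlib strategies can be tried out with Python's standard zlib module, which exposes the same strategy constants as the underlying library. The mapping below (default strategy for data and dictionary streams, Z_FILTERED for the rest, since Z_FILTERED biases toward Huffman coding over string matching) is an assumption drawn from the bullet text, and compress_stream is a hypothetical helper.

```python
import zlib

# Assumed mapping, based on the bullets above; not taken from ORC source code.
PATTERN_HEAVY_STREAMS = {"DATA", "DICTIONARY_DATA"}


def compress_stream(kind: str, payload: bytes,
                    level: int = zlib.Z_DEFAULT_COMPRESSION) -> bytes:
    """Compress one stream with a strategy chosen by stream kind."""
    if kind in PATTERN_HEAVY_STREAMS:
        strategy = zlib.Z_DEFAULT_STRATEGY   # full LZ77 pattern matching
    else:
        strategy = zlib.Z_FILTERED           # less LZ77, more Huffman
    compressor = zlib.compressobj(
        level, zlib.DEFLATED, zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, strategy
    )
    return compressor.compress(payload) + compressor.flush()


dictionary_bytes = b"apple,banana,apple,cherry,banana," * 50
length_bytes = bytes(range(256)) * 4
packed_dictionary = compress_stream("DICTIONARY_DATA", dictionary_bytes)
packed_lengths = compress_stream("LENGTH", length_bytes)
```

The strategy only changes how the compressor searches for matches; the output is ordinary zlib data, so any reader can decompress it without knowing which strategy was used.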

Page 23: ORC 2015: Faster, Better, Smaller

ZLIB Improvements

[Chart: Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better); data size in GB by file format + compression codec]

File format + codec    Data size (GB)
ORC + (old ZLIB)       178.5
ORC + (new ZLIB)       172.2
ORC + SNAPPY           225.1

New ZLIB: ~4% improvement over old ZLIB, ~1.3x smaller than SNAPPY

Page 24: ORC 2015: Faster, Better, Smaller

ZLIB Improvements

[Chart: Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better); by file format + compression codec. The extracted axis label reads "Data Size in GBs", which does not match the chart title; the time unit is not stated.]

File format + codec    Load time
ORC + (old ZLIB)       674
ORC + (new ZLIB)       433
ORC + SNAPPY           389

New ZLIB: ~1.6x improvement over old ZLIB, only ~10% slower than SNAPPY

Page 25: ORC 2015: Faster, Better, Smaller

Timeline

September 2014

ACID transactions
§  Order of millions of rows
§  Not designed for OLTP requirements
§  Streaming Ingest via Flume or Storm
§  Atomically add base and delta directories
§  Minor compaction – Merge many delta files
§  Major compaction – Re-write base files to incorporate delta file changes

Broken pattern: Add Partitions for Atomicity

Page 26: ORC 2015: Faster, Better, Smaller

Timeline

January 2015

hasNull flag in ORC internal index
§  Better pruning of row groups
§  Improves the performance of SELECT .. WHERE column IS NULL;
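The effect of the hasNull flag on an IS NULL predicate is easy to sketch: any row group whose index entry says it contains no nulls can be skipped outright. The function name and the list-of-booleans index are hypothetical simplifications.

```python
def row_groups_for_is_null(has_null_flags: list[bool]) -> list[int]:
    """Row groups that may contain a null, i.e. the only ones worth reading."""
    return [index for index, has_null in enumerate(has_null_flags) if has_null]


# One flag per row group, as it might be recorded in the row index.
index_flags = [False, False, True, False]
to_read = row_groups_for_is_null(index_flags)
```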

Page 27: ORC 2015: Faster, Better, Smaller

hasNull in Index Improvement

[Chart: select * from lineitem where l_shipdate is null (smaller is better); execution time in seconds by Hive version]

Hive version     Execution time (s)
hive < 1.1.0     66.73
hive >= 1.1.0    7.87

Bytes read: 208.77 GB vs 539 MB
Execution time: ~8.5x improvement

Page 28: ORC 2015: Faster, Better, Smaller

Timeline

February 2015

Bloom Filter Index
§  Much better row group pruning when compared to min/max
§  Bloom filter evaluated after the fast Min/Max based elimination
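The two-stage check can be sketched with a toy Bloom filter: the cheap min/max test runs first, and the Bloom filter is consulted only for row groups that survive it. The BloomFilter class, its sizes, and the use of SHA-256 as the hash are stand-ins chosen for this illustration; they are not ORC's actual filter layout or hash function.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; SHA-256 is a stand-in hash for illustration."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value) -> None:
        for position in self._positions(value):
            self.bits |= 1 << position

    def might_contain(self, value) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(value))


def should_read(row_group, key) -> bool:
    """Apply min/max elimination first, then the Bloom filter."""
    minimum, maximum, bloom = row_group
    if key < minimum or key > maximum:
        return False  # rejected by the fast min/max check
    return bloom.might_contain(key)


bloom = BloomFilter()
for key in (10, 20, 30):
    bloom.add(key)
row_group = (10, 30, bloom)
```

A Bloom filter never reports a stored key as absent, so pruning is safe; it can occasionally report an absent key as present, which only costs an unnecessary read.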

Page 29: ORC 2015: Faster, Better, Smaller

Bloom Filter Indexes Improvements

[Chart: select * from tpch_1000.lineitem where l_orderkey = 1212000001; rows read (log scale, smaller is better)]

Index type             Rows read
No Indexes             5,999,989,709
Min-Max Indexes        540,000
Bloomfilter Indexes    10,000

Page 30: ORC 2015: Faster, Better, Smaller

Bloom Filter Indexes Improvements

[Chart: select * from tpch_1000.lineitem where l_orderkey=1212000001; time taken in seconds (smaller is better)]

Index type             Time taken (s)
No Indexes             74
Min-Max Indexes        4.5
Bloomfilter Indexes    1.34

~16x improvement (min-max indexes vs no indexes)
~3.3x improvement (Bloom filter indexes vs min-max indexes)

Page 31: ORC 2015: Faster, Better, Smaller

Timeline

April 2015

Split Strategies
§  BI – Skip reading file footer
§  ETL – Read and cache file footer
§  HYBRID – Default. Chooses BI/ETL based on number of files and average file size
§  Group splits based on columnar projection size instead of file size
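One plausible reading of the HYBRID rule is sketched below: reading footers (ETL) pays off when files are large or few, while skipping them (BI) is cheaper for many small files. The thresholds, the exact rule, and the function name are assumptions made for illustration; consult the Hive source for the real decision logic.

```python
MB = 1024 * 1024


def choose_split_strategy(file_sizes, max_split_size=256 * MB, file_count_threshold=100):
    """Pick "ETL" or "BI" from the number of files and their average size.

    Both thresholds are hypothetical defaults for this sketch.
    """
    if not file_sizes:
        return "BI"  # nothing to read, so there are no footers worth caching
    average_size = sum(file_sizes) / len(file_sizes)
    if average_size > max_split_size or len(file_sizes) <= file_count_threshold:
        return "ETL"  # read and cache file footers
    return "BI"       # skip reading file footers


few_large_files = choose_split_strategy([512 * MB] * 4)
many_small_files = choose_split_strategy([1 * MB] * 5000)
```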

Page 32: ORC 2015: Faster, Better, Smaller

Timeline

April 2015

ORC became Apache Top Level Project
§  C++ reader with contributions from Hortonworks, HP and Microsoft
§  Column encryption to encrypt sensitive columns

http://orc.apache.org/

Page 33: ORC 2015: Faster, Better, Smaller

ORC: In Production

Page 34: ORC 2015: Faster, Better, Smaller

ORC at Facebook

§  Compression: saved more than 1,400 servers worth of storage. (2)
§  Compression: compression ratio increased from 5x to 8x globally. (2)

Page 35: ORC 2015: Faster, Better, Smaller

ORC at Spotify

§  IO: 16x less HDFS read when using ORC versus Avro. (3)
§  CPU: 32x less CPU when using ORC versus Avro. (3)

Page 36: ORC 2015: Faster, Better, Smaller

ORC at Yahoo!

§  Speedup: 6-50x speedup when using ORC versus Text File. (4)
§  Speedup: 1.6-30x speedup when using ORC versus RCFile. (4)

Page 37: ORC 2015: Faster, Better, Smaller

ORC: LLAP and Sub-second

ORC – Pushing for Sub-second

Page 38: ORC 2015: Faster, Better, Smaller

ORC: LLAP

§  JIT Performance for short queries
§  Row-group level caching
§  Asynchronous IO Elevator
§  Multi-threaded Column Vector processing

Page 39: ORC 2015: Faster, Better, Smaller

ORC: Vectorization + SIMD

§  Allocation-free tight inner loops enable the JDK's auto-vectorization
§  Vectors can be filtered early in ORC
§  String dictionary can be used to binary-search
§  Vectorized SIMD Join
    §  Improves performance for single key joins

Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS=" -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure to have hotspot disassembler in $JAVA_HOME/jre/lib

Generated Assembly:
    0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
    0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
    0x00007f13d2e6afba: movslq %eax,%r10
    0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3    ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)

AVX - Vector Addition Packed Double: 4 doubles loaded to 256 bit registers

Page 40: ORC 2015: Faster, Better, Smaller

ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)

select * from tpch_1000.lineitem where l_orderkey=1212000001;

Page 41: ORC 2015: Faster, Better, Smaller

Questions?

Interested? Stop by the Hortonworks booth to learn more

Page 42: ORC 2015: Faster, Better, Smaller

Endnotes
(1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
(2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
(4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3