Programming Hive Reading #4
![Page 1: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/1.jpg)
Programming Hive Reading #4
@just_do_neet
![Page 2: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/2.jpg)
![Page 3: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/3.jpg)
Chapters 11 and 15
•Chapter 11. ‘Other File Formats and Compression’
•Choosing / Enabling / Action / HAR / etc...
•Chapter 15. ‘Customizing Hive File and Record Formats’
•Demystifying DML / File Formats / etc...
•SerDe-related topics are excluded from this presentation.
![Page 4: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/4.jpg)
#11 Determining Installed Codecs
$ hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
  org.apache.hadoop.io.compress.DefaultCodec,
  com.hadoop.compression.lzo.LzoCodec,
  org.apache.hadoop.io.compress.SnappyCodec
![Page 5: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/5.jpg)
#11 Choosing a Compression Codec
•Advantage :
•less network I/O and less disk space.
•Disadvantage :
•CPU overhead.
•in short : it is a trade-off
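A minimal sketch of this trade-off, using the JDK's built-in DEFLATE (`java.util.zip`) rather than a Hadoop codec. The class name `DeflateTradeOff` and the sample log lines are made up for illustration; the actual sizes and timings depend on the data and the machine.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: compares DEFLATE levels on made-up data.
final class DeflateTradeOff {

    /** Compresses input with DEFLATE at the given level (1 = fastest, 9 = smallest). */
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bytes; DEFLATE is lossless. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Made-up, repetitive log-like data (an assumption, not from the slides). */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("2013-01-").append(i % 28 + 1)
              .append("\tuser").append(i % 50)
              .append("\tGET /index.html\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] packed = compress(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level=%d: %d -> %d bytes (%d us)%n",
                    level, data.length, packed.length, micros);
        }
    }
}
```

Running `main` prints the compressed size and elapsed time per level, which makes the CPU-versus-space trade-off visible on your own data.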
![Page 6: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/6.jpg)
#11 Choosing a Compression Codec
•“why do we need different compression schemes?”
•speed
•minimizing size
•‘splittable’ or not.
![Page 7: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/7.jpg)
#11 Choosing a Compression Codec
•“why do we need different compression schemes?”
http://comphadoop.weebly.com/
![Page 8: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/8.jpg)
take a break : algorithm
•lossless compression
•LZ77(LZSS), LZ78, etc...
•DEFLATE (LZ77 with Huffman coding)
•LZH (LZ77 with Static Huffman coding)
•BZIP2 (Burrows–Wheeler transform, move-to-front, Huffman coding)
•lossy
•for JPEG, MPEG, etc... (snip.)
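The move-to-front step named in the BZIP2 bullet can be sketched as follows. This is a textbook version over an explicit alphabet string, not bzip2's actual byte-level implementation; the class name `MoveToFront` and its method signatures are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical textbook sketch of the move-to-front (MTF) transform.
final class MoveToFront {

    /** Replaces each symbol by its current index in the list, then moves it to the front. */
    static int[] encode(String input, String alphabet) {
        List<Character> symbols = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            symbols.add(c);
        }
        int[] out = new int[input.length()];
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int index = symbols.indexOf(c);
            if (index < 0) {
                throw new IllegalArgumentException("symbol not in alphabet: " + c);
            }
            out[i] = index;
            symbols.remove(index);   // remove by position
            symbols.add(0, c);       // recently seen symbols get small indices
        }
        return out;
    }

    /** Inverse transform: replays the same list updates from the indices. */
    static String decode(int[] indices, String alphabet) {
        List<Character> symbols = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            symbols.add(c);
        }
        StringBuilder out = new StringBuilder();
        for (int index : indices) {
            char c = symbols.remove(index);
            out.append(c);
            symbols.add(0, c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        int[] encoded = encode("bananaaa", alphabet);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decode(encoded, alphabet));
    }
}
```

Runs of the same symbol turn into runs of zeros, which is why this step pairs well with a block-sorting transform in front of it and Huffman coding after it.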
![Page 9: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/9.jpg)
take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
![Page 10: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/10.jpg)
take a break : algorithm
http://www.slideshare.net/moaikids/ss-2638826
![Page 11: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/11.jpg)
take a break : algorithm
•Burrows–Wheeler Transform(BWT)
•block sorting
•“abracadabra” + end marker “$” → BWT → “ard$rcaaaabb”
rotations of “abracadabra$”:
abracadabra$
bracadabra$a
racadabra$ab
acadabra$abr
cadabra$abra
adabra$abrac
dabra$abraca
abra$abracad
bra$abracada
ra$abracadab
a$abracadabr
$abracadabra
sorted rotations:
$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra$abra
dabra$abraca
ra$abracadab
racadabra$ab
first column: $aaaaabbcdrr
last column (BWT output): ard$rcaaaabb
original: abracadabra$
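The block-sorting steps above can be sketched in a few lines. This is the naive version that materializes and sorts every rotation, so it is only suitable for short strings; real implementations use suffix arrays. The class name `BurrowsWheeler` is made up, and the sketch assumes the input ends with a unique `$` marker that sorts before every other character.

```java
import java.util.Arrays;

// Hypothetical naive sketch of the Burrows-Wheeler transform (BWT).
final class BurrowsWheeler {

    /** Sorts all rotations of text and returns the last column. */
    static String transform(String text) {
        int n = text.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = text.substring(i) + text.substring(0, i);
        }
        Arrays.sort(rotations);
        StringBuilder lastColumn = new StringBuilder(n);
        for (String rotation : rotations) {
            lastColumn.append(rotation.charAt(n - 1));
        }
        return lastColumn.toString();
    }

    /** Naive inverse: rebuilds the sorted rotation table one column at a time. */
    static String inverse(String bwt) {
        int n = bwt.length();
        String[] table = new String[n];
        Arrays.fill(table, "");
        for (int round = 0; round < n; round++) {
            for (int i = 0; i < n; i++) {
                table[i] = bwt.charAt(i) + table[i];   // prepend the last column
            }
            Arrays.sort(table);
        }
        for (String row : table) {
            if (row.endsWith("$")) {
                return row;   // the rotation that ends with the marker is the original
            }
        }
        throw new IllegalArgumentException("no end marker '$' in input");
    }

    public static void main(String[] args) {
        String bwt = transform("abracadabra$");
        System.out.println(bwt);
        System.out.println(inverse(bwt));
    }
}
```

The output groups equal characters together (for example the run of `a`), which is what makes the later move-to-front and Huffman stages effective.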
![Page 12: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/12.jpg)
take a break : algorithm
•BWT with Suffix Array
•ref. http://d.hatena.ne.jp/naoya/20081016/1224173077
•ref. http://hillbig.cocolog-nifty.com/do/files/2005-12-compInd.ppt
![Page 13: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/13.jpg)
take a break : algorithm
•LZO
•“Compression is comparable in speed to DEFLATE compression.”
•“Very fast decompression”
• http://www.oberhumer.com/opensource/lzo/
![Page 14: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/14.jpg)
take a break : algorithm
•Google Snappy
•“very high speeds and reasonable compression”
• https://code.google.com/p/snappy/
•ref. http://www.slideshare.net/KeigoMachinaga/snappy-servay-8665889
![Page 15: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/15.jpg)
take a break : algorithm
•LZ4
•“very fast lossless compression algorithm”
• https://code.google.com/p/lz4/
•ref. http://www.slideshare.net/komiyaatsushi/dsirnlp-3-lz4
![Page 16: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/16.jpg)
take a break : algorithm
•“Add support for LZ4 compression”
•fix version : 0.23.1, 0.24.0 (CDH4)
•ref. https://issues.apache.org/jira/browse/HADOOP-7657
![Page 17: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/17.jpg)
take a break : Implementing a Codec
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.io.compress.*;

    // "Hoge" is a placeholder name; HogeCompressor, bufferSize and
    // compressionOverhead are defined elsewhere (not shown on the slide).
    public class HogeCodec implements CompressionCodec {
        @Override
        public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
                throws IOException {
            return new BlockCompressorStream(out, compressor, bufferSize, compressionOverhead);
        }

        @Override
        public Class<? extends Compressor> getCompressorType() {
            return HogeCompressor.class;
        }

        @Override
        public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
            return createOutputStream(out, createCompressor());
        }

        @Override
        public Compressor createCompressor() {
            return new HogeCompressor();
        }

        @Override
        public CompressionInputStream createInputStream(InputStream in) throws IOException {
            return createInputStream(in, createDecompressor());
        }
        // ............ (remaining methods omitted on the slide)
    }
ref. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
![Page 18: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/18.jpg)
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•Final Output Compression(hive, mapred)
![Page 19: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/19.jpg)
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting the enable flag: hive.exec.compress.intermediate (Hadoop side: mapred.compress.map.output)
![Page 20: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/20.jpg)
#11 Enabling Compression
•Intermediate Compression(hive, mapred)
•setting the codec: mapred.map.output.compression.codec
![Page 21: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/21.jpg)
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting the enable flag: hive.exec.compress.output (Hadoop side: mapred.output.compress)
![Page 22: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/22.jpg)
#11 Enabling Compression
•Final Output Compression(hive, mapred)
•setting the codec: mapred.output.compression.codec
![Page 23: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/23.jpg)
#11 Sequence File
•Sequence File Format
  • Header
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker every few 100 bytes or so.
•ref. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
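The record layout listed above can be sketched as follows. This is only an illustration of the field order (record length, key length, key, value), not the real SequenceFile wire format: it uses no file header, no sync markers and no Hadoop `Writable` serialization. The class name `RecordLayoutSketch` is made up, and treating the record length as key bytes plus value bytes is an assumption of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: mirrors the field order of an uncompressed record only.
final class RecordLayoutSketch {

    /** Writes [record length][key length][key bytes][value bytes]. */
    static byte[] writeRecord(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(k.length + v.length); // record length (assumed: key + value bytes)
            out.writeInt(k.length);            // key length
            out.write(k);                      // key
            out.write(v);                      // value
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads one record back and returns {key, value}. */
    static String[] readRecord(byte[] record) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(record))) {
            int recordLength = in.readInt();
            int keyLength = in.readInt();
            byte[] k = new byte[keyLength];
            in.readFully(k);
            byte[] v = new byte[recordLength - keyLength]; // value is the remainder
            in.readFully(v);
            return new String[] {
                new String(k, StandardCharsets.UTF_8),
                new String(v, StandardCharsets.UTF_8)
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = writeRecord("k1", "hello");
        String[] pair = readRecord(record);
        System.out.println(record.length + " bytes: " + pair[0] + " -> " + pair[1]);
    }
}
```

Because the value length is implied by the two length fields, a reader can skip over a record without parsing the key or the value.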
![Page 24: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/24.jpg)
#11 Sequence File
•Compression Type
•NONE : records are not compressed
•RECORD : each record's value is compressed individually
•BLOCK : keys and values are collected into blocks and compressed together
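A rough sketch of why the RECORD and BLOCK types differ in size, using the JDK's DEFLATE instead of a Hadoop codec. It compresses values only and ignores keys, headers and sync markers; the class name `RecordVsBlock` and the sample values are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical sketch: per-record compression versus one compressed block.
final class RecordVsBlock {

    /** Returns the DEFLATE-compressed size of input in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** RECORD style: every value is compressed on its own. */
    static int recordCompressedSize(String[] values) {
        int total = 0;
        for (String value : values) {
            total += deflatedSize(value.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** BLOCK style: values are buffered together and compressed once. */
    static int blockCompressedSize(String[] values) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (String value : values) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            block.write(bytes, 0, bytes.length);
        }
        return deflatedSize(block.toByteArray());
    }

    /** Made-up short, similar values (an assumption, not from the slides). */
    static String[] sampleValues() {
        String[] values = new String[200];
        for (int i = 0; i < values.length; i++) {
            values[i] = "status=200 path=/index.html user=" + (i % 10);
        }
        return values;
    }

    public static void main(String[] args) {
        String[] values = sampleValues();
        System.out.println("RECORD: " + recordCompressedSize(values) + " bytes");
        System.out.println("BLOCK:  " + blockCompressedSize(values) + " bytes");
    }
}
```

Short records share little redundancy individually, so the compressor only finds the repetition once many records sit in the same buffer.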
![Page 25: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/25.jpg)
#11 Compression in Action
•(DEMO)
![Page 26: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/26.jpg)
#11 Archive Partition
•Using ‘HAR’
•ref. http://hadoop.apache.org/docs/r1.0.4/hadoop_archives.html
•Archiving
•Unarchiving
hive> SET hive.archive.enabled=true;
hive> ALTER TABLE hoge ARCHIVE PARTITION(folder='fuga');
hive> ALTER TABLE hoge UNARCHIVE PARTITION(folder='fuga');
![Page 27: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/27.jpg)
Break :)
![Page 28: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/28.jpg)
#15 Record Format
•TEXTFILE
•SEQUENCEFILE
•RCFILE
CREATE TABLE hoge (.........)
STORED AS [TEXTFILE|SEQUENCEFILE|RCFILE]
![Page 29: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/29.jpg)
#15 Record Format
•RCFile(Record Columnar File)
•fast data loading
•fast query processing
•highly efficient storage space utilization
•strong adaptivity to dynamic data access patterns.
•ref. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (ICDE’11)
• http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-4.pdf
![Page 30: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/30.jpg)
#15 Record Format
•RCFile Format
•1 record = some Row Group
•1 HDFS Block = one or more Row Groups
•Row Group
  • a sync marker
  • metadata header
  • table data
•uses the RLE algorithm to compress the ‘metadata header’ section.
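A generic run-length encoding (RLE) sketch over integers, to show why it suits a header made of many repeated length values. This is the textbook scheme, not RCFile's actual on-disk encoding; the class name `RunLength` and the flat (value, run length) pair layout are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical textbook RLE: output is a flat list of (value, run length) pairs.
final class RunLength {

    /** Collapses each run of equal values into a (value, run length) pair. */
    static int[] encode(int[] values) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int run = 1;
            while (i + run < values.length && values[i + run] == values[i]) {
                run++;
            }
            out.add(values[i]);
            out.add(run);
            i += run;
        }
        return toArray(out);
    }

    /** Expands (value, run length) pairs back into the original sequence. */
    static int[] decode(int[] pairs) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            for (int k = 0; k < pairs[i + 1]; k++) {
                out.add(pairs[i]);
            }
        }
        return toArray(out);
    }

    private static int[] toArray(List<Integer> list) {
        int[] array = new int[list.size()];
        for (int i = 0; i < array.length; i++) {
            array[i] = list.get(i);
        }
        return array;
    }

    public static void main(String[] args) {
        // Made-up field lengths: fixed-width columns produce long runs.
        int[] fieldLengths = {4, 4, 4, 4, 7, 7, 1};
        int[] encoded = encode(fieldLengths);
        System.out.println(Arrays.toString(encoded));
        System.out.println(Arrays.toString(decode(encoded)));
    }
}
```

When a column holds fixed-width values, its field lengths form one long run, so the encoded header stays small regardless of the number of rows.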
![Page 31: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/31.jpg)
#15 Record Format
•Implementation of RCFile
•Input Format
•o.a.h.h.ql.io.RCFileInputFormat
•Output Format
•o.a.h.h.ql.io.RCFileOutputFormat
•SerDe
•o.a.h.h.serde2.columnar.ColumnarSerDe
![Page 32: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/32.jpg)
#15 Record Format
•Tuning of RCFile
•“hive.io.rcfile.record.buffer.size”
•defines the “RowGroup” size (default: 4MB)
![Page 33: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/33.jpg)
#15 Record Format
•ref. “HDFS and Hive storage - comparing file formats and compression methods”
• http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/
•"In term of file size, the “RCFILE” format with the “default” and “gz” compression achieve the best results."
•"In term of speed, the “RCFILE” formats with the “lzo” and “snappy” are very fast while preserving a high compression rate."
![Page 34: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/34.jpg)
#Appendix - trevni
•ref. https://github.com/cutting/trevni/
•ref. http://avro.apache.org/docs/current/trevni/spec.html
![Page 35: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/35.jpg)
#Appendix - trevni
•file
  • file header
  • column ...... column
•file header
  • magic
  • number of rows
  • number of columns
  • file metadata
  • column metadata
  • column start position
•column
  • number of blocks
  • block descriptor
  • block ...... block
•block descriptor
  • number of rows
  • uncompressed bytes
  • compressed bytes
•block
  • row row row ...... row
•column metadata
  • name, type, codec, etc...
![Page 36: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/36.jpg)
#Appendix - ORCFile
•ref. http://hortonworks.com/blog/100x-faster-hive/
•ref. https://issues.apache.org/jira/browse/HIVE-3874
•ref. https://issues.apache.org/jira/secure/attachment/12564124/OrcFileIntro.pptx
![Page 37: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/37.jpg)
#Appendix - ORCFile
•ref. data size
![Page 38: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/38.jpg)
#Appendix - ORCFile
•ref. comparison
![Page 39: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/39.jpg)
#Appendix - Column-Oriented Storage
•ref. http://arxiv.org/pdf/1105.4252.pdf
![Page 40: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/40.jpg)
#Appendix - more information
http://scholar.google.co.jp/scholar?hl=ja&q=hdfs+columnar&btnG=&lr=
![Page 41: Programming Hive Reading #4](https://reader034.vdocuments.site/reader034/viewer/2022052306/547d050bb4af9faf158b528e/html5/thumbnails/41.jpg)
Thanks for listening :)