Overview of the Hive Stinger Initiative
Description: Dr. Eric N. Hanson, Principal Software Development Engineer at Microsoft and Apache Hive committer, presents the recent improvements in Hive.
Transcript:
Overview of the Hive Stinger Initiative
Eric N. Hanson
Principal Software Development Engineer
Microsoft HDInsight Team
30 June 2014
What is Stinger? Umbrella term for…
• Faster queries in Hive
– ORC
– Vectorization
– Tez
• Better language features for analysis
– Window functions etc.
Why Stinger?
• Hive has good functionality
• But it started out sloooowww
• Need to speed it up
– Keep it competitive
– Make it fun to use
ORC
• A good columnstore format
• Run length encoding, value encoding, dictionary encoding
• Layers stream compression over the top
• Written by Owen O’Malley
• http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
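To make the encodings named above concrete, here is a minimal Java sketch of run-length encoding and dictionary encoding on a single column. It is an illustration only: the class and method names are hypothetical, and the layout is deliberately simplified rather than ORC's actual on-disk format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simplified encoders, NOT ORC's real file format.
class ColumnEncodingSketch {

    // Run-length encoding: collapse repeated values into {value, runLength} pairs.
    static List<long[]> runLengthEncode(long[] column) {
        List<long[]> runs = new ArrayList<>();
        int i = 0;
        while (i < column.length) {
            int j = i;
            while (j < column.length && column[j] == column[i]) {
                j++;
            }
            runs.add(new long[] {column[i], j - i});
            i = j;
        }
        return runs;
    }

    // Dictionary encoding: each distinct string gets a small integer id
    // (in order of first appearance); the column then stores only the ids.
    static int[] dictionaryEncode(String[] column, Map<String, Integer> dictionary) {
        int[] ids = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer id = dictionary.get(column[i]);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(column[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }

    public static void main(String[] args) {
        for (long[] run : runLengthEncode(new long[] {7, 7, 7, 7, 3, 3, 9})) {
            System.out.println("value=" + run[0] + " count=" + run[1]);
        }
        Map<String, Integer> dict = new LinkedHashMap<>();
        int[] ids = dictionaryEncode(new String[] {"CA", "WA", "CA", "CA", "WA"}, dict);
        System.out.println(dict + " " + Arrays.toString(ids));
    }
}
```

Both encodings shrink columns with many repeated values, which is why they suit a column store: values of one column sit next to each other and tend to repeat.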
Using ORC
• create table Tbl (col int) stored as orc;
• Table property orc.compress defaults to ZLIB
• See http://www.slideshare.net/oom65/orc-andvectorizationhadoopsummit
TPC-DS File Sizes
*Courtesy of Hortonworks
Vectorization
How the code works (simplified)
// Simplified: adds a scalar to every live row of a long column, one batch at a time.
class LongColumnAddLongScalarExpression {
  int inputColumn;
  int outputColumn;
  long scalar;

  void evaluate(VectorizedRowBatch batch) {
    long[] inVector = ((LongColumnVector) batch.columns[inputColumn]).vector;
    long[] outVector = ((LongColumnVector) batch.columns[outputColumn]).vector;
    if (batch.selectedInUse) {
      // Only the rows listed in batch.selected are still live (e.g. after a filter).
      for (int j = 0; j < batch.size; j++) {
        int i = batch.selected[j];
        outVector[i] = inVector[i] + scalar;
      }
    } else {
      // All rows 0..size-1 are live: a tight loop over plain arrays.
      for (int i = 0; i < batch.size; i++) {
        outVector[i] = inVector[i] + scalar;
      }
    }
  }
}
• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8
Vectorization and Compilation
• Vectorization “instructions” generated from templates
• Examples:
– Int add col-col
– Int add col-scalar
– Int add scalar-col
– Double add col-col
– Double add col-scalar
– Double add scalar-col
– And hundreds more!
• Pre-compilation of expressions
• Reduces the number of function calls and instructions at runtime
• Expressions like (a + 2) / b are evaluated by chaining these primitives
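As a rough sketch of how an expression such as (a + 2) / b can be built from such primitives, the Java below chains a col-scalar add with a col-col divide through a scratch column. Plain arrays stand in for Hive's column vectors, and all class and method names here are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

// Illustrative sketch: plain arrays stand in for Hive's column vectors.
class ExpressionPrimitivesSketch {

    // Primitive 1 (col-scalar add): out[i] = in[i] + scalar
    static void addColScalar(long[] in, long scalar, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = in[i] + scalar;
        }
    }

    // Primitive 2 (col-col divide): out[i] = left[i] / right[i], as a double.
    // Handling of zero divisors and NULLs is omitted in this sketch.
    static void divideColCol(long[] left, long[] right, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = (double) left[i] / right[i];
        }
    }

    // Evaluates (a + 2) / b for a whole batch: two tight loops, and
    // no per-row walk over an expression tree.
    static double[] evaluate(long[] a, long[] b) {
        int n = a.length;
        long[] scratch = new long[n];   // holds the intermediate a + 2
        double[] result = new double[n];
        addColScalar(a, 2, scratch, n);
        divideColCol(scratch, b, result, n);
        return result;
    }

    public static void main(String[] args) {
        double[] r = evaluate(new long[] {2, 4, 8}, new long[] {2, 3, 5});
        System.out.println(Arrays.toString(r)); // prints [2.0, 2.0, 2.0]
    }
}
```

The point of the design is that the expression is resolved once per query, and at runtime each primitive costs one loop per batch instead of one call per row.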
Example of vectorized template code
} else {
  if (batch.selectedInUse) {
    for(int j = 0; j != n; j++) {
      int i = sel[j];
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  } else {
    for(int i = 0; i != n; i++) {
      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];
    }
  }
}
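The placeholder <OperatorSymbol> in the excerpt above is filled in when the template is expanded. A minimal sketch of that idea, assuming simple text substitution: the generator class, the <ClassName> placeholder and the generated class names below are hypothetical and are not taken from Hive's build.

```java
// Hypothetical mini-generator: expands one text template into several
// specialized classes by substituting placeholders.
class TemplateExpansionSketch {

    static final String TEMPLATE =
          "class <ClassName> {\n"
        + "  static void evaluate(long[] vector1, long[] vector2, long[] outputVector, int n) {\n"
        + "    for (int i = 0; i != n; i++) {\n"
        + "      outputVector[i] = vector1[i] <OperatorSymbol> vector2[i];\n"
        + "    }\n"
        + "  }\n"
        + "}\n";

    // Returns the Java source produced for one (class name, operator) pair.
    static String expand(String className, String operatorSymbol) {
        return TEMPLATE
            .replace("<ClassName>", className)
            .replace("<OperatorSymbol>", operatorSymbol);
    }

    public static void main(String[] args) {
        // Illustrative names only.
        String[][] specs = {
            {"LongColAddLongColumn", "+"},
            {"LongColSubtractLongColumn", "-"},
            {"LongColMultiplyLongColumn", "*"},
        };
        for (String[] spec : specs) {
            System.out.println(expand(spec[0], spec[1]));
        }
    }
}
```

One template per expression shape, multiplied by the operators and types it covers, is how a small amount of template text can yield hundreds of specialized loops.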
Using vectorization in Hive
• set hive.vectorized.execution.enabled = true;
• Run the query over ORC tables
• Only works for scalar types
• https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution
• ~5X CPU reduction
Apache Tez (“Speed”)
• Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Lower latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
• YARN ApplicationMaster to run DAG of Tez Tasks
• Task with pluggable Input, Processor and Output
[Diagram: Tez Task - <Input, Processor, Output>]
*Courtesy of Hortonworks
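The "pluggable Input, Processor and Output" idea can be sketched in a few lines of Java. The interfaces and names below are hypothetical and exist only to illustrate the composition; they are not the Tez API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration; these are NOT the Tez API.
class TezTaskSketch {

    interface Input<T> { List<T> read(); }
    interface Processor<T, R> { List<R> process(List<T> records); }
    interface Output<R> { void write(List<R> records); }

    // A task is just the composition of the three pluggable pieces.
    static class Task<T, R> {
        private final Input<T> input;
        private final Processor<T, R> processor;
        private final Output<R> output;

        Task(Input<T> input, Processor<T, R> processor, Output<R> output) {
            this.input = input;
            this.processor = processor;
            this.output = output;
        }

        void run() {
            output.write(processor.process(input.read()));
        }
    }

    // Wires an in-memory input, a "string length" processor and an
    // in-memory output together, then runs the task once.
    static List<Integer> runDemo() {
        List<Integer> sink = new ArrayList<>();
        Task<String, Integer> task = new Task<>(
            () -> Arrays.asList("a", "bb", "ccc"),
            records -> {
                List<Integer> lengths = new ArrayList<>();
                for (String s : records) {
                    lengths.add(s.length());
                }
                return lengths;
            },
            sink::addAll);
        task.run();
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints [1, 2, 3]
    }
}
```

Swapping the input or output (for example an HDFS reader for a shuffle reader) changes what kind of stage the task is without touching the processor.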
Tez: Building blocks for scalable data processing
• Classical ‘Map’: HDFS Input, Map Processor, Sorted Output
• Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
*Courtesy of Hortonworks
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;
[Diagram: two query plans side by side, labeled “Hive – MR” and “Hive – Tez”. Operator labels: SELECT a.state; SELECT c.price (SELECT a.state, c.itemId in the Tez plan); SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price). In the MR plan, the map (M) and reduce (R) stages are separated by HDFS; the Tez plan shows M and R stages with no HDFS between them.]
Tez avoids unneeded writes to HDFS
*Courtesy of Hortonworks
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)
• Hive
– Session launch/shutdown in background (seamless, user not aware)
– Submits query plan directly to Tez Session
Native Hadoop service, not ad-hoc
*Courtesy of Hortonworks
Stinger Phase 3: Interactive Query In Hadoop
Query             Hive 10   Hive 0.11 (Phase 1)   Trunk (Phase 3)   Improvement
TPC-DS Query 27   1400s     39s                   7.2s              190x
TPC-DS Query 82   3200s     65s                   14.9s             200x
Query 27: Pricing Analytics using Star Schema Join
Query 82: Inventory Analytics Joining 2 Large Fact Tables
All Results at Scale Factor 200 (Approximately 200GB Data)
*Courtesy of Hortonworks
How you can use Stinger enhancements
• Use Hive 13
• Use ORC: create table … stored as ORC
• Enable vectorization: set hive.vectorized.execution.enabled=true
• Enable Tez: set hive.execution.engine=tez
• See http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
Reference(s)
• Stinger overview, Strata, fall 2013: http://www.slideshare.net/alanfgates/strata-stingertalk-oct2013?qid=09d16028-bd7e-47d8-8438-34f3242c6f0e&v=qf1&b=&from_search=1
Slides marked “Courtesy of Hortonworks” are from Hortonworks talks