performance evaluation of cloudera impala (with comparison to hive)
Post on 14-Jun-2015
Cloudera impala Performance Evaluation
(with Comparison to Hive) Dec. 8, 2012
CELLANT Corp. R&D Strategy Division Yukinori SUDA
@sudabon
About Cloudera impala
• Latest version is 0.3 beta
• Open-sourced implementation inspired by Google Dremel and F1
• Developed by famous Hadoop distributor Cloudera
• Brings real-time, ad-hoc query capability to Apache Hadoop
• Queries data stored in HDFS or Apache HBase
• Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
• Supports TextFile and SequenceFile as Hive storage formats
• Also supports SequenceFile compressed with Snappy, Gzip and Bzip
• Directly accesses the data through a specialized distributed query engine
Architecture
• State Store works as an impala-state-store (statestored) daemon
• Query Planner, Query Coordinator and Query Exec Engine work as an impalad daemon
System Environment
• Install via Cloudera Manager Free Edition
• Master (1 server)
  o HDFS: NameNode, SecondaryNameNode
  o MapReduceV1: JobTracker
  o impala: impalad, impala-state-store (statestored)
• Slave (13 servers)
  o HDFS: DataNode
  o MapReduceV1: TaskTracker
  o impala: impalad
• All servers are connected with 1Gbps Ethernet through an L2 switch
Server Specification
• CPU
  o Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
  o 4GB
• Disk
  o 7,200 rpm SATA mechanical Hard Disk Drive
• OS
  o CentOS 6.2
Benchmark
• Used CDH4.1 + impala version 0.2 and 0.3
• Used hivebench from the open-sourced benchmark tool "HiBench"
  o https://github.com/hibench
• Reduced the datasets to 1/10 scale
  o Default configuration generates a table with 1 billion rows
• Modified the query
  o Removed "INSERT INTO TABLE …" to evaluate read-only performance
  o Removed the "datediff" function (I mistakenly assumed it was not supported)
• Combined several Hive storage formats with several compression methods
  o TextFile, SequenceFile, RCFile
  o No compression, Gzip, Snappy
• Compared by query latency
  o Average latency over 5 measurements
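The measurement procedure above (every storage format crossed with every compression method, each averaged over 5 runs) could be sketched roughly as follows. This is only an illustration, not the harness used for the slides: `average_latency`, `benchmark_matrix` and the `make_runner` callback are hypothetical names, and how a query is actually submitted to Hive or impala is left to the caller. Per the slides, impala at these versions reads only TextFile and SequenceFile, so the RCFile combinations would apply to Hive only.

```python
import itertools
import statistics
import time

# Storage formats and compression methods listed on the slide.
FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["none", "Gzip", "Snappy"]


def average_latency(run_query, runs=5):
    """Call run_query() `runs` times and return the mean wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark_matrix(make_runner, runs=5):
    """Measure every format x compression combination.

    make_runner(fmt, comp) is a hypothetical factory that returns a
    zero-argument callable which executes the query against the table
    stored in that format and compression.
    """
    results = {}
    for fmt, comp in itertools.product(FORMATS, COMPRESSIONS):
        results[(fmt, comp)] = average_latency(make_runner(fmt, comp), runs)
    return results
```

A caller would pass a `make_runner` that, for example, shells out to the Hive or impala command-line client for the matching table; that part is deliberately not shown because it depends on the cluster setup.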
Modified Datasets
• Rankings table
  o 12 million rows
  o Schema
    • pageURL string
    • pageRank int
    • avgDuration int
• Uservisits table
  o 100 million rows
  o Schema
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int
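As a rough illustration of how the two schemas above map onto Hive table definitions, the sketch below builds HiveQL CREATE TABLE statements from the column lists. The helper name `create_table_ddl` and the comma field delimiter are assumptions made for this example; the slides do not show the DDL that HiBench actually generates.

```python
# Column lists copied from the slide; types written in HiveQL spelling.
RANKINGS = [("pageURL", "STRING"), ("pageRank", "INT"), ("avgDuration", "INT")]
USERVISITS = [
    ("sourceIP", "STRING"), ("destURL", "STRING"), ("visitDate", "STRING"),
    ("adRevenue", "DOUBLE"), ("userAgent", "STRING"), ("countryCode", "STRING"),
    ("languageCode", "STRING"), ("searchWord", "STRING"), ("duration", "INT"),
]


def create_table_ddl(name, columns, stored_as="TEXTFILE", delimiter=","):
    """Return a HiveQL CREATE TABLE statement (hypothetical helper).

    The delimiter is an assumption; adjust it to match the generated data.
    """
    cols = ",\n  ".join(f"{col} {typ}" for col, typ in columns)
    return (
        f"CREATE TABLE {name} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS {stored_as}"
    )


if __name__ == "__main__":
    print(create_table_ddl("rankings", RANKINGS, stored_as="SEQUENCEFILE"))
    print(create_table_ddl("uservisits", USERVISITS))
```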
Modified Query
  SELECT
    sourceIP, sum(adRevenue) as totalRevenue, avg(pageRank)
  FROM rankings R
  JOIN (
    SELECT
      sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
  ) NUV
  ON (R.pageURL = NUV.destURL)
  GROUP BY sourceIP
  ORDER BY totalRevenue DESC
  LIMIT 1
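To make the query's semantics concrete, the sketch below runs it against a few made-up rows in an in-memory SQLite database, using the visitDate column from the uservisits schema. SQLite stands in for Hive/impala purely for illustration; the sample rows are invented, only the columns the query touches are created, and `run_sample` is a hypothetical helper. Because visitDate is a string, the date filter is a plain lexicographic comparison, which works for ISO-formatted dates.

```python
import sqlite3

QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_sample():
    """Load invented sample rows and return the single row the query yields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.com", 10, 5), ("b.com", 20, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("1.1.1.1", "a.com", "2000-01-01", 1.0),
            ("1.1.1.1", "b.com", "2000-06-01", 2.5),
            ("2.2.2.2", "a.com", "2000-03-01", 3.0),
            ("2.2.2.2", "b.com", "2005-01-01", 100.0),  # outside the date range
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(run_sample())  # ('1.1.1.1', 3.5, 15.0)
```

The large 2005 visit is filtered out by the date range, so 1.1.1.1 wins with a total revenue of 3.5 and an average page rank of 15.0 over the two pages it visited.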
Benchmark Result (Hive)
Benchmark Result (impala 0.2)
Benchmark Result (impala 0.3)
Conclusion
• Impala is over 10 times faster than MR + Hive
  o Impala 0.3
    • SequenceFile compressed with Snappy: 14.337 seconds
  o Impala 0.2
    • SequenceFile compressed with Gzip: 19.733 seconds
  o Hive
    • RCFile compressed with Snappy: 164.161 seconds
• Hope that impala version 1.0, included in CDH5, will be even faster
  o Support for RCFile and the Trevni columnar format
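The speedup behind the "over 10 times" figure can be checked directly from the three latencies quoted above. The snippet is just this arithmetic; the dictionary keys and the `speedup` helper are names chosen for the example.

```python
# Latencies in seconds, as quoted in the conclusion.
LATENCY_S = {
    "impala 0.3 (SequenceFile + Snappy)": 14.337,
    "impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


if __name__ == "__main__":
    hive = LATENCY_S["Hive (RCFile + Snappy)"]
    for name, latency in LATENCY_S.items():
        if not name.startswith("Hive"):
            print(f"{name}: {speedup(hive, latency):.2f}x faster than Hive")
```

With these numbers impala 0.3 comes out at roughly 11 times faster than Hive and impala 0.2 at roughly 8 times, so the "over 10 times" claim corresponds to the impala 0.3 result.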
Thank you