
Cloudera Impala Performance Evaluation (with Comparison to Hive)

Dec. 8, 2012

CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

About Cloudera Impala
•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip and Bzip
•  Accesses the data directly through a specialized distributed query engine

Architecture
•  The State Store runs as the impala-state-store (statestored) daemon
•  The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon

System Environment
•  Installed via Cloudera Manager Free Edition
•  Master (1 server)
   o  HDFS: NameNode, SecondaryNameNode
   o  MapReduceV1: JobTracker
   o  Impala: impalad, impala-state-store (statestored)
•  Slave (13 servers)
   o  HDFS: DataNode
   o  MapReduceV1: TaskTracker
   o  Impala: impalad
•  All servers are connected with 1 Gbps Ethernet through an L2 switch

Server Specification
•  CPU
   o  Intel Core 2 Duo 2.13 GHz with Hyper Threading
•  Memory
   o  4 GB
•  Disk
   o  7,200 rpm SATA mechanical hard disk drive
•  OS
   o  CentOS 6.2

Benchmark
•  Used CDH4.1 + Impala versions 0.2 and 0.3
•  Used hivebench from the open-source benchmark tool “HiBench”
   o  https://github.com/hibench
•  Reduced the datasets to 1/10 scale
   o  The default configuration generates a table with 1 billion rows
•  Modified the query
   o  Removed “INSERT INTO TABLE …” to evaluate read-only performance
   o  Removed the “datediff” function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several compression methods
   o  TextFile, SequenceFile, RCFile
   o  No compression, Gzip, Snappy
•  Compared by query latency
   o  Average latency over 5 measurements
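The measurement procedure above (each storage format combined with each compression method, latency averaged over 5 runs) could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the harness used for these slides: `run_query` is a hypothetical placeholder for whatever submits the query to Hive or Impala, and the function names are illustrative only. Per the slides, this Impala version reads only TextFile and SequenceFile, so the RCFile combinations would apply to Hive.

```python
import itertools
import statistics
import time

STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["none", "Gzip", "Snappy"]
RUNS = 5  # the slides average over 5 measurements


def configurations():
    """Return every storage format paired with every compression method."""
    return list(itertools.product(STORAGE_FORMATS, COMPRESSIONS))


def average_latency(run_query, runs=RUNS, clock=time.perf_counter):
    """Call run_query `runs` times and return the mean elapsed seconds."""
    latencies = []
    for _ in range(runs):
        start = clock()
        run_query()
        latencies.append(clock() - start)
    return statistics.mean(latencies)


if __name__ == "__main__":
    def run_query():
        # Hypothetical stand-in: replace with a real Hive or Impala submission.
        time.sleep(0.01)

    for storage, compression in configurations():
        print(storage, compression, round(average_latency(run_query), 3))
```

Passing the clock as a parameter keeps the averaging logic easy to check with a fake timer, independent of any cluster.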

Modified Datasets
•  Rankings table
   o  12 million rows
   o  Schema
      •  pageURL string
      •  pageRank int
      •  avgDuration int
•  Uservisits table
   o  100 million rows
   o  Schema
      •  sourceIP string
      •  destURL string
      •  visitDate string
      •  adRevenue double
      •  userAgent string
      •  countryCode string
      •  languageCode string
      •  searchWord string
      •  duration int

Modified Query

SELECT
    sourceIP,
    sum(adRevenue) AS totalRevenue,
    avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01'
      AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1

Benchmark Result (Hive)

Benchmark Result (Impala 0.2)

Benchmark Result (Impala 0.3)

Conclusion
•  Impala is over 10 times faster than MR + Hive
   o  Impala 0.3
      •  SequenceFile compressed with Snappy: 14.337 seconds
   o  Impala 0.2
      •  SequenceFile compressed with Gzip: 19.733 seconds
   o  Hive
      •  RCFile compressed with Snappy: 164.161 seconds
•  Hopefully Impala 1.0, to be included in CDH5, will be even faster
   o  Support for RCFile and the Trevni columnar format
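The speedup behind the headline can be worked out from the three latencies listed in the conclusion. The short sketch below only does that arithmetic; the dictionary keys and the `speedup` helper are illustrative names, not part of the benchmark.

```python
# Latencies in seconds, copied from the conclusion slide.
LATENCY = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


hive = LATENCY["Hive (RCFile + Snappy)"]
speedup_03 = speedup(hive, LATENCY["Impala 0.3 (SequenceFile + Snappy)"])
speedup_02 = speedup(hive, LATENCY["Impala 0.2 (SequenceFile + Gzip)"])

print(f"Impala 0.3 vs Hive: {speedup_03:.1f}x")
print(f"Impala 0.2 vs Hive: {speedup_02:.1f}x")
```

With these figures the ratio is about 11.5x for Impala 0.3 and about 8.3x for Impala 0.2, so the "over 10 times" figure corresponds to the 0.3 result.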

Thank you
