Cloudera Impala
Real Time Query for HDFS and HBase

Alexander Alten-Lorenz, Cloudera Inc.

Thursday, July 4, 2013

Beyond Batch

What is Impala

Capability

Architecture

Beyond Batch

For some things MapReduce is just too slow

Apache Hive:
  MapReduce execution engine
  High-latency, low throughput
  High runtime overhead

Google realized this early on
  Analysts wanted fast, interactive results

Dremel

Google paper (2010)
  “scalable, interactive ad-hoc query system for analysis of read-only nested data”

Columnar storage format
Distributed scalable aggregation

“capable of running aggregation queries over trillion-row tables in seconds”

http://research.google.com/pubs/pub36632.html
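To illustrate why a columnar storage format helps aggregation queries, here is a minimal Python sketch. The table, column names, and values are made up for illustration, and the in-memory lists only stand in for a real on-disk layout; this is not Dremel's or Parquet's actual encoding.

```python
# Hypothetical three-row table; names and values are illustrative only.
rows = [
    {"user": "a", "country": "DE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "DE", "bytes": 80},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every full record is visited even though one field is needed.
total_from_rows = sum(row["bytes"] for row in rows)

# Column layout: only the "bytes" column is scanned.
total_from_columns = sum(columns["bytes"])
```

Both totals are the same; the difference is how much data has to be touched to compute them, which matters when a table has many columns and the query reads only a few.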

Impala: Goals

General-purpose SQL query engine for Hadoop
  For analytical and transactional workloads
  Support queries that take μs to hours

Run directly with Hadoop
  Collocated daemons
  Same file formats
  Same storage managers (NN, metastore)

Impala: Goals

High performance
  C++
  Runtime code generation (LLVM)
  Direct access to data (no MapReduce)

Retain user experience
  Easy for Hive users to migrate

100% open-source
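Runtime code generation means compiling code specialized to one query instead of interpreting a generic expression tree for every row. The sketch below is only an analogy in Python, using the built-in `compile` and `exec`; Impala does this in C++ with LLVM, and the tuple-based expression format and all names here are hypothetical.

```python
# Hypothetical expression tree for: bytes > 100 AND country == "DE"
expr = ("and", (">", "bytes", 100), ("==", "country", "DE"))

def interpret(expr, row):
    """Generic evaluator: walks the tree again for every row."""
    op = expr[0]
    if op == "and":
        return interpret(expr[1], row) and interpret(expr[2], row)
    _, column, constant = expr
    if op == ">":
        return row[column] > constant
    if op == "==":
        return row[column] == constant
    raise ValueError(f"unknown operator: {op}")

def to_source(expr):
    """Translate the tree into a Python expression string, once per query."""
    op = expr[0]
    if op == "and":
        return f"({to_source(expr[1])} and {to_source(expr[2])})"
    _, column, constant = expr
    return f"(row[{column!r}] {op} {constant!r})"

def generate(expr):
    """Compile a function specialized to this one predicate."""
    source = f"def predicate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["predicate"]

predicate = generate(expr)
rows = [{"bytes": 120, "country": "DE"}, {"bytes": 300, "country": "US"}]
matches = [row for row in rows if predicate(row)]
```

The generated function has no operator dispatch left in its per-row path; the tree is walked once, when the query is prepared.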

Impala: Capability

HiveQL (subset of SQL92)
  select, project, join, union, subqueries, aggregation, insert, alter, order by (with limit)
  DDL

Directly queries data in HDFS & HBase
  Text files (compressed)
  Sequence files (snappy/gzip)
  Avro & Parquet

Impala: Capability

Familiar and unified platform
  Uses Hive’s metastore
  Submit queries via ODBC | Beeswax Thrift API

Query is distributed to nodes with relevant data
Process-to-process data exchange
Kerberos authentication
No fault tolerance
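The idea of distributing a query to the nodes that hold the relevant data can be sketched as a scatter-gather aggregation: each node aggregates its local rows, and a coordinator merges the partial results. This is a simplified single-process illustration, not Impala's implementation; the node names, data placement, and function names are assumptions.

```python
from collections import Counter

# Hypothetical data placement: node name -> locally stored (country, bytes) rows.
node_data = {
    "node1": [("DE", 120), ("US", 300)],
    "node2": [("DE", 80)],
    "node3": [("FR", 50), ("US", 10)],
}

def local_aggregate(rows):
    """Runs on each node: sum bytes per country over local rows only."""
    partial = Counter()
    for country, nbytes in rows:
        partial[country] += nbytes
    return partial

def coordinate(node_data):
    """Coordinator: collect the partial results and merge them."""
    merged = Counter()
    for rows in node_data.values():
        merged.update(local_aggregate(rows))  # Counter.update adds counts
    return dict(merged)

result = coordinate(node_data)
```

As with the "no fault tolerance" point above, the sketch has no retry logic: an exception in any `local_aggregate` call fails the whole query, which then has to be rerun.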

Impala: Performance

Greater disk throughput
  ~100MB/sec/disk
  I/O-bound workloads faster by 3-4x

Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x)

Queries that run against in-memory cached data see a significant speedup (up to 90x)
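The per-disk throughput figure can be turned into a back-of-the-envelope scan time. Only the ~100MB/sec/disk number comes from the slide; the cluster size, disks per node, and data volume below are assumptions chosen for the arithmetic.

```python
def scan_seconds(data_gb, nodes, disks_per_node, mb_per_sec_per_disk=100):
    """Estimate the time for a full scan when all disks read in parallel."""
    total_mb = data_gb * 1024
    aggregate_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk
    return total_mb / aggregate_mb_per_sec

# Hypothetical cluster: 10 nodes with 12 disks each, scanning 1 TB (1024 GB).
estimate = scan_seconds(data_gb=1024, nodes=10, disks_per_node=12)
```

Under these assumptions the estimate is roughly 87 seconds. It is an idealized lower bound for an I/O-bound scan: it ignores CPU cost, network exchange, and uneven data distribution.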

Impala: Architecture

impalad
  Runs on every node
  Handles client requests (ODBC, Thrift)
  Handles query planning & execution

statestored
  Provides name service
  Metadata distribution
  Used for finding data

Current limitations

1.0.1 (available since May 2013)
  No SerDes
  No User Defined Functions (UDFs)
  impalads read the metastore at startup
  Metadata is refreshed per command line

Futures

DDL support (CREATE)
Rudimentary cost-based optimizer (CBO)
Metadata distribution through statestored
Columnar storage format like Dremel’s

Impala + Parquet = Dremel superset

impala-user@cloudera.org
alexander@cloudera.com

@mapredit
mapredit.blogspot.com

Web: http://goo.gl/7sxdp

Cloudera Impala - HUG Karlsruhe, July 04, 2013
