Cloudera Impala + PostgreSQL


DESCRIPTION

Hacking Cloudera Impala to run on a PostgreSQL cluster in MPP style. Performance is verified under typical SQL statements and under concurrent load.

TRANSCRIPT

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu (liuchengzhong@miaozhen.com)

2013.12

Where this story comes from…

• Data gravity
• Why big data
• Why SQL on big data

Today's agenda

• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A

What happened in Miaozhen

• 3 billion ad impressions per day
• 20 TB scanned for report generation every morning
• A 24-server cluster

• Besides this:
– TV Monitor
– Mobile Monitor
– Site Monitor
– …

Before Hadoop

• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for 2TB data scan

• Mobile Monitor
– A Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– A Map-Reduce written in C++
– Handles 30 million to 500 million ad impressions

Problem & Opportunity

• Database cluster
• SQL on Hadoop
• Miscellaneous data

• Requirements
– Most data is relational
– SQL interface

SQL on Hadoop

• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

[Stack diagram: Hive and Pig sit on MapReduce over HDFS, while Impala/Drill/Pivotal/Presto query HDFS directly — latency matters]

What’s this

• A kind of MPP engine
• In-memory processing
• Small-to-big joins
– Broadcast join (see the sketch below)
• Small result sizes
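The broadcast join in one sketch, using Impala's square-bracket join hint (the hint syntax is real Impala SQL; the table and column names here are hypothetical):

-- hypothetical big fact table impressions, small dimension table campaigns
-- [broadcast] replicates the small right-hand table to every node that
-- scans a fragment of the big table, so the big table never moves
select c.name, count(*) as hits
from impressions i
join [broadcast] campaigns c on i.campaign_id = c.id
group by c.name;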

Why Cloudera Impala

• The team moves fast
– UDFs coming out
– A better join strategy on the way

• Good code base
– Modular
– Easy to add subclasses

• Really fast
– LLVM code generation
– Distributed aggregation tree
– In-situ data processing (inside storage)
– 80s / 95s on the uv test

Typical Arch.

[Diagram: a SQL interface and a shared Meta Store in front of three nodes, each running its own Query Planner, Coordinator, and Exec Engine]

Our target

• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Fast

• A mixed-data-source MPP query engine
– Joins two tables from different sources
– In fact…

Hacking… where to start

• Add, don't change
– Scan node type
– DB meta info

• Put changes in configuration
– Thrift protocol updates
• TDBHostInfo
• TDBScanNode

Front end

• Meta store update
– Link data to the table name
– Table location management (see the sketch after this list)

• Front end
– Compute table locations
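One way to picture the table-location metadata: a small mapping table from each logical table to the PostgreSQL host that holds each of its shards. The talk does not give the actual schema, so every name below is hypothetical:

-- hypothetical meta-store mapping: which pg host serves which shard of a table
create table table_location (
    table_name text not null,
    shard_id   int  not null,
    pg_host    text not null,
    pg_port    int  not null default 5432,
    primary key (table_name, shard_id)
);

The front end would then resolve a scanned table to its (host, shard) list and hand one scan fragment per shard to the coordinator.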

Back end

• Coordinator
– pg host

• New scan node type
– db scan node
• Pg scan node
• psql library, using a cursor (see the sketch below)
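In effect, the pg scan node runs a cursor loop against each PostgreSQL shard, so rows stream back in batches instead of materializing the whole result at once. A minimal sketch in plain PostgreSQL SQL (table and column names hypothetical):

begin;
declare impala_cur cursor for
    select id, campaign from t;  -- the fragment's predicates would be pushed down here
fetch 10000 from impala_cur;     -- repeated until a fetch returns no rows
close impala_cur;
commit;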

SQL Plan

• select count(distinct id) from table
– An MR-like process; the plan tree, root first:

Aggr.: sum(count(id))
Exchange node
Aggr.: count(id)
Aggr.: group by id
Exchange node
Aggr.: group by id
HDFS/PG scan
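This plan is the same trick as hand-rewriting the distinct into a two-stage aggregation, which is a useful mental model:

-- logically equivalent to: select count(distinct id) from t
select count(id)
from (select id from t group by id) dedup;

Each node first dedups its local ids (the lower group by), the exchange repartitions rows by id so the upper group by dedups globally, and the top sum(count(id)) adds up the per-node partial counts.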

Env.

• Ad impression logs
– 150 million lines, 100 KB/line

• 3 servers
– 24 cores
– 32 GB memory
– 2 TB × 12 HDDs
– 100 Mbps LAN

• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'

Performance

[Chart: runtimes in seconds (y-axis 0–700) for the three queries, comparing impala, hive, and pg+impala]

• Group-by speed per core: 20M rows/s

With index
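The point query (select * from t where id = 'xxxxxxxx') is where an index pays off: each PostgreSQL shard can answer with a b-tree probe instead of a full scan. A hypothetical one-liner (index name and column are illustrative), run on every shard:

create index t_id_idx on t (id);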

Codegen on/off

[Chart: runtimes in seconds (y-axis 0–100) for the uv_test, distinct, and duplicated queries, with codegen enabled vs. disabled]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;

Multi-user

Conclusion

• Source quality
– Readable
– Google C++ style
– Robust

• An MPP solution based on PG
– Proven performance
– Easy to scale

• Mixed-engine usage
– HDFS and DB

What’s next

• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover

Thanks!

Q & A
