Cloudera Impala + PostgreSQL


DESCRIPTION

Hacking Cloudera Impala to run on a PostgreSQL cluster in MPP style. Performance is verified under typical SQL statements and under concurrent load.

TRANSCRIPT

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu (liuchengzhong@miaozhen.com)

2013.12

Where this story comes from…

• Data gravity
• Why big data
• Why SQL on big data

Today's agenda

• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A

What happened in Miaozhen

• 3 billion ad impressions per day
• 20 TB scanned for report generation every morning
• A 24-server cluster

• Besides this:
– TV Monitor
– Mobile Monitor
– Site Monitor
– …

Before Hadoop

• Scrat
– PostgreSQL 9.1 cluster
– Wrote a simple proxy
– <2s for 2TB data scan

• Mobile Monitor
– A Hadoop-like distributed computing system
– RabbitMQ + 3 computing servers
– A Map-Reduce written in C++
– Handles 30 million to 500 million ad impressions

Problem & Opportunity

• Database cluster
• SQL on Hadoop
• Miscellaneous data

• Requirements
– Most data is relational
– SQL interface

SQL on Hadoop

• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

[Stack diagram: Hive and Pig sit on MapReduce over HDFS, while Impala/Drill/Pivotal/Presto query HDFS directly — latency matters]

What’s this

• A kind of MPP engine
• In-memory processing
• Small-to-big joins
– Broadcast join (see the sketch below)
• Small result sizes
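The broadcast join in one sketch, using Impala's square-bracket join hint (the hint syntax is real Impala SQL; the table and column names here are hypothetical):

-- hypothetical big fact table impressions, small dimension table campaigns
-- [broadcast] replicates the small right-hand table to every node that
-- scans a fragment of the big table, so the big table never moves
select c.name, count(*) as hits
from impressions i
join [broadcast] campaigns c on i.campaign_id = c.id
group by c.name;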

Why Cloudera Impala

• The team moves fast
– UDFs coming out
– A better join strategy on the way

• Good code base
– Modular
– Easy to add subclasses

• Really fast
– LLVM code generation
– Distributed aggregation tree
– In-situ data processing (inside storage)
– 80s / 95s on the uv test

Typical Arch.

[Diagram: a SQL interface and a shared Meta Store in front of three nodes, each running its own Query Planner, Coordinator, and Exec Engine]

Our target

• An MPP database
– Built on PostgreSQL 9.1
– Scales well
– Fast

• A mixed-data-source MPP query engine
– Joins two tables from different sources
– In fact…

Hacking… where to start

• Add, don't change
– Scan node type
– DB meta info

• Put changes in configuration
– Thrift protocol updates
• TDBHostInfo
• TDBScanNode

Front end

• Meta store update
– Link data to the table name
– Table location management (see the sketch after this list)

• Front end
– Compute table locations
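One way to picture the table-location metadata: a small mapping table from each logical table to the PostgreSQL host that holds each of its shards. The talk does not give the actual schema, so every name below is hypothetical:

-- hypothetical meta-store mapping: which pg host serves which shard of a table
create table table_location (
    table_name text not null,
    shard_id   int  not null,
    pg_host    text not null,
    pg_port    int  not null default 5432,
    primary key (table_name, shard_id)
);

The front end would then resolve a scanned table to its (host, shard) list and hand one scan fragment per shard to the coordinator.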

Back end

• Coordinator
– pg host

• New scan node type
– db scan node
• Pg scan node
• psql library, using a cursor (see the sketch below)
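In effect, the pg scan node runs a cursor loop against each PostgreSQL shard, so rows stream back in batches instead of materializing the whole result at once. A minimal sketch in plain PostgreSQL SQL (table and column names hypothetical):

begin;
declare impala_cur cursor for
    select id, campaign from t;  -- the fragment's predicates would be pushed down here
fetch 10000 from impala_cur;     -- repeated until a fetch returns no rows
close impala_cur;
commit;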

SQL Plan

• select count(distinct id) from table
– An MR-like process; the plan tree, root first:

Aggr.: sum(count(id))
Exchange node
Aggr.: count(id)
Aggr.: group by id
Exchange node
Aggr.: group by id
HDFS/PG scan
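This plan is the same trick as hand-rewriting the distinct into a two-stage aggregation, which is a useful mental model:

-- logically equivalent to: select count(distinct id) from t
select count(id)
from (select id from t group by id) dedup;

Each node first dedups its local ids (the lower group by), the exchange repartitions rows by id so the upper group by dedups globally, and the top sum(count(id)) adds up the per-node partial counts.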

Env.

• Ad impression logs
– 150 million lines, 100 KB/line

• 3 servers
– 24 cores
– 32 GB memory
– 2 TB × 12 HDDs
– 100 Mbps LAN

• Queries
– select count(id) from t group by campaign
– select count(distinct id) from t group by campaign
– select * from t where id = 'xxxxxxxx'

Performance

[Chart: runtimes in seconds (y-axis 0–700) for the three queries, comparing impala, hive, and pg+impala]

• Group-by speed per core: 20M rows/s

With index
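The point query (select * from t where id = 'xxxxxxxx') is where an index pays off: each PostgreSQL shard can answer with a b-tree probe instead of a full scan. A hypothetical one-liner (index name and column are illustrative), run on every shard:

create index t_id_idx on t (id);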

Codegen on/off

[Chart: runtimes in seconds (y-axis 0–100) for the uv_test, distinct, and duplicated queries, with codegen enabled vs. disabled]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;

Multi-user

Conclusion

• Source quality
– Readable
– Google C++ style
– Robust

• An MPP solution based on PG
– Proven performance
– Easy to scale

• Mixed-engine usage
– HDFS and DB

What’s next

• YARN integration
• UDFs
• Joins with big tables
• BI roadmap
• Failover

Thanks!

Q & A
