bigdatain real-time -...

32
BigData in Real-time TCloud Computing 天云趋势 孙振南 [email protected] Impala Introduction 2012/12/13 Beijing Apache Asia Road Show

Upload: others

Post on 20-May-2020

24 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

BigData in Real-time

TCloud Computing 天云趋势

孙振南[email protected]

Impala Introduction

2012/12/13Beijing

Apache Asia Road Show

Page 2: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Background (Disclaimer)

• Impala is NOT an Apache Software Foundation project yet

• Impala uses ASLv2

• The speaker (me) is NOT associated with Cloudera

Page 3: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

WHYImpala

Page 4: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

The Need For Speed

• What’s wrong with MapReduce?

– Batch oriented. Good at complex jobs

– But slow at startup & shuffle

– Programmer friendly

• How about Hive?

– SQL friendly. Still slow as …

• How about HBase?

– Slow data import

– No SQL

Page 5: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

The leads from the leader

• Google BigQuery, the service

– Based on Google Dremel, the paper

• SQL-like interface

• Interactive analysis of PBs data

• Query Execution Tree

– Tasks to sub-tasks, instead of identical distributed tasks

• Columnar storage based on nested ProtoBuffer data

– Faster traversing

• Amazon RedShift is another story…

Page 6: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Open source alternatives?

• Apache Drill

– No substantial progress

– Mailing list msg # droped 80% from Sep to Nov

• Berkeley Shark/Spark

– Shared memory based, good at iteration tasks

– Different component stack

• Cloudera Impala

Page 7: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Positioning

• Compared with MR, it’s all about trade-off

– Complexity or responsiveness

– General purpose or ad-hoc

• MPP-RDB paradigms on top of commodity DFS

– On par performance in some cases

– Extremely cheap

– Linear scalability

Page 8: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

WHATis

Impala

Page 9: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Features

• Distributed SQL on raw HDFS files– Select, where, aggregation, join, – Insert into/overwrite– Text and Sequence files

• Hive compatible “meta store” and interface

– Reuse Hive’s metadata schema, DDL and JDBC/ODBC driver

• Up to 90x times faster, compared with Hive– Purely I/O bound scenario, 3-4X– With joins, 7-45X– With memory cached, 20-90X

Page 10: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Status

• Announced at Oct/2012

– Now 0.3 at Dec/5

– Has been private beta for half-year

– Currently in public beta

– Target GA @ 2013 Q1

• Entirely developed by Cloudera (by now)

– In the past 2 years, 7 full time engineers

• Completely open source, ASLv2

Page 11: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Example$ hive

hive> CREATE TABLE sales (id STRING, item STRING, price int);

hive> load data local inpath ‘/store/sales.txt’ into table sales;

$ impala-shell –impalad=172.16.204.4:21000

[172.16.204.4:21000]> show tables

sales

[172.16.204.4:21000]> select * from sales where price > 100 limit 3

13221 COOKIE 138

38384 DIPER 287

85845 TV 737

Page 12: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Workflow Overview

HDFS DataNode

Impalad

HDFS DataNode

Impalad

HDFS DataNode

Impalad

HDFS DataNode

Impalad

CLIClient

ODBCDriver

Meta dataMySQL

Impala State Store

1-on-1 communication

Simplified multi-way communication

HueBeeswax

HDFS NameNode

Page 13: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

HOWthe Impalaspeed up

Page 14: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

‘MPP’ SQL

• PlanNode– Node of the Depth-First execution plan tree– Various types

• HDFS_SCAN_NODE, HBASE_SCAN_NODE• HASH_JOIN_NODE• AGGREGATION_NODE, SORT_NODE• EXCHANGE_NODE

• Fragment– Atomic executable unit, could be distributed– Contains one or more PlanNodes

• Depends on the data distribution and the SQL statement

Page 15: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

SQL breakdown sample

• There’s a saying that young single males don’t use

coupon or discount as much as others, is it true?

• We can compare the list price and sales price

• Items are a little bit expensive

• Buyers are young, single, male

• Live in major city

Page 16: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

SQL breakdown sample - tables

store_salesss_item_id

ss_customer_idss_sales_price

……customer

c_idc_gender

c_marital_statusc_city……

itemi_item_id

i_list_price……

Page 17: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

SQL breakdown sample – SQL statement

select i_item_id, i_list_price, avg(ss_sales_price) agg1

FROM store_sales

JOIN item on (store_sales.ss_item_id = item.i_item_id)JOIN customer on (store_sales.ss_customer_id = customer.c_id)

where

i_list_price > 1000 andc_gender = 'M' andc_marital_status = 'S‘ and

c_city in ('Beijing','Shanghai','Guangzhou')

group by i_item_id,

order by i_list_price

limit 1000

Page 18: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Execution plan tree

Scansales_store

Scancustomer

Scanitem

Join onitem_id

Join oncustomer_id

Group by &Aggregation

Sort & Limit

Page 19: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Execution plan tree – Distributed!

Scansales_store

Scancustomer

Scanitem

Join onitem_id

Join oncustomer_id

Group by &Local Agg

Total Aggregation

Sort / Limit

Distributable

Indistributable

Page 20: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

From execution plan to fragment

Scansales_store

Scancustomer

Scanitem

Join onitem_id

Join oncustomer_id

Group by &Aggregation

Total Aggregation

Sort / Limit

Page 21: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

From execution plan to fragment

Scansales_store

Scancustomer

Scanitem

Join onitem_id

Join oncustomer_id

Group by &Aggregation

Total Aggregation

Sort / Limit

Page 22: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

From execution plan to fragment

Scansales_store

Scancustomer

Scanitem

Join onitem_id

Join oncustomer_id

Group by &Aggregation

Total Aggregation

Sort / Limit

Page 23: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

THAT’S IT?

Page 24: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

There’s more…

• Written in C++– Only used jFlex/CUP to parse the SQL statement

• Local compilation of fragments– LLVM is used

• Disk awareness– Not just host awareness– dfs.datanode.hdfs-blocks-metadata.enabled– “40% faster”

• Direct read– Not via HDFS NameNode then DataNode then …– dfs.client.read.shortcircuit, dfs.client.read.shortcircuit.skip.checksum

Page 25: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

GOODENOUGH?

Page 26: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

TODOs

• No Data Definition Language yet

• No User Defined Function yet

• No fault tolerance yet

• Avro, RCFile, LZO, Trevni support is on the way– Impala + Trevni will introduce another performance boost

• With more SQL functions compared with BigQuery

• In memory Join only– Will be fixed in GA

• Partition before join, reduce traffic

• Support Hive partitions, but not buckets yet

Page 27: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

INSIDE

Page 28: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

Exec Engine

ImpaladPython CLI

JDBC/ODBC

Hue BeeswaxFrontend

Coordinator

Exec Engine

ThriftMeta data

MySQL

Another Impalad

22000

PlannerjFlex/CUP

22000Backend, ImpalaInternalService

25000

JNI

21000 (Frontend, ImpalaService)

2400023000Register

&Subscribe

State Store

HDFS

HBase

Page 29: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

All C++ code, both be and fe

LLVM IR generation

PlanNode implementation

Math, date, string, time, agg, etc

Coordinator, Executor, etc

main(), thrift server

State store

.thrift, very useful

Frontend, Planner

Python CLI shell

Page 30: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

趋势科技如何使用Hadoop

• 天云趋势由天云科技和趋势科技共同投资成立于2010年3月

• 趋势科技是Hadoop的重度使用者:

– 2006年开始使用, 用于处理Web Reputation和Email Reputation等

– 五个数据中心, 近1000个节点

– 日均处理3.6T日志数据

• 含HoneyPot收到的垃圾邮件但不含爬取的网页等

– 亚洲最早, 也是最大的代码贡献者

– HBase 0.92新功能的主要开发者 (coprocessors, security)

Page 31: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

天云趋势的BigData业务

– TCloud BigData Platform, 集部署、监控、配置优化、数据处理以及扩

展SDK库于一体的集群管理和开发平台

– 丰富的垂直行业开发经验,包括电信、电力、安全等

– 为多家大型传统软件开发厂商提供咨询服务

– Hadoop管理员、开发者培训,也可为企业内部进行定制化培训

http://www.tcloudcomputing.com.cn

Page 32: BigDatain Real-time - sizeofvoid.netsizeofvoid.net/wp-content/uploads/ImpalaIntroduction2.pdfBigDatain Real-time ... CLI Client ODBC Driver Metadata MySQL ImpalaState Store 1-on-1

@少年振南