Impala and BigQuery By David Gruzman BigDataCraft.com


Post on 03-Jun-2018


Page 1: Impala and BigQuery (1)

8/12/2019 Impala and BigQuery (1)

http://slidepdf.com/reader/full/impala-and-bigquery-1 1/47

Impala and BigQuery

By David Gruzman

BigDataCraft.com


 Impala and BigQuery

by David Gruzman

► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.

► Impala is an open-source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.


Today's agenda

► Overview of Dremel as a technology

► Overview of BigQuery

► A few words about Impala

► DG Mediamind use case

► Deeper insights into Impala

► Conclusions

► Q&A


Why Dremel?

► Google was the first to have MapReduce.

► Google was the first to face MapReduce's main problem: latency. The problem also propagated to the engines built on top of MapReduce.

► It is logical that Google was the first to address it by developing a real-time query capability for big data.


How Dremel is used at Google

► Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)

► An analyst can run many fast queries using Dremel.

► After getting a good idea of what is needed, run a slow MapReduce job (or SQL based on MapReduce) to get precise results.


Why Dremel is unique

► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.

► I mean that it is the only engine capable of producing results over terabytes of data in seconds!

► The main idea (my guess) is to harness a huge cluster of machines for a single query.


Dremel as a technology

► Novel hierarchical columnar format

► LLVM-based code generation

► Distributed aggregation tree

► In-situ data processing (inside the storage)


Dremel : Aggregation tree
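
The slide's diagram did not survive in this transcript. As a rough illustration of the idea only, here is a minimal single-process sketch of a distributed aggregation tree: each leaf computes a partial GROUP BY count over its own shard, and intermediate levels merge partial results until one result remains at the root. The function names, the `fanout` parameter, and the sample data are assumptions made for this sketch, not Dremel's actual interfaces.

```python
from collections import Counter
from typing import Dict, List


def leaf_aggregate(rows: List[Dict[str, str]], key: str) -> Counter:
    """Leaf server: count rows per group over its local shard only."""
    return Counter(row[key] for row in rows)


def merge(partials: List[Counter]) -> Counter:
    """Intermediate server: combine partial counts from its children."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def tree_aggregate(shards: List[List[Dict[str, str]]], key: str, fanout: int = 2) -> Counter:
    """Aggregate bottom-up; each tree level merges `fanout` children at a time."""
    level = [leaf_aggregate(shard, key) for shard in shards]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()


# Hypothetical data: three shards, grouped by country.
shards = [
    [{"country": "us"}, {"country": "il"}],
    [{"country": "us"}],
    [{"country": "il"}, {"country": "il"}],
]
result = tree_aggregate(shards, "country")
print(dict(result))
```

Because only small partial aggregates travel up the tree, no single node has to see the raw rows of the whole table.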


Dremel : Nested columnar format
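
The figure for this slide is also missing from the transcript. The sketch below is a deliberately simplified illustration of column striping for nested records: every leaf field becomes its own column, addressed by a dotted path, so a query touching one field reads one column. It handles only nested (non-repeated) fields and omits the repetition and definition levels that the real Dremel format uses to encode repeated and optional fields. All names and sample records are assumptions.

```python
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Map a nested record to {dotted.path: leaf value}."""
    flat: Dict[str, Any] = {}
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Store each leaf path as its own column; missing values become None."""
    columns: Dict[str, List[Any]] = {}
    for i, record in enumerate(records):
        flat = flatten(record)
        for path in flat:
            # A path first seen at record i is padded for the earlier records.
            columns.setdefault(path, [None] * i)
        for path, column in columns.items():
            column.append(flat.get(path))
    return columns


# Hypothetical records; the second one has no user.country.
records = [
    {"user": {"id": 1, "country": "us"}, "clicks": 3},
    {"user": {"id": 2}, "clicks": 5},
]
columns = to_columns(records)
total_clicks = sum(columns["clicks"])  # reads the "clicks" column only
print(columns)
print(total_clicks)
```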


BigQuery

► Service built by Google on top of the Dremel engine

► The only query engine as a service (known to me) working with big data

► Query time does not depend on data size


BigQuery main capabilities

► Aggregations

► Join of big table to small table.

► Join of two big tables (recently added)

► Hierarchical data format. It makes pre-aggregations cheaper.


Main limitations

► Small result size

► Intermediate results should not exceed memory size.

► No “external tables”


Why BigQuery is not popular


So, why is BigQuery not popular?

► Data is not created in the Google cloud. It is hard and impractical to move big data. It is heavy, after all.

► Google tends to change its APIs. BigQuery has also changed over the last years. It is hard to build a business on it.

► Many companies in Internet-related businesses are wary of sharing data with Google.

► It is expensive. At $35 per TB, bills can reach thousands of dollars per day.
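
To make the cost bullet concrete, here is a small worked calculation using the $35 per TB price quoted on this slide. The daily scan volume of 30 TB is a hypothetical figure chosen only for illustration.

```python
PRICE_PER_TB_USD = 35.0  # price quoted on the slide


def daily_bill(tb_scanned_per_day: float) -> float:
    """Daily cost in USD for a given scanned volume."""
    return tb_scanned_per_day * PRICE_PER_TB_USD


def tb_for_budget(budget_usd: float) -> float:
    """How many TB can be scanned for a given budget."""
    return budget_usd / PRICE_PER_TB_USD


bill = daily_bill(30)            # hypothetical workload: 30 TB scanned per day
volume = tb_for_budget(1000.0)   # TB that a $1000 daily budget buys
print(bill, round(volume, 1))
```

At this price a workload scanning a few tens of terabytes per day already lands in the thousand-dollar range.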


Dremel


At the same time, it is technically good

► I got references from a company doing serious testing.

► Martin Fowler's company also tested it and gave very good feedback.


Question to all of you

Why did your organization decide not to use Google's BigQuery?


Where we can find Impala


What is Impala

► Massively parallel processing (MPP) database engine, developed by Cloudera.

► Integrated into the Hadoop stack at the same level as MapReduce, not above it (as Hive and Pig are).

[Diagram: HDFS at the bottom; MapReduce and Impala both sit directly on HDFS; Hive and Pig sit on top of MapReduce]


Why Impala

► Data has gravity.

► Today a lot of data lives in HDFS.

► It is not practical to move big data.

► It is practical to bring the engine to the data.

► At the same time, MapReduce is not a must.

Impala processes data in the Hadoop cluster without using MapReduce.


MapReduce bypass

► Several other modern database engines have also seized the opportunity to bypass MapReduce and work directly with HDFS.

► They take various approaches.


MapReduce Bypass

► Existing MPP databases, like Greenplum, store their external tables in HDFS.


MapReduce bypass

► Jethrodata stores data in its own format on HDFS and also works with it without the MR layer.

► Its proprietary format enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.


Use case from DG

I think it will be a typical case in the future.

► DG is using Hadoop and Hive.

► They are evaluating Impala to do part of the work more efficiently.

► After their case presentation we will come back to discuss deeper insights into Impala.


Again – Impala has a different place than Pig and Hive

[Diagram: HDFS at the bottom; MapReduce and Impala both sit directly on HDFS; Hive and Pig sit on top of MapReduce]


Impala architecture


Impala – Dremel traces

► LLVM code generation

► It is really fast

► C++ as the implementation language (not Java...)

► Simple query engine. It actually does only things that can be done in memory.

► Broadcast join algorithm is implemented
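
A broadcast join, as named in the last bullet, can be sketched as follows: the small table is turned into an in-memory hash table that every node would receive in full, and each node then streams its own partition of the big table against it. This is a single-process illustration under that assumption; the function, table, and column names are hypothetical and do not reflect Impala's internal code.

```python
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


def broadcast_join(big_partitions: List[List[Row]], small_table: List[Row], key: str) -> Iterator[Row]:
    """Inner join of a partitioned big table with a small, fully replicated table."""
    # Build the hash table once; in a cluster this copy is sent to every node.
    lookup: Dict[Any, List[Row]] = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)

    # Each partition would be scanned by its own node; the big table never moves.
    for partition in big_partitions:
        for row in partition:
            for match in lookup.get(row[key], []):
                yield {**match, **row}


# Hypothetical data: click events (big, partitioned) joined to ad names (small).
events = [
    [{"ad_id": 1, "clicks": 10}],
    [{"ad_id": 2, "clicks": 5}, {"ad_id": 3, "clicks": 1}],
]
ads = [{"ad_id": 1, "name": "a"}, {"ad_id": 2, "name": "b"}]
joined = list(broadcast_join(events, ads, "ad_id"))
print(joined)
```

The approach works only while the small table fits in each node's memory, which matches the "big table to small table" join pattern.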


LLVM code generation

► Assume you want to write custom code for a specific query. It will be super efficient.

► Code generation automates this process for each query.

► We actually need to super-optimize the inner loop doing filtering (WHERE) and GROUP BY.

► LLVM enables us to compile into native code in a fraction of a second.

► LLVM enables us to use new CPU capabilities like SSE in a portable way.
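
The idea of generating code per query can be illustrated without LLVM. The sketch below swaps in Python's built-in `compile` and `exec` as a stand-in: it emits source text for one specific filter plus GROUP BY query, with column names and the threshold baked in as constants, and compiles it at run time. The query shape, function name, and sample rows are assumptions; a real engine would emit native code rather than Python.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def generate_query(filter_col: str, threshold: float, group_col: str, sum_col: str) -> Callable[[List[Row]], Dict[Any, Any]]:
    """Generate a function specialized for:
    SELECT group_col, SUM(sum_col) WHERE filter_col > threshold GROUP BY group_col
    """
    # Column names and the threshold become literals in the generated inner loop,
    # so nothing about the query is interpreted per row.
    source = f'''
def run(rows):
    totals = {{}}
    for row in rows:
        if row[{filter_col!r}] > {threshold!r}:
            key = row[{group_col!r}]
            totals[key] = totals.get(key, 0) + row[{sum_col!r}]
    return totals
'''
    namespace: Dict[str, Any] = {}
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["run"]


# Hypothetical rows and query.
rows = [
    {"price": 5, "flag": "A", "qty": 1},
    {"price": 20, "flag": "A", "qty": 2},
    {"price": 30, "flag": "B", "qty": 3},
    {"price": 40, "flag": "A", "qty": 4},
]
query = generate_query("price", 10, "flag", "qty")
totals = query(rows)
print(totals)
```

The generated function is built once per query and then reused for every row, which is where the saving over a generic interpreter loop comes from.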


Why is code generation interesting?

► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.

► I had cases where using such technology was game changing.


Impala – Hive traces

► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.

► Impala shares the metastore with Hive, which makes adoption very simple.

► Internally Impala has a well-defined way to add new formats.


Impala vs MPP

► It usually takes many years to create an MPP database.

► There are serious simplifications:

► The data is read-only.

► There is actually no DBMS, only a query engine.

► No serious resource management, but measurement (all over the code).


Impala – Hive killer?

► Not so fast.

► Hive does things Impala cannot do yet, like joins between several big tables.

► Hive has convenient Java UDFs, while Impala does not.

► Impala does not have intra-query fault tolerance.

► At the same time, MapReduce is not a good framework for a database engine.


Impala – Data formats

► There are scanners for the following formats:

► RCFile

► Parquet (Dremel-style columnar format)

► CSV

► Avro

► Sequence File


Impala – future

► Will get closer to other MPP engines

► Support more formats

► More advanced scheduling and resource management


Basic benchmark

► TPC-H, Q1, SF=10

► 4 EC2 large instances

► Impala takes 4 seconds, while Hive takes about 1 minute.

► This corresponds to a GROUP BY speed of about 235 MB/sec per core.


Impala price per GB

► 1 large instance costs $0.24 per hour.

► The cluster costs $0.96 per hour.

► Cost of 1 second: $0.96 / 3600

► Such a cluster processes 1.75 GB per second.

► So the cost of processing 1 TB is about $0.15.

► That is more than 200 times cheaper than BigQuery ($35 / $0.15 ≈ 233).
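
The arithmetic on this slide can be checked step by step. The instance price, instance count, throughput, and BigQuery price are taken from the slides; treating 1 TB as 1024 GB is an assumption of this sketch.

```python
INSTANCE_USD_PER_HOUR = 0.24    # from the slide
INSTANCES = 4                   # from the benchmark slide
THROUGHPUT_GB_PER_SEC = 1.75    # from the slide
BIGQUERY_USD_PER_TB = 35.0      # from the BigQuery slide
GB_PER_TB = 1024                # assumption: binary terabyte

cluster_usd_per_hour = INSTANCE_USD_PER_HOUR * INSTANCES
cluster_usd_per_sec = cluster_usd_per_hour / 3600
seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
usd_per_tb = cluster_usd_per_sec * seconds_per_tb
ratio = BIGQUERY_USD_PER_TB / usd_per_tb

print(f"cluster: ${cluster_usd_per_hour:.2f}/hour")
print(f"time per TB: {seconds_per_tb:.0f} s")
print(f"cost per TB: ${usd_per_tb:.3f}")
print(f"BigQuery / Impala cost ratio: {ratio:.0f}")
```

The comparison covers compute time for a running cluster only; it does not include storage or the cost of keeping the cluster up between queries.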


What about clouds?


Impala in the cloud is not elastic

► To be elastic we need to create the cluster when we need it.

► Even if we accept per-hour resolution, storage will be a problem.

► S3 will not give us hundreds of MB per second per instance.

► Storing data in the local file system does not help, because that storage is transient.


Impala - conclusions

► It is the first time I remember that we can get our hands on a free MPP database.

► There is no risk in trying it side by side with Hive.

► It is possible to offload part of the work to Impala and do the rest with Hive.

► It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager.


Materials used

► Benchmarks

http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323

https://amplab.cs.berkeley.edu/benchmark/

► Architecture

http://www.slideshare.net/scottleber/impala-19176906

https://cloud.google.com/files/BigQueryTechnicalWP.pdf


Material used - comparisons

► To hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive

► To vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica

► To dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel


Thank you!!!

► Special thanks to

► Faina Kamenetsky, who helped set up clusters on Amazon.


BigDataCraft.com

► We are a boutique consulting company

► Our services are:

► On paper POC

► On hardware POC

► Architecture / Design reviews

► Custom integrations and bug fixing


Impala - Flow