impala and bigquery (1)
TRANSCRIPT
8/12/2019 Impala and BigQuery (1)
http://slidepdf.com/reader/full/impala-and-bigquery-1 1/47
Impala and BigQuery
By David Gruzman
BigDataCraft.com
Impala and BigQuery
by David Gruzman
► BigQuery is Google's database service based on Dremel. BigQuery is hosted by Google.
► Impala is an open source database inspired by the Dremel paper. Impala is part of the Cloudera Hadoop distribution.
Today's agenda
► Overview of Dremel as a technology
► Overview of BigQuery
► A few words about Impala
► DG Mediamind use case
► Deeper insights into Impala
► Conclusions
► Q&A
Why Dremel?
► Google was the first to have MapReduce
► Google was the first to face the main problem of MapReduce: latency. The problem also propagated to engines built on top of MapReduce.
► It is logical that Google was the first to approach it by developing a real-time query capability for big data.
How Dremel is used at Google
► Dremel is not a replacement for MapReduce or Tenzing but complements them. (Tenzing is Google's Hive.)
► An analyst can run many fast queries using Dremel
► After getting a good idea of what is needed, run slow MapReduce (or SQL based on MapReduce) to get precise results
Why Dremel is unique
► Dremel, with BigQuery built on top of it, is probably the only interactive big data query engine today.
► I mean that it is the only engine capable of producing results over terabytes of data in seconds!
► The main idea (my guess) is to harness a huge cluster of machines for a single query.
Dremel as a technology
► Novel hierarchical columnar format.
► LLVM-based code generation.
► Distributed aggregation tree.
► In-situ data processing (inside the storage).
Dremel : Aggregation tree
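As a rough illustration of an aggregation tree, here is a minimal Python sketch, assuming a SUM ... GROUP BY query. Leaf servers aggregate only their own shard, and every level above merges the partial results of its children until a single result remains at the root. The function names, the fanout parameter, and the sample shards are hypothetical and are not part of Dremel or BigQuery.

```python
from collections import Counter


def leaf_aggregate(rows):
    """Leaf server: partial SUM(value) GROUP BY key over its local shard."""
    partial = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def merge_partials(partials):
    """Intermediate or root server: combine the partial sums of its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return total


def run_tree(shards, fanout=2):
    """Aggregate level by level; fanout (must be >= 2) is children per node."""
    level = [leaf_aggregate(shard) for shard in shards]
    if not level:
        return {}
    while len(level) > 1:
        level = [merge_partials(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return dict(level[0])


# Hypothetical sample data: three shards of (key, value) rows.
shards = [
    [("us", 3), ("eu", 1)],
    [("us", 2)],
    [("eu", 4), ("asia", 5)],
]
result = run_tree(shards)
print(result)
```

Each node only ever sees already reduced partial results, which is why the work of one query can be spread over many machines.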
Dremel : Nested columnar format
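A deliberately simplified Python sketch of the column striping idea: every leaf field path of a nested record gets its own column, so a query that reads one field touches one column only. This sketch is an assumption-laden illustration. It omits the repetition and definition levels that the Dremel paper uses to reassemble records, so it is lossy, and the stripe function and the sample records are hypothetical.

```python
def stripe(record, columns, prefix=""):
    """Append every leaf value of a nested record to the column of its path."""
    for name, value in record.items():
        path = prefix + name
        if isinstance(value, dict):
            # Nested group: descend and extend the path.
            stripe(value, columns, path + ".")
        elif isinstance(value, list):
            # Repeated field: each element goes to the same column(s).
            for item in value:
                if isinstance(item, dict):
                    stripe(item, columns, path + ".")
                else:
                    columns.setdefault(path, []).append(item)
        else:
            columns.setdefault(path, []).append(value)
    return columns


# Hypothetical nested records.
records = [
    {"id": 1, "links": {"forward": [20, 40]}, "name": [{"url": "a"}, {"url": "b"}]},
    {"id": 2, "name": [{"url": "c"}]},
]

columns = {}
for record in records:
    stripe(record, columns)
print(columns)
```

An aggregation over name.url now scans a single list of values instead of parsing whole records.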
BigQuery
► Service built by Google on top of the Dremel engine
► The only (known to me) query engine as a service working with big data.
► Query time does not depend on data size
BigQuery main capabilities
► Aggregations
► Join of big table to small table.
► Join of two big tables (recently added)
► Hierarchical data format. It makes pre-aggregations cheaper.
Main limitations
► Small result size
► Intermediate results should not exceed memory size.
► No “external tables”
Why BigQuery is not popular
So, why is BigQuery not popular?
► Data is not created in the Google cloud. It is hard and impractical to move big data. It is heavy, after all.
► Google tends to change its APIs. BigQuery has also changed over the last years. It is hard to build a business on it.
► Many companies in Internet-related businesses are wary of sharing data with Google.
► It is expensive. $35 per TB can lead to bills of thousands of dollars per day.
Dremel
At the same time, it is good technically
► I got references from a company doing serious testing
► Martin Fowler's company also tested it and gave very good feedback.
Question to all of you
Why did your organization decide not to use Google's BigQuery?
Where we can find Impala
What is Impala
► Massively parallel processing (MPP) database engine, developed by Cloudera.
► Integrated into the Hadoop stack at the same level as MapReduce, and not above it (as Hive and Pig are)
[Diagram: HDFS at the bottom; MapReduce and Impala side by side above it; Hive and Pig on top of MapReduce]
Why Impala
► Data has gravity
► Today a lot of data lives in HDFS
► It is not practical to move big data
► It is practical to bring the engine to the data
► At the same time, MapReduce is not a must
► Impala processes data in the Hadoop cluster without using MapReduce
MapReduce bypass
► Several other modern database engines have also seen the opportunity to bypass MapReduce and work directly with HDFS.
► They take various approaches.
MapReduce bypass
► Existing MPP databases, like Greenplum, store their external tables in HDFS
MapReduce bypass
► JethroData stores data in its own format on HDFS and also works with it without the MR layer.
► Their proprietary format enables full indexing of the data together with columnar efficiency. For highly selective queries this approach has serious advantages.
Use case from DG
I think it will be a typical case in the future
► DG is using Hadoop and Hive
► Evaluating Impala to do part of the work more efficiently.
► After their case presentation we will come back to discuss insights into Impala
Again: Impala has a different place than Pig and Hive
[Diagram: HDFS at the bottom; MapReduce and Impala side by side above it; Hive and Pig on top of MapReduce]
Impala architecture
Impala – Dremel traces
► LLVM code generation
► It is really fast
► C++ as the implementation language (not Java...)
► Simple query engine. It actually does the things which can be done in memory.
► Broadcast join algorithm is implemented
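A minimal sketch of the broadcast join idea in Python, assuming the usual setting: the big table is partitioned across nodes, and the small table fits in memory and has unique join keys. Every node receives a full copy of the small table, builds a hash table from it, and probes it with its local rows, so the big table never moves. The broadcast_join function and the sample tables are hypothetical and do not show Impala's actual code.

```python
def broadcast_join(big_partitions, small_table):
    """Join each local partition of the big table against a full copy
    of the small table, using an in-memory hash lookup."""
    results = []
    for partition in big_partitions:    # each iteration stands in for one node
        lookup = dict(small_table)      # the "broadcast" copy held by this node
        for key, payload in partition:  # probe with local big-table rows
            if key in lookup:
                results.append((key, payload, lookup[key]))
    return results


# Hypothetical sample data: events partitioned over two nodes, small user table.
events = [
    [(1, "click"), (2, "view")],
    [(3, "click"), (1, "view")],
]
users = [(1, "US"), (2, "DE")]

joined = broadcast_join(events, users)
print(joined)
```

The approach works only while the small table fits in the memory of every node, which is why joins of two big tables need a different algorithm.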
LLVM code generation
► Assume you want to write custom code for a specific query. It will be super efficient
► Code generation automates this process for each query
► We actually need to super-optimize the inner loop doing filtering (WHERE) and GROUP BY.
► LLVM enables us to compile into native code in a fraction of a second
► LLVM enables us to use new CPU capabilities like SSE in a portable way.
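The idea of per-query code generation can be sketched in Python as an analogy only. Instead of LLVM and native code, this sketch builds Python source specialized for one query (a filter plus a grouped sum) and compiles it with the built-in compile and exec. Column names and the threshold are baked into the generated inner loop as constants, which is the part a real engine would hand to LLVM. The generate_query_function helper and the sample rows are hypothetical.

```python
def generate_query_function(filter_column, threshold, group_column, sum_column):
    """Generate and compile a function specialized for one query:
    SELECT group_column, SUM(sum_column) WHERE filter_column > threshold
    GROUP BY group_column."""
    source = f"""
def run(rows):
    totals = {{}}
    for row in rows:
        if row[{filter_column!r}] > {threshold!r}:
            key = row[{group_column!r}]
            totals[key] = totals.get(key, 0) + row[{sum_column!r}]
    return totals
"""
    namespace = {}
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["run"]


# Hypothetical sample rows.
rows = [
    {"region": "us", "price": 10, "qty": 2},
    {"region": "eu", "price": 3, "qty": 7},
    {"region": "us", "price": 8, "qty": 1},
    {"region": "eu", "price": 6, "qty": 4},
]

run = generate_query_function("price", 5, "region", "qty")
totals = run(rows)
print(totals)
```

The generated loop contains no per-row interpretation of the query plan. In an interpreted language the gain is small, but in a native engine the same specialization removes branches and virtual calls from the hot loop.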
Why is code generation interesting?
► If you develop your own engine, or some piece of code responsible for processing serious data volumes, code generation may give you an order-of-magnitude boost.
► I have had cases where using such technology was game changing
Impala – Hive traces
► While Dremel converts data into its own format, Impala supports multiple formats. It is a kind of schema on read.
► Impala shares the metastore with Hive, which makes adoption very simple
► Internally, Impala has a well-defined way to add new formats
Impala vs MPP
► It usually takes many years to create an MPP database.
► There are serious simplifications:
► The data is read only
► There is actually no DBMS, only a query engine.
► No serious resource management, only measurement (all over the code).
Impala – Hive killer?
► Not so quickly.
► Hive does things Impala cannot do yet, like joins between several big tables.
► Hive has convenient Java UDFs, while Impala does not
► Impala does not have intra-query fault tolerance.
► At the same time, MapReduce is not a good framework for a database engine
Impala – Data Formats
► There are scanners for the following types:
► RCFile
► Parquet (Dremel-style columnar format)
► CSV
► Avro
► SequenceFile
Impala – future
► Will get closer to other MPP engines
► Support more formats
► More advanced scheduling and resource
management
Basic benchmark
► TPC-H, Q1, SF=10
► 4 EC2 large instances
► 4 seconds, while Hive takes about 1 minute.
► This number means a GROUP BY speed of about 235 MB/sec per core.
Impala price per GB
► 1 large instance costs $0.24 per hour
► The cluster costs $0.96 per hour.
► Cost of 1 second: $0.96 / 3600
► Such a cluster processes 1.75 GB per second
► So the cost of processing 1 TB is about $0.15
► That is roughly 230 times cheaper than BigQuery at $35 per TB
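The arithmetic on this slide can be checked with a few lines of Python. The inputs are the figures quoted in the talk (4 large instances at $0.24 per hour, 1.75 GB per second, and BigQuery at $35 per TB); the decimal terabyte of 1000 GB is an assumption of this sketch. With these inputs the cost comes out at about $0.15 per TB and the ratio to BigQuery at roughly 230.

```python
# Figures quoted in the talk.
INSTANCE_PRICE_PER_HOUR = 0.24   # USD, one EC2 large instance
INSTANCES = 4
THROUGHPUT_GB_PER_SEC = 1.75     # whole cluster
BIGQUERY_PRICE_PER_TB = 35.0     # USD

GB_PER_TB = 1000                 # assumption: decimal terabyte

cluster_per_hour = INSTANCE_PRICE_PER_HOUR * INSTANCES
cluster_per_second = cluster_per_hour / 3600
seconds_per_tb = GB_PER_TB / THROUGHPUT_GB_PER_SEC
impala_cost_per_tb = cluster_per_second * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / impala_cost_per_tb

print(f"cluster cost per hour: ${cluster_per_hour:.2f}")
print(f"Impala cost per TB:    ${impala_cost_per_tb:.3f}")
print(f"BigQuery / Impala:     {ratio:.0f}x")
```

The comparison counts only the compute time of a running cluster; it leaves out storage, idle time, and operations.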
What about clouds?
Impala in the cloud is not elastic
► To be elastic we need to create a cluster when we need it.
► Even if we accept hourly resolution, storage will be a problem
► S3 will not give us hundreds of MB per second per instance
► Data stored in the local file system is transient
Impala - conclusions
► It is the first time I can remember that we can get our hands on a free MPP database.
► There is no risk in trying it side by side with Hive
► It is possible to offload part of the work to Impala and do the rest with Hive
► It is part of the Cloudera Hadoop distribution and is easily installed with Cloudera Manager
Materials used
► Benchmarks
http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323
https://amplab.cs.berkeley.edu/benchmark/
► Architecture
http://www.slideshare.net/scottleber/impala-19176906
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Materials used - comparisons
► To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive
► To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica
► To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel
Thank you!!!
► Special thanks to
► Faina Kamenetsky, who helped set up the clusters on Amazon.
BigDataCraft.com
► We are a boutique consulting company
► Our services are:
► On paper POC
► On hardware POC
► Architecture / Design reviews
► Custom integrations and bug fixing
Impala - Flow