© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating): Performance Improvements Overview
Agenda
• What is Impala?
• Impala at Apache
• What is new in Impala 2.5 (CDH 5.7)
• Impala performance update
• Roadmap
• Q&A
SQL-on-Hadoop engines
SQL-on-Apache Hadoop – Choosing the right tool for the right job
What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• General availability (v1.0) release out since April 2013
• Analytic SQL functionality (v2.0) since October 2014
• Apache incubator project since December 2015
• Previous release: 2.3 (CDH 5.5), released November 2015
• Current release: 2.5 (CDH 5.7), released April 2016 (today's topic)
Impala overview
• Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
• General-purpose SQL query engine:
  • Targeted at analytical workloads
  • Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • Reads widely used Hadoop file formats
  • Talks to widely used Hadoop storage managers
  • Runs on the same nodes that run Hadoop processes
• Highly available
• High performance:
  • C++ instead of Java
  • Runtime code generation
Impala Use Cases
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML (Ibis)
• Data processing with tight SLAs
• Query-able archive with full fidelity
Impala at Apache
• Incubator project since December 2015
• Development process slowly moving to ASF infrastructure (see IMPALA-3221)
• Help wanted!
Where to find the Impala community:
http://impala.io
@apacheimpala
New in Impala 2.5
Usability Enhancements
• Admission control improvements
• Null-safe join/equals
Performance and Scalability
• Runtime filters
• Improved cardinality estimation and join ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)
Integrations
• Support for EMC DSSD
New in Impala 2.5: Performance and Scalability (covered today)
• Runtime filters
• Improved cardinality estimation and join ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Incremental metadata updates (DDL)
• Fast min/max values on partition columns (with query option)
Impala 2.5 (CDH 5.7) improvements vs. Impala 2.3 (CDH 5.5)
• 2.2x speedup for TPC-H
• 1.7x speedup for TPC-H (Nested)
• 4.3x speedup for TPC-DS
Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example:
  SELECT count(*)
  FROM date_dim dt, store_sales
  WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
    AND dt.d_moy = 12;
• How does Impala execute this query?
Runtime filters

SELECT dt.d_year, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND i_category = "Books"
  AND i_class = "fiction"
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, i_brand LIMIT 100

Plan: store_sales (43 billion rows) joins item (198 rows) in Broadcast Join #1, producing 290 million rows. That result joins date_dim (6,200 rows) in Broadcast Join #2, producing 47 million rows into the Aggregate.
Runtime filters: the opportunity
• The planner doesn't know what the set of ss_sold_date_sk and ss_item_sk values contains, even with statistics.
• There's an opportunity to save some work: why bother sending 43 billion of those rows to the joins?
• Runtime filters compute this predicate at runtime.
Step 1: The planner tells Join #1 to produce a bloom filter of qualifying i_item_sk values, and Join #2 to produce a bloom filter of qualifying d_date_sk values.
Step 2: Each join reads all rows from its build side (the right input) and computes a filter containing all distinct values of i_item_sk and d_date_sk.
Step 3: Joins #1 and #2 send their filters to the store_sales scan. The scan eliminates rows that don't have a match in the bloom filters.
The store_sales scan uses the bloom filter from Join #2 to eliminate partitions (ss_sold_date_sk), and the bloom filter from Join #1 to eliminate rows that don't qualify (ss_item_sk). Only 47 million rows now flow out of the scan and through both joins into the Aggregate.
Result:
• 914x reduction in the number of rows coming out of the scan (43 billion -> 47 million)
• 6x reduction in the number of rows coming out of the join (290 million -> 47 million)
Runtime filter variation: global filters

SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
FROM store_sales, customer, customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC

Plan: store_sales (43 billion rows) and customer (3.8 million rows) are shuffled into Shuffle Join #1, which produces 43 billion rows. That result feeds Broadcast Join #2 with customer_demo (2,400 rows), producing 49 million rows into the Aggregate.
Joins #1 and #2 are expensive because the left side of each join carries 43 billion rows.
Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
Result: customer rows reduced by 826x, from 3.8 million to 4,600.
Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
Result: 877x reduction in rows scanned from store_sales (43 billion -> 49 million).
Enabled with: set RUNTIME_FILTER_MODE=GLOBAL;
Runtime filters: real-world results
• Runtime filters can be highly effective: some benchmark queries are more than 30 times faster in Impala 2.5.0.
• As always, results depend on your queries, your schemas, and your cluster environment.
• By default, runtime filters are enabled in limited 'local' mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters include:
  • RUNTIME_BLOOM_FILTER_SIZE: [1048576]
  • RUNTIME_FILTER_WAIT_TIME_MS: [0]
Improved cardinality estimates and join ordering
1. More robust scan cardinality estimation
  • Mitigates correlated predicates (exponential backoff), e.g.:
    SELECT * FROM cars WHERE cars.make = 'Toyota' AND cars.model = 'Camry'
2. Improved join cardinality estimation
  • Special treatment of the common case of PK/FK joins
  • Detects selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
• TPC-H Q8 impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
Query start-up: performance impact
LLVM codegen support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort (new in Impala 2.5)
• Top-N (new in Impala 2.5)
Data types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL (extended in Impala 2.5)
Codegen for ORDER BY & Top-N: interpreted comparator (original code)

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: { .. .. }
    case TYPE_TINYINT: { .. .. }
    case TYPE_INT: { .. ..
  }
}

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value, sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;  // Otherwise, try the next Expr
  }
  return 0;  // fully equivalent key
}
Codegen for ORDER BY & Top-N: generated comparator

int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs);  // loop unrolled: column 0 only
  int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;  // Otherwise, try the next Expr
  return 0;  // fully equivalent key
}

Compared with the original code, the generated code:
• Fully unrolls the "for each sort column" loop
• Eliminates switching on the input type(s)
• Removes branching on ASCENDING/DESCENDING and NULLS FIRST/LAST
The generated comparator is 10x more efficient than the interpreted one.
Decimal arithmetic and aggregation
Float/Double vs. Decimal?
Pros for Float/Double:
• Uses less memory
• Faster, because floating-point math operations are natively supported by processors (Decimal uses fixed-point hardware types: int64 and __int128)
• Can represent a larger range of numbers
Cons for Float/Double:
• Precision errors compound during aggregations
• Can't do math with a wide number of significant digits (e.g. 123456789.1 * .0000987654321)
No go for applications requiring high precision and accuracy. What about the performance penalty?
Decimal arithmetic and aggregation

SELECT l_returnflag, l_linestatus,
       Sum(l_quantity) AS sum_qty,
       Sum(l_extendedprice) AS sum_base_price,
       Sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus

3x speedup in Impala 2.5:
• Simplified the overflow check for decimal
• Extended the codegen framework to support aggregations involving decimal
• Bridged the performance gap between double and decimal
Distributed aggregations in Impala

select cust_id, sum(dollars) from sales group by cust_id;

• Impala aggregations have two phases:
  • Pre-aggregation phase (each node runs Scan -> Preagg)
  • Merge phase (rows are exchanged over the network into Merge aggregators)
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value, e.g. many sales per customer.
Downsides of pre-aggregations

select distinct * from sales;

• Pre-aggregations consume memory and CPU cycles.
• Pre-aggregations are not always effective at reducing network traffic, e.g. SELECT DISTINCT over nearly-distinct rows.
• Pre-aggregations can spill to disk under memory pressure. Disk I/O is bad: better to send rows to the merge aggregation than to disk.
Streaming pre-aggregations in Impala 2.5

select distinct * from sales;

• The reduction factor is dynamically estimated based on the actual data processed.
• The pre-aggregation expands memory usage only if the reduction factor is good.
• Benefits:
  • Certain aggregations with a low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don't spill to disk
Streaming pre-aggregations in Impala 2.5

Baseline, finished in 23.13 seconds:

Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
00:SCAN HDFS  15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders

With streaming pre-aggregation enabled, finished in 14.9 seconds:

Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
00:SCAN HDFS  15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders

Note the 01:AGGREGATE row: with STREAMING, peak memory drops from 1.64 GB to 153.40 MB and time from ~10s to ~6s.
Optimization for partition key scans

• Use metadata to avoid table accesses for partition-key-only scans:
  select min(month), max(year) from functional.alltypes;
  (month and year are partition keys of the table)
• Enabled by the query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable to:
  • min(), max(), ndv(), and aggregate functions with the DISTINCT keyword
  • partition keys only

Plan with the optimization:
01:AGGREGATE [FINALIZE]
|  output: min(month), max(year)
|
00:UNION
   constant-operands=24

Plan without the optimization:
03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB
Competitive benchmark: TPC-DS

Hardware: 21-node cluster, each node with:
• 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
• 12 disk drives at 932GB each (one for the OS, the rest for HDFS)

Comparative set:
• Impala 2.5
  • RUNTIME_FILTER_MODE = 2 (GLOBAL)
• Spark SQL 1.6
  • Thrift JDBC server used to avoid startup cost
  • --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240

Workload:
• TPC-DS 15TB stored in Parquet file format (default 256MB block size)
• Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
• Caveats: Spark SQL failed running:
  • Q25: bad plan
  • Q47: StackOverflowError
  • Q89: StackOverflowError
Competitive benchmark: query complexity varied

From Q3 (simple):

SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       Sum(ss_ext_sales_price) sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 436
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;

To Q25 (fact-to-fact joins):

SELECT i_item_id, i_item_desc, s_store_id, s_store_name,
       Stddev_samp(ss_net_profit), Stddev_samp(sr_net_loss),
       Stddev_samp(cs_net_profit) AS catalog_sales_profit
FROM store_sales, store_returns, catalog_sales,
     date_dim d1, date_dim d2, date_dim d3, store, item
WHERE d1.d_moy = 4 AND d1.d_year = 2001
  AND d1.d_date_sk = ss_sold_date_sk
  AND i_item_sk = ss_item_sk
  AND s_store_sk = ss_store_sk
  AND ss_customer_sk = sr_customer_sk
  AND ss_item_sk = sr_item_sk
  AND ss_ticket_number = sr_ticket_number
  AND sr_returned_date_sk = d2.d_date_sk
  AND d2.d_moy BETWEEN 4 AND 10 AND d2.d_year = 2001
  AND sr_customer_sk = cs_bill_customer_sk
  AND sr_item_sk = cs_item_sk
  AND cs_sold_date_sk = d3.d_date_sk
  AND d3.d_moy BETWEEN 4 AND 10 AND d3.d_year = 2001
GROUP BY i_item_id, i_item_desc, s_store_id, s_store_name
ORDER BY i_item_id, i_item_desc, s_store_id, s_store_name
LIMIT 100;
Competitive benchmark: results
(charts omitted)
Impala 2.5 is 11x faster (based on geomean)
Performance benchmark takeaways
• Impala unlocks BI usage directly on Hadoop
  • Meets BI low-latency and multi-user requirements
  • Advantage expands when moving from a single user to just 10 users
• Spark SQL enables easier Spark application development
  • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala's design approach for latency and concurrency
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets
Impala and Cloud
• Available today in Impala 2.5:
  • All the same Impala functionality, performance, and third-party integrations
  • Supported across our cloud partners
  • Deployment via Director
  • Modular architecture enables cloud's decoupled storage and elasticity future
• Available soon in Impala 2.6:
  • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
  • Dynamically sized runtime filters
  • Parquet scanner optimization
  • Faster joins, aggregations, sorts, and decimal arithmetic
  • Rack-aware scheduling
  • Faster code generation
Impala Roadmap

2H 2015:
• SQL Support & Usability: nested structures; Kudu updates (beta)
• Management & Security: record reader service (beta); finer-grained security (Sentry)
• Integration: Isilon support; Python interface (Ibis)
• Performance & Scale: improved predictability under concurrency

1H 2016:
• Performance & Scale: continued scalability and concurrency; initial perf/scale improvements
• Management & Security: improved admission control; resource utilization and showback
• SQL Support & Usability: dynamic partitioning

2016:
• Performance & Scale: >20x performance; multi-threaded joins/aggregations; continued scale work
• Cloud: S3 read/write support
• Management & Security: improved YARN integration; automated metadata
• SQL Support & Usability: data type improvements; added SQL extensions
Appendix
Query start-up improvements
• Pre Impala 2.5:
  • Coordinator starts receiving fragments before senders
  • Problem: serializes startup; greater scale and plan complexity mean slower startup
• Impala 2.5:
  • Coordinator starts fragments in any order
  • Added wait logic for senders and receivers
Scheduling small queries
• The query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from.
• Selection always starts with the same replica, to make optimal use of OS buffer caches.
• This can lead to hot-spots for some workloads.
• Improvement: pick an impalad at random.
New query option: random_replica
• Disabled by default:
  set random_replica = 1;
• Also has a corresponding query hint:
  SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
Where it can help:
• Large number of small queries, each with few input tables
• High load on only one of multiple replicas of a table
• Queries are CPU bound
• Benefit: distributes load more evenly over replicas
• Tradeoff: spreading local reads will increase buffer cache usage
What's next:
• Add the possibility to prefer remote reads
• Switch remote impalad selection from round-robin to load-based
• Add rack-awareness
Catalog improvements
• Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations
• Reload metadata of only 'dirty' partitions
• Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven't been modified
• Significantly reduces the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location 'blah')
Catalog improvements: results