Presentations from the Cloudera Impala meetup on Aug 20 2013

DESCRIPTION

Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet + Impala and UDF support
- Henry Robinson on performance tuning for Impala

TRANSCRIPT
Parquet Update / UDFs in Impala
Nong Li, Software Engineer, Cloudera
Agenda

• Parquet
  • File format description
  • Benchmark Results in Impala
  • Parquet 2.0
• UDF/UDAs
Parquet
Data Pages

• Values are stored in data pages as a triple: Definition Level, Repetition Level and Value.
• These are stored contiguously on disk => 1 seek to read a column regardless of nesting.
• Data pages are stored with different encodings:
  • Bit packing and Run Length Encoding (RLE)
  • Dictionary for strings
    • Extended to all types in Parquet 1.1
  • Plain (little endian encoding) for native types.
Parquet 2.0

• Additional Encodings
  • Group VarInt (for small ints)
  • Improved string storage format
  • Delta Encoding (for strings and ints)
• Additional Metadata
  • Sorted files
  • Page/Column/File Statistics
• Expected to further reduce on-disk size and allow for skipping values on the read path.
Hardware Setup

• 10 Nodes
  • 16 Core Xeon
  • 48 GB RAM
  • 12 Disks
• CDH 4.3
• Impala 1.1
TPC-H lineitem table @ 1 TB scale factor

[Bar chart: on-disk size in GB (y-axis 0-800) for Text, Text w/ Lzo, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy, and Seq w/ Gzip.]
Query Times on TPC-H lineitem table

[Bar charts: query times (y-axis 0-800) for scans of 1, 3, 5, and 16 (all) columns, 5 columns with 3 clients, and TPC-H Q1 (7 columns), plus bytes read for Q1 (GB); compared across Text, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, and Parquet w/ Snappy.]
Query Times on TPCDS Queries

[Bar chart: query times in seconds (y-axis 0-500) for TPC-DS queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, and Q96; compared across Text, Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy.]
Average Times (Geometric Mean)
• Text: 224 seconds
• Seq Snappy: 257 seconds
• RC Snappy: 150 seconds
• Parquet: 61 seconds
Agenda

• Parquet
  • File format description
  • Benchmark Results in Impala
  • What’s Next
• UDF/UDAs (Work in Progress)
Terminology

• UDF: Tuple -> Scalar user-defined function
  • E.g. Substring
• UDA/UDAF: {Tuple} -> Scalar user-defined aggregate function
  • E.g. Min
• UDTF: {Tuple} -> {Tuple} user-defined table function
Impala 1.2

• Support Hive UDFs (Java)
  • Existing Hive jars will run without a recompile.
• Add Impala (native) UDFs and UDAs.
  • New interface designed to execute as efficiently as possible for Impala.
• Similar interface as Postgres UDFs/UDAs
• UDF/UDA registered for the Impala service in the metadata catalog
  • i.e. CREATE FUNCTION / CREATE AGGREGATE
Example UDF

// This UDF adds two ints and returns an int.
IntVal AddUdf(UdfContext* context,
              const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
DDL

The CREATE statement will need to specify the UDF/UDA signature, the location of the binary, and the symbol for the UDF function.

CREATE FUNCTION substring(string, int, int) RETURNS string
LOCATION 'hdfs://path' 'com.me.Substring'

CREATE FUNCTION log(anytype) RETURNS anytype
LOCATION 'hdfs://path2' 'Log'
UDFs

• Support for variadic args
• Support for polymorphic types
UDAs

• UDA must implement typical state machine:
  • Init()
  • Update()
  • Serialize()
  • Merge()
  • Finalize()
• Data movement handled by Impala
UDA Example

// This is a sample of implementing the COUNT aggregate function.
void Init(UdfContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void Update(UdfContext* context, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void Merge(UdfContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal Finalize(UdfContext* context, const BigIntVal& val) {
  return val;
}
Runtime Code-Generation

• Impala uses LLVM to, at runtime, generate code to run the query.
  • Takes into account constants that are only known after query analysis.
  • Greatly improves CPU efficiency
• Native UDFs/UDAs can benefit from this as well.
  • Instead of providing the UDF/UDA as a shared object, compile it (with Clang) with an additional flag to LLVM IR
  • IR will be integrated with the query execution.
  • No function call overhead for UDF/UDAs
Limitations

• Hive UDAs/UDTFs not supported
• No UDTFs in the native interface
• Can’t run out of process
  • Native interface is designed to support this; will be able to run without a recompile
  • We’re planning to address this in Impala 1.3
Thanks!

• We’d love your feedback for UDFs/UDAs
• Questions?
Performance Considerations for Cloudera Impala

Henry Robinson
[email protected] / @henryr
Impala Meetup 2013-08-20
Agenda
● The basics: Performance Checklist
● Review: How does Impala execute queries?
● What makes queries fast (or slow)?
● How can I debug my queries?
Impala Performance Checklist
● Verify – Simple count(*) query on a relatively big table and verify:
○ Data locality, block locality, and NO check-summing (“Testing Impala Performance”)
○ Optimal IO throughput of HDFS scans (typically ~100 MB/s per disk)
● Stats – BOTH table and column stats, especially for:
○ Joining two large tables
○ Insert into as select through Impala
● Join table ordering – will be automatic in the Impala 2.0 wave. Until then:
○ Largest table first
○ Then most selective to least selective
● Monitor – monitor Impala queries to pinpoint slow queries and drill into potential issues
○ CM 4.6 adds query monitoring
○ CM 5.0 will have the next big enhancements
Part 1: How does Impala execute queries?
The basic idea
● Every Impala query runs across a cluster of multiple nodes, with lots of available CPU cores, memory and disk
● Best query speeds usually come when every node in the cluster has something to do
● Impala solves two basic problems:
○ Figure out what every node should do (compilation)
○ Make them do it really quickly! (execution)
Query compilation
● a.k.a. ‘figuring out what every node should do’
● Impala compiles a SQL query into a plan describing what to execute, and where
● A plan is shaped like a tree. Data flows up from the leaves of the tree to the root.
● Each node in the tree is a query operator
● Impala chops this tree up into plan fragments
● Each node gets one or more plan fragments
Query execution
● Once started, each query operator can run independently of any other operator
● Every operator can be doing something at the same time
● This is the not-so-secret sauce for all massively parallel query execution engines
Part 2: What makes queries fast (or... slow)?
What determines performance?
● Data size
● Per-operator execution efficiency
● Available parallelism
● Available concurrency
● Hardware
● Schema design and file format
Data size
● More data means more work
● Not just the size of the disk-based data at plan leaves, but the size of internal data flowing into any operator
● How can you help?
○ Partition your data
○ SELECT with LIMIT in subqueries
○ Push predicates down
○ Use correct JOIN order■ Gather table statistics
○ Use the right file format
Table Ordering

● Tables are joined in the order listed in the FROM clause
● Impala uses left-deep trees for nested joins
● “Largest” table should be listed first
○ largest = returning most rows before join filtering
○ In a star schema, this is often the fact table
● Then list tables in order of most selective join filter to least selective
○ Filter the most rows as early as possible
Join Types

● Two types of join strategy are supported
○ Broadcast
○ Shuffle/Partitioned
● Broadcast
○ Each node receives a full copy of the right table
○ Per-node memory usage = size of right table
● Shuffle
○ Both sides of the join are partitioned
○ Matching partitions sent to same node
○ Per-node memory usage = 1/nodes x size of right table
● Without column statistics, all joins are broadcast
Per-operator execution efficiency
● Impala is fast, and getting faster
● LLVM-based improvements
● More efficient disk scanners
● More modern algorithms from the DB literature
● How can you help?
○ Upgrade to the latest version
Available parallelism
● Parallelism: number of resources available to use at once
● More hardware means more parallelism
● Impala will take advantage of more cores, disks and memory where possible
● Easiest (but most expensive!) way to improve performance of large class of queries
● You can scale up incrementally
Available concurrency
● Concurrency: how well can a query take advantage of available parallelism?
● Impala will take care of this mostly for you
● But some operators naturally don’t parallelise well in certain conditions
● For example: joining two huge tables together.
○ The hash-node operators have to wait for one side to be read completely before reading much of the other side
● How you can help:
○ Read the profiles, look for obvious bottlenecks, rephrase if possible
Hardware

● Designed for modern hardware
○ Leverages SSE 4.2 (Intel Nehalem or newer)
○ LLVM Compiler Infrastructure
○ Runtime Code Generation
○ In-memory execution pipelines
● Today’s hardware
○ 2 x Xeon E5 6-core CPUs
○ 12 x 3 TB HDD
○ 128 GB RAM
● How you can help:
○ Use the supported platforms, with Cloudera’s packages
Schema design
● PARTITION BY is an easy win
● In general, string is slower than fixed-width types (particularly for aggregations etc)
● File formats are crucial
○ Experiment with Parquet for performance
○ Avoid text
Supported File Formats

● Various HDFS file formats
○ Text File (read/write)
○ Avro (read)
○ SequenceFile (read)
○ RCFile (read)
○ ParquetFile (read/write)
● Various compression codecs
○ Snappy (ParquetFile, RCFile, SequenceFile, Avro)
○ LZO (Text)
○ Bzip (ParquetFile, RCFile, SequenceFile, Avro)
○ Gzip (ParquetFile, RCFile, SequenceFile, Avro)
● HBase also supported
Partitioning Considerations

● Single largest performance feature
○ Skips unnecessary data
○ Requires queries contain partition keys as filters
● Choose a reasonable number of partitions
○ Lots of small files becomes an issue
○ Metadata overhead on NameNode
○ Metadata overhead for Hive Metastore
○ Impala caches this, but first load may take long
Part 3: Debugging queries
The Debug Pages
● Every impalad exports a lot of useful information on http://<impalad>:25000 (by default), including:
○ Last 25 queries
○ Active sessions
○ Known tables
○ Last 1 MB of the log
○ System metrics
○ Query profiles
● Information-dense - not for the faint of heart!
Thanks! Questions?
Try It Out!

● Apache-licensed open source
○ Impala 1.1 released 7/24/2013
○ Impala 1.0 GA released 4/30/2013
● Questions/comments?
○ Download: cloudera.com/impala
○ Email: [email protected]
○ Join: groups.cloudera.org
○ MeetUp: meetup.com/Bay-Area-Impala-Users-Group/