Presentations from the Cloudera Impala meetup on Aug 20 2013

DESCRIPTION

Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet + Impala and UDF support
- Henry Robinson on performance tuning for Impala

TRANSCRIPT
Parquet Update / UDFs in Impala
Nong Li, Software Engineer, Cloudera
Agenda

• Parquet
  • File format description
  • Benchmark Results in Impala
  • Parquet 2.0
• UDF/UDAs
Parquet
Data Pages

• Values are stored in data pages as a triple: Definition Level, Repetition Level and Value.
• These are stored contiguously on disk => 1 seek to read a column regardless of nesting.
• Data pages are stored with different encodings:
  • Bit packing and Run Length Encoding (RLE)
  • Dictionary for strings
    • Extended to all types in Parquet 1.1
  • Plain (little endian encoding) for native types.
Parquet 2.0

• Additional Encodings
  • Group VarInt (for small ints)
  • Improved string storage format
  • Delta Encoding (for strings and ints)
• Additional Metadata
  • Sorted files
  • Page/Column/File Statistics
• Expected to further reduce on-disk size and allow for skipping values on the read path.
Hardware Setup

• 10 Nodes
  • 16 Core Xeon
  • 48 GB RAM
  • 12 Disks
• CDH 4.3
• Impala 1.1
TPC-H lineitem table @ 1 TB scale factor

[Bar chart: on-disk size in GB (y-axis 0-800) for Text, Text w/ Lzo, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, Parquet w/ Snappy, and Seq w/ Gzip.]
Query Times on TPC-H lineitem table

[Bar charts: query times (y-axis 0-800) for scans of 1, 3, 5, and 16 (all) columns, 5 columns with 3 clients, and TPC-H Q1 (7 columns), plus bytes read for Q1 (GB); compared across Text, Seq w/ Snappy, Avro w/ Snappy, RcFile w/ Snappy, and Parquet w/ Snappy.]
Query Times on TPCDS Queries

[Bar chart: query times in seconds (y-axis 0-500) for TPC-DS queries Q27, Q34, Q42, Q43, Q46, Q52, Q55, Q59, Q65, Q73, Q79, and Q96; compared across Text, Seq w/ Snappy, RC w/ Snappy, and Parquet w/ Snappy.]
Average Times (Geometric Mean)
• Text: 224 seconds
• Seq Snappy: 257 seconds
• RC Snappy: 150 seconds
• Parquet: 61 seconds
Agenda

• Parquet
  • File format description
  • Benchmark Results in Impala
  • What’s Next
• UDF/UDAs (Work in Progress)
Terminology

• UDF: Tuple -> Scalar user-defined function
  • E.g. Substring
• UDA/UDAF: {Tuple} -> Scalar user-defined aggregate function
  • E.g. Min
• UDTF: {Tuple} -> {Tuple} user-defined table function
Impala 1.2

• Support Hive UDFs (Java)
  • Existing Hive jars will run without a recompile.
• Add Impala (native) UDFs and UDAs.
  • New interface designed to execute as efficiently as possible for Impala.
• Similar interface as Postgres UDFs/UDAs
• UDF/UDA registered for the Impala service in the metadata catalog
  • i.e. CREATE FUNCTION / CREATE AGGREGATE
Example UDF

// This UDF adds two ints and returns an int.
IntVal AddUdf(UdfContext* context,
              const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}
DDL

The CREATE statement will need to specify the UDF/UDA signature, the location of the binary, and the symbol for the UDF function.

CREATE FUNCTION substring(string, int, int) RETURNS string
LOCATION 'hdfs://path' 'com.me.Substring'

CREATE FUNCTION log(anytype) RETURNS anytype
LOCATION 'hdfs://path2' 'Log'
UDFs

• Support for variadic args
• Support for polymorphic types
UDAs

• UDA must implement typical state machine:
  • Init()
  • Update()
  • Serialize()
  • Merge()
  • Finalize()
• Data movement handled by Impala
UDA Example

// This is a sample of implementing the COUNT aggregate function.
void Init(UdfContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void Update(UdfContext* context, const AnyVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void Merge(UdfContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal Finalize(UdfContext* context, const BigIntVal& val) {
  return val;
}
Runtime Code-Generation

• Impala uses LLVM to, at runtime, generate code to run the query.
  • Takes into account constants that are only known after query analysis.
  • Greatly improves CPU efficiency
• Native UDFs/UDAs can benefit from this as well.
  • Instead of providing the UDF/UDA as a shared object, compile it (with Clang) with an additional flag to LLVM IR
  • IR will be integrated with the query execution.
  • No function call overhead for UDF/UDAs
Limitations

• Hive UDAs/UDTFs not supported
• No UDTFs in the native interface
• Can’t run out of process
  • Native interface is designed to support this; will be able to run without a recompile
  • We’re planning to address this in Impala 1.3
Thanks!

• We’d love your feedback for UDFs/UDAs
• Questions?
Performance Considerations for Cloudera Impala

Henry Robinson
[email protected] / @henryr
Impala Meetup 2013-08-20
Agenda
● The basics: Performance Checklist
● Review: How does Impala execute queries?
● What makes queries fast (or slow)?
● How can I debug my queries?
Impala Performance Checklist
● Verify – Simple count(*) query on a relatively big table and verify:
○ Data locality, block locality, and NO check-summing (“Testing Impala Performance”)
○ Optimal IO throughput of HDFS scans (typically ~100 MB/s per disk)
● Stats – BOTH table and column stats, especially for:
○ Joining two large tables
○ Insert into as select through Impala
● Join table ordering – will be automatic in the Impala 2.0 wave. Until then:
○ Largest table first
○ Then most selective to least selective
● Monitor – monitor Impala queries to pinpoint slow queries and drill into potential issues
○ CM 4.6 adds query monitoring
○ CM 5.0 will have the next big enhancements
Part 1: How does Impala execute queries?
The basic idea
● Every Impala query runs across a cluster of multiple nodes, with lots of available CPU cores, memory and disk
● Best query speeds usually come when every node in the cluster has something to do
● Impala solves two basic problems:
○ Figure out what every node should do (compilation)
○ Make them do it really quickly! (execution)
Query compilation
● a.k.a. ‘figuring out what every node should do’
● Impala compiles a SQL query into a plan describing what to execute, and where
● A plan is shaped like a tree. Data flows up from the leaves of the tree to the root.
● Each node in the tree is a query operator
● Impala chops this tree up into plan fragments
● Each node gets one or more plan fragments
Query execution
● Once started, each query operator can run independently of any other operator
● Every operator can be doing something at the same time
● This is the not-so-secret sauce for all massively parallel query execution engines
Part 2: What makes queries fast (or... slow)?
What determines performance?
● Data size
● Per-operator execution efficiency
● Available parallelism
● Available concurrency
● Hardware
● Schema design and file format
Data size
● More data means more work
● Not just the size of the disk-based data at plan leaves, but the size of internal data flowing into any operator
● How can you help?
○ Partition your data
○ SELECT with LIMIT in subqueries
○ Push predicates down
○ Use correct JOIN order■ Gather table statistics
○ Use the right file format
Table Ordering

● Tables are joined in the order listed in the FROM clause
● Impala uses left-deep trees for nested joins
● “Largest” table should be listed first
○ largest = returning most rows before join filtering
○ In a star schema, this is often the fact table
● Then list tables in order of most selective join filter to least selective
○ Filter the most rows as early as possible
Join Types

● Two types of join strategy are supported
○ Broadcast
○ Shuffle/Partitioned
● Broadcast
○ Each node receives a full copy of the right table
○ Per-node memory usage = size of right table
● Shuffle
○ Both sides of the join are partitioned
○ Matching partitions sent to same node
○ Per-node memory usage = 1/nodes x size of right table
● Without column statistics, all joins are broadcast
Per-operator execution efficiency
● Impala is fast, and getting faster
● LLVM-based improvements
● More efficient disk scanners
● More modern algorithms from the DB literature
● How can you help?
○ Upgrade to the latest version
Available parallelism
● Parallelism: number of resources available to use at once
● More hardware means more parallelism
● Impala will take advantage of more cores, disks and memory where possible
● Easiest (but most expensive!) way to improve performance of large class of queries
● You can scale up incrementally
Available concurrency
● Concurrency: how well can a query take advantage of available parallelism?
● Impala will take care of this mostly for you
● But some operators naturally don’t parallelise well in certain conditions
● For example: joining two huge tables together.
○ The hash-node operators have to wait for one side to be read completely before reading much of the other side
● How you can help:
○ Read the profiles, look for obvious bottlenecks, rephrase if possible
Hardware

● Designed for modern hardware
○ Leverages SSE 4.2 (Intel Nehalem or newer)
○ LLVM Compiler Infrastructure
○ Runtime Code Generation
○ In-memory execution pipelines
● Today’s hardware
○ 2 x Xeon E5 6-core CPUs
○ 12 x 3 TB HDD
○ 128 GB RAM
● How you can help:
○ Use the supported platforms, with Cloudera’s packages
Schema design
● PARTITION BY is an easy win
● In general, string is slower than fixed-width types (particularly for aggregations etc)
● File formats are crucial
○ Experiment with Parquet for performance
○ Avoid text
Supported File Formats

● Various HDFS file formats
○ Text File (read/write)
○ Avro (read)
○ SequenceFile (read)
○ RCFile (read)
○ ParquetFile (read/write)
● Various compression codecs
○ Snappy (ParquetFile, RCFile, SequenceFile, Avro)
○ LZO (Text)
○ Bzip (ParquetFile, RCFile, SequenceFile, Avro)
○ Gzip (ParquetFile, RCFile, SequenceFile, Avro)
● HBase also supported
Partitioning Considerations

● Single largest performance feature
○ Skips unnecessary data
○ Requires queries contain partition keys as filters
● Choose a reasonable number of partitions
○ Lots of small files becomes an issue
○ Metadata overhead on NameNode
○ Metadata overhead for Hive Metastore
○ Impala caches this, but first load may take long
Part 3: Debugging queries
The Debug Pages
● Every impalad exports a lot of useful information on http://<impalad>:25000 (by default), including:
○ Last 25 queries
○ Active sessions
○ Known tables
○ Last 1 MB of the log
○ System metrics
○ Query profiles
● Information-dense - not for the faint of heart!
Thanks! Questions?
Try It Out!

● Apache-licensed open source
○ Impala 1.1 released 7/24/2013
○ Impala 1.0 GA released 4/30/2013
● Questions/comments?
○ Download: cloudera.com/impala
○ Email: [email protected]
○ Join: groups.cloudera.org
○ MeetUp: meetup.com/Bay-Area-Impala-Users-Group/