cloudera impala - san diego big data meetup august 13th 2014


Cloudera Impala
SD Big Data Monthly Meetup #2
August 13th 2014
Maxime Dumas, Systems Engineer

Thirty Seconds About Max

•  Systems Engineer
•  aka Sales Engineer
•  SoCal, AZ, NV
•  former coder of PHP
•  teaches meditation + yoga
•  from Montreal, Canada

What Does Cloudera Do?

•  product
•  distribution of Hadoop components, Apache licensed
•  enterprise tooling
•  support
•  training
•  services (aka consulting)
•  community

What This Talk Isn’t About

•  deploying
   •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  sizing & tuning
   •  depends heavily on data and workload
•  coding
   •  unless you count XML or CSV or SQL
•  algorithms

Public Domain IFCAR

What is Cloudera Impala?

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/ noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Quick and dirty, for context.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop
   •  HDFS
   •  MapReduce
•  Today, dozens of interrelated components
   •  I/O
   •  Processing
   •  Specialty Applications
   •  Configuration
   •  Workflow

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
   •  http://research.google.com/archive/gfs.html
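As a rough illustration of the fault-tolerance idea, the sketch below splits a file into fixed-size blocks and keeps several copies of each block on different machines. The block size, replication factor, node names, and the round-robin placement are all assumptions made for this example; real HDFS placement is rack-aware and more involved.

```python
def split_into_blocks(file_size, block_size):
    """Return the size of each block for a file of the given size."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes), so the chosen nodes never repeat.
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def replicas_after_failure(placement, failed_node):
    """Drop the failed node and report which replicas are still reachable."""
    return {
        block: [n for n in holders if n != failed_node]
        for block, holders in placement.items()
    }


# Hypothetical numbers: a 300 MB file, 128 MB blocks, 3 copies, 4 machines.
blocks = split_into_blocks(300, 128)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
surviving = replicas_after_failure(placement, "n1")
```

With three copies per block, losing any single machine in this toy layout still leaves at least two readable copies of every block.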


Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
   •  http://research.google.com/archive/mapreduce.html
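The paradigm is easiest to see on the classic word-count example. The sketch below runs the map, shuffle, and reduce phases in a single Python process; the function names are made up for illustration, and a real MapReduce job would spread these phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["impala runs on hadoop", "hive runs on mapreduce"])
```

Because each map call sees one line and each reduce call sees one key, both phases can be run in parallel, which is why the model suits distributed batch work.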


Apache Hive

•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
   •  a “SQL-like” language
•  Eases analysis using MapReduce

Apache Hive Metastore

•  Maps HDFS files to DB-like resources
   •  Databases
   •  Tables
   •  Column/field names, data types
   •  Roles/users
   •  InputFormat/OutputFormat

Sqoop


•  SQL to Hadoop

•  Tool to import/export any JDBC-supported database into Hadoop

•  Transfer data between Hadoop and external databases or EDW

•  High performance connectors for some RDBMS

•  Oracle, Teradata, Netezza

•  Developed at Cloudera


Familiar interface, but more powerful.

Cloudera Impala


Interactive SQL for Hadoop
§  Responses in seconds
§  Nearly ANSI-92 standard SQL with Hive SQL

Native MPP Query Engine
§  Purpose-built for low-latency queries
§  Separate runtime from MapReduce
§  Designed as part of the Hadoop ecosystem

Open Source
§  Apache-licensed

Benefits of Impala

More & Faster Value from “Big Data”
§  Interactive BI/Analytics experience via SQL
§  No delays from data migration

Flexibility
§  Query across existing data
§  Select best-fit file formats (Parquet, Avro, etc.)
§  Run multiple frameworks on the same data at the same time

Cost Efficiency
§  Reduce movement, duplicate storage & compute
§  10% to 1% the cost of analytic DBMS

Full Fidelity Analysis
§  No loss from aggregations or fixed schemas

Impala Use Cases

Cost-effective, ad hoc query environment that offloads the data warehouse for:

Interactive BI/analytics on more data
Asking new questions – exploration, ML
Data processing with tight SLAs
Query-able archive w/full fidelity

Our Design Strategy


One pool of (open) data

One metadata model

One security framework

One set of system resources

An Integrated Part of the Hadoop System

[Diagram: the Hadoop stack]
Engines: Batch Processing (MapReduce, Hive & Pig); In-Memory Processing & Streaming (Spark); Interactive SQL (Cloudera Impala); Interactive Search (Cloudera Search); Machine Learning (Mahout, ClouderaML, Oryx); Math & Statistics (SAS, R); …
Shared layers: Resource Management; Integration; Storage (HDFS: text, RCFile, Parquet, Avro, etc.; HBase: records); Metadata; Security

Impala Key Features

Fast
§  In-memory data transfers
§  Partitioned joins
§  Fully distributed aggregations

Flexible
§  Query data in HDFS & HBase
§  Supports multiple file formats & compression algorithms
§  Java & Native UDFs, UDAFs

Secure
§  Integrated with Hadoop security
§  Kerberos authentication
§  Authorization (Sentry)

Easy to Implement
§  Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
§  Open source

Easy to Use
§  Interact with data via SQL
§  Certified with leading BI tools

Simple to Manage
§  Deploy, configure & monitor with Cloudera Manager
§  Integrated with Hadoop resource management
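To make the "partitioned joins" feature concrete, here is a minimal sketch of a hash-partitioned join: rows from both tables are routed to a node by hashing the join key, so each node can join its share independently. The node count, table contents, and function names are hypothetical, and this is a generic textbook technique, not Impala's actual implementation.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def partition(rows, key, num_nodes=NUM_NODES):
    """Route each row to the node chosen by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_join(left_rows, right_rows, key):
    """Plain hash join over the rows that landed on one node."""
    index = defaultdict(list)
    for r in right_rows:
        index[r[key]].append(r)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def partitioned_join(left, right, key):
    """Partition both inputs the same way, then join node by node."""
    joined = []
    for l_part, r_part in zip(partition(left, key), partition(right, key)):
        joined.extend(local_join(l_part, r_part, key))
    return joined


orders = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 1, "amount": 7},
]
customers = [
    {"cust_id": 1, "name": "a"},
    {"cust_id": 2, "name": "b"},
    {"cust_id": 3, "name": "c"},
]
result = partitioned_join(orders, customers, "cust_id")
```

Matching keys always hash to the same node, so no node needs to see the other nodes' rows to produce its part of the answer.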

What’s Coming?*

In the Near Term:

SQL 2003-Compliant Analytic Window Functions
Additional Authentication Mechanisms
User Defined Table Functions
Intra-node Parallelized Aggregations & Joins
Nested Data
Enhanced YARN-Integrated Resource Manager
Dynamic Partition Pruning

*On the roadmap… no guarantees

Impala Plays Well with Others

BI Partners: Building on the Enterprise Standard

[Partner logos: “Powered by Impala”]

Not All SQL On Hadoop Is Created Equal

Batch MapReduce
§  Make MapReduce faster
§  Slow, still batch

Remote Query
§  Pull data from HDFS over the network to the DW compute layer
§  Slow, expensive

Siloed DBMS
§  Load data into a proprietary database file
§  Rigid, siloed data, slow ETL

Impala
§  Native MPP query engine that’s integrated into Hadoop
§  Fast, flexible, cost-effective

More Detail On Alternative Approaches

Batch MapReduce
§  Batch-oriented
§  High latency

Remote Query
§  Network bottleneck
§  2x the hardware
§  Duplicate metadata, security, SQL, etc.

Siloed DBMS
§  RDBMS rigidity
§  Query subset of data
§  Duplicate storage, metadata, security, SQL, etc.

[Diagram: in the Remote Query case, DBMS compute sits apart from Hadoop compute and HDFS storage. In the Siloed DBMS case, a proprietary DBMS with its own metadata and security sits beside the standard, shared Hadoop stack (engines: MapReduce, Hive, Pig, Impala, etc.; resource management; integration; HDFS storage; Hadoop metadata; security).]

Other Sexy New Big Data MPP Tools

Presto: Purpose-Built MPP Engine; Similar Architecture to Impala; Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster

Shark: Hive-Compatible Data Warehouse for Spark; Great Performance until Required to go to Disk, at Which Point Impala Better; With HDFS Caching Impala will Perform on Par from a Memory Perspective

Drill: Open Source version of Dremel; Another MPP Engine; Multiple Data Formats and Sources

Phoenix – Sort Of: SQL Skin over HBase (and Only HBase); Subset of SQL Standard

What About an EDW/RDBMS?

“Right Tool for the Right Job”

EDW/RDBMS Great For:
•  OLTP’s complex transactions
•  Highly planned and optimized known workloads
•  Operational reports and repeated known queries

Impala Great For:
•  Exploratory analytics with previously-unknown queries
•  Queries on big and growing data sets

EDW/RDBMS Can’t:
•  Dump in raw data then later define schema and query what you want
•  Evolve schemas without an expensive schema upgrade planning process
•  Simply scale just by adding industry-standard servers
•  Store at < $1k/TB instead of $10-150k/TB
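The storage figures on this slide line up with the "10% to 1% the cost" claim made earlier in the deck. The short calculation below treats $1k/TB as the Hadoop upper bound and $10k to $150k per TB as the EDW range, exactly as quoted; the variable names are made up for the example.

```python
hadoop_cost_per_tb = 1_000                 # slide: "< $1k/TB" (upper bound)
edw_cost_range_per_tb = (10_000, 150_000)  # slide: "$10-150k/TB"

# Hadoop cost as a fraction of EDW cost at each end of the quoted range.
ratios = [hadoop_cost_per_tb / edw for edw in edw_cost_range_per_tb]

for edw, ratio in zip(edw_cost_range_per_tb, ratios):
    print(f"EDW at ${edw:,}/TB: Hadoop storage costs at most {ratio:.1%} as much")
```

At the cheap end of the EDW range the ratio is 10%, and at the expensive end it drops below 1%.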


Impala Technical Details

The Impala Advantage

Where does the Performance Come From?

No MapReduce; No JVM; All Native
In-Memory Data Transfers
Saturate Disks on Reads
Optimized File Format (i.e. Parquet)
In-Memory HDFS Caching
Cost-Based Join Order Optimization – Frees User from Having to Guess the Correct Join Order
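The idea behind cost-based join ordering can be shown with a toy cost model: estimate the size of every intermediate result for each possible order and pick the cheapest. The row counts, the single fixed selectivity, and the exhaustive search below are all assumptions for illustration; this is not Impala's optimizer.

```python
from itertools import permutations

# Hypothetical table statistics (row counts) and one made-up join selectivity.
ROW_COUNTS = {"sales": 1_000_000, "stores": 100, "dates": 365}
SELECTIVITY = 0.001


def plan_cost(order, row_counts=ROW_COUNTS, selectivity=SELECTIVITY):
    """Sum of estimated intermediate result sizes for a left-deep join plan."""
    rows = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * row_counts[table] * selectivity
        cost += rows
    return cost


def best_join_order(row_counts=ROW_COUNTS, selectivity=SELECTIVITY):
    """Try every order and keep the one with the lowest estimated cost."""
    return min(
        permutations(row_counts),
        key=lambda order: plan_cost(order, row_counts, selectivity),
    )


best = best_join_order()
naive_cost = plan_cost(("sales", "dates", "stores"))
```

Under this model the two small tables are combined first and the large table is joined last, which keeps intermediate results small without the user having to order the joins by hand.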

Impala and Hive

Shares Everything Client-Facing
§  Metadata (table definitions)
§  ODBC/JDBC drivers
§  SQL syntax (Hive SQL)
§  Flexible file formats
§  Machine pool
§  Hue GUI

But Built for Different Purposes
§  Hive: runs on MapReduce and ideal for batch processing
§  Impala: native MPP query engine ideal for interactive SQL

[Diagram: for batch processing, Hive supplies the SQL syntax and MapReduce the compute framework; for interactive SQL, Impala supplies both SQL syntax and compute framework. Both sit on shared resource management, integration, metadata, and storage (HDFS: text, RCFile, Parquet, Avro, etc.; HBase: records).]

Impala Query Execution

[Diagram: a SQL app connects over ODBC. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode (DN) and HBase. Shared services: Hive Metastore, HDFS NameNode (NN), Statestore.]

1) Request arrives via ODBC/JDBC/HUE/Shell

2) Planner turns request into collections of plan fragments

3) Coordinator initiates execution on impalad(s) local to data

4) Intermediate results are streamed between impalad(s)

5) Query results are streamed back to client
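The five execution steps can be mimicked in a few lines of Python. In this sketch the "nodes" are dictionary entries, the plan is a list of per-node fragments, and the SQL text is never parsed; the data placement and all function names are invented for illustration and do not reflect Impala's internals.

```python
from collections import Counter

# Hypothetical data placement: rows of (state, amount) stored on each node.
BLOCKS_BY_NODE = {
    "node1": [("ca", 10), ("az", 5)],
    "node2": [("ca", 7), ("nv", 3)],
    "node3": [("az", 1)],
}


def plan(query):
    """Step 2: build one scan fragment per node holding data, plus a merge.

    The query string is only carried along; this sketch does not parse SQL.
    """
    scans = [{"node": node, "op": "partial_sum"} for node in BLOCKS_BY_NODE]
    return {"query": query, "scan": scans, "merge": {"op": "final_sum"}}


def execute_fragment(fragment):
    """Step 3: an executor aggregates only the rows stored on its own node."""
    partial = Counter()
    for state, amount in BLOCKS_BY_NODE[fragment["node"]]:
        partial[state] += amount
    return partial


def coordinate(query):
    """Steps 1 to 5 from the coordinator's point of view."""
    fragments = plan(query)
    total = Counter()
    # Step 4: partial results flow from the executors to the coordinator.
    for fragment in fragments["scan"]:
        total.update(execute_fragment(fragment))
    # Step 5: the merged result goes back to the client.
    return dict(total)


result = coordinate("SELECT state, SUM(amount) FROM sales GROUP BY state")
```

Each node only ever touches its local rows, and only small partial sums travel over the network, which is the point of running fragments local to the data.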

Parquet File Format

Open source, columnar Hadoop file format developed by Cloudera & Twitter

Limits the IO to only the data that is needed

Supports storing each column in a separate file

Saves space: columnar layout compresses better

Enables better scans: load only the columns that are needed

Supports index pages for fast lookup

Extensible value encodings
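A small sketch shows why a columnar layout helps with both scans and compression. It pivots a few row records into per-column lists, answers an aggregate from a single column, and run-length encodes a column of repeated values. The sample rows and function names are made up, and the simple run-length encoder only stands in for Parquet's real value encodings.

```python
rows = [
    {"id": 1, "state": "CA", "amount": 10},
    {"id": 2, "state": "CA", "amount": 20},
    {"id": 3, "state": "CA", "amount": 5},
    {"id": 4, "state": "NV", "amount": 7},
]


def to_columnar(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columnar(rows)

# Column projection: SUM(amount) reads one column instead of whole rows.
total = sum(columns["amount"])

# Similar values sit next to each other, so they encode compactly.
state_runs = run_length_encode(columns["state"])
```

Here the four state values shrink to two runs, and the sum never touches the id or state columns at all.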

Impala Performance Results

•  Impala’s Milestone in Jan 2014:
   •  Comparable commercial MPP DBMS speed
   •  Natively on Hadoop
•  Three Result Sets:
   •  Impala vs Hive 0.12 (Impala 6-70x faster)
   •  Impala vs “DBMS-Y” (Impala average of 2x faster)
   •  Impala scalability (Impala achieves linear scale)
•  Background
   •  20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
   •  Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
   •  Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
   •  Methodical testing (multiple runs, reviewed fairness for competition, etc)
•  Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Enough slides… DEMO TIME!


So What is Cloudera Impala?


What’s Next?

•  Download Hadoop!
   •  CDH available at www.cloudera.com
   •  Try it online: Cloudera Live
•  Cloudera provides pre-loaded VMs
   •  http://tiny.cloudera.com/quickstartvm
•  Ride Impala!
   •  http://impala.io/

Special thanks: SAN DIEGO BIG DATA

Questions?

Preferably related to the talk… or not.

Thank You! Maxime Dumas [email protected] We’re hiring.
