HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS
How HDFS is evolving to meet new needs
✛ Aaron T. Myers > Email: [email protected], [email protected] > Twitter: @atm
✛ Hadoop PMC Member / Committer at ASF ✛ Software Engineer at Cloudera ✛ Primarily work on HDFS and Hadoop Security
✛ HDFS introduction/architecture ✛ Impala introduction/architecture ✛ New requirements for HDFS
> Block replica / disk placement info > Correlated file/block replica placement > In-memory caching for hot files > Short-circuit reads, reduced copy overhead
HDFS INTRODUCTION
✛ HDFS is the Hadoop Distributed File System ✛ Append-only distributed file system ✛ Intended to store many very large files
> Block sizes usually 64MB – 512MB > Files composed of several blocks
✛ Write a file once during ingest ✛ Read a file many times for analysis
✛ HDFS originally designed specifically for Map/Reduce > Each MR task typically operates on one HDFS block > MR tasks run co-located on HDFS nodes > Data locality: move the code to the data
✛ Each block of each file is replicated 3 times > For reliability in the face of machine and drive failures > Provides a few options for data locality during processing
HDFS ARCHITECTURE
✛ Each cluster has… > A single NameNode
∗ Stores file system metadata ∗ Stores the “Block ID” -> DataNode mapping
> Many DataNodes ∗ Store the actual file data
✛ Clients of HDFS… ∗ Communicate with the NameNode to browse the file system and get block locations for files ∗ Communicate directly with DataNodes to read/write file data (see the sketch below)
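As a concrete illustration of that split, here is a minimal Java sketch using the standard Hadoop FileSystem API: the metadata call below is answered by the NameNode, while the actual bytes would later be streamed directly from DataNodes. The file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Metadata request: answered by the NameNode.
    Path file = new Path("/user/example/data.csv");  // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block reports which DataNodes hold a replica of it;
    // actual reads then go directly to one of those DataNodes.
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}

MapReduce and Impala use exactly this kind of location information to schedule work next to the data.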
IMPALA INTRODUCTION
✛ General-purpose SQL query engine: > Should work for both analytical and transactional workloads > Will support queries that take from milliseconds to hours
✛ Runs directly within Hadoop: > Reads widely used Hadoop file formats > Talks directly to HDFS (or HBase) > Runs on the same nodes that run Hadoop processes
✛ Uses HQL as its query language > Hive Query Language – what Apache Hive uses > Very close to complete SQL-92 compliance
✛ Extremely high performance > C++ instead of Java > Runtime code generation > Completely new execution engine that doesn't build on MapReduce
✛ Runs as a distributed service in the cluster > One Impala daemon on each node with data > Doesn’t use Hadoop Map/Reduce at all
✛ User submits a query via ODBC/JDBC to any of the daemons (see the JDBC sketch below)
✛ The query is distributed to all nodes with relevant data
✛ If any node fails, the query fails and is re-executed
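For example, a client can submit a query over JDBC to any impalad. Below is a minimal sketch, assuming the Hive JDBC driver is on the classpath, Impala's default HiveServer2-compatible port 21050, and an unsecured cluster; the host name and query are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
  public static void main(String[] args) throws Exception {
    // Any impalad in the cluster can accept the query and act as its coordinator.
    String url = "jdbc:hive2://impalad-host.example.com:21050/default;auth=noSasl";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT count(*) FROM lineitem")) {  // placeholder query
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}

Whichever impalad receives the query acts as the coordinator for that query.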
IMPALA ARCHITECTURE
✛ Two daemons: impalad and statestored
✛ Impala daemon (impalad) > Handles client requests > Handles all internal requests related to query execution
✛ State store daemon (statestored) > Provides the name service for cluster membership > Distributes Hive table metadata
✛ Query execution phases > Request arrives at an impalad via ODBC/JDBC > Planner turns the request into a collection of plan fragments
∗ Plan fragments may be executed in parallel
> Coordinator impalad initiates execution of plan fragments on remote impalad daemons
✛ During execution > Intermediate results are streamed between executors > Query results are streamed back to client
✛ During execution, impalad daemons connect directly to HDFS/HBase to read/write data
HDFS IMPROVEMENTS MOTIVATED BY IMPALA
✛ Impala is concerned with very low latency queries > Needs to make the best use of available aggregate disk throughput
✛ Impala’s more efficient execution engine is far more likely to be I/O bound compared to Hive > Implies that for many queries the best performance improvement will come from improved I/O
✛ Impala query execution has no shuffle phase > Implies that joins between tables do not necessitate all-to-all communication
✛ Expose HDFS block replica disk location information
✛ Allow for explicitly co-located block replicas across files
✛ In-memory caching of hot tables/files ✛ Reduced copies during reading, short-circuit reads
✛ The problem: the NameNode knows which DataNodes block replicas are on, but not which disks > Only the DNs are aware of the block replica -> disk mapping
✛ Impala wants to make sure that separate plan fragments operate on data on separate disks > Maximize aggregate available disk throughput
✛ The solution: add a new RPC call to the DataNodes to expose which volumes (disks) replicas are stored on (see the sketch below)
✛ During the query planning phase, impalad… > Determines all DNs the query’s data is stored on > Queries those DNs to get volume information
✛ During the query execution phase, impalad… > Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
✛ With this additional info, Impala is able to ensure disk reads are large and seeks are minimized
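In Hadoop 2.x clients this was exposed as an extension of the normal block-location call (HDFS-3672). Below is a minimal sketch assuming that era's DistributedFileSystem API and that "dfs.datanode.hdfs-blocks-metadata.enabled" is set to true on the DataNodes; the file path is a placeholder.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.BlockStorageLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.VolumeId;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class VolumeInfoExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path file = new Path("/user/hive/warehouse/lineitem/part-00000");  // placeholder path
    FileStatus status = dfs.getFileStatus(file);
    BlockLocation[] blocks = dfs.getFileBlockLocations(status, 0, status.getLen());

    // Extra round trip to the DataNodes to learn which volume (disk)
    // each replica actually lives on.
    BlockStorageLocation[] storage =
        dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
    for (BlockStorageLocation loc : storage) {
      String[] hosts = loc.getHosts();
      VolumeId[] volumes = loc.getVolumeIds();
      for (int i = 0; i < hosts.length; i++) {
        System.out.println(hosts[i] + " -> volume " + volumes[i]);
      }
    }
  }
}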
✛ The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN
✛ Local reads at full disk throughput: ~800 MB/s ✛ Remote reads over a 1 gigabit network: ~128 MB/s ✛ Ideally all reads should be done from local disks
✛ The solution: add a feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
✛ Gives Impala more control over how data is laid out
✛ Can ensure that tables/files which are joined frequently have their data co-located (see the sketch below)
✛ Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
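The placement feature itself was still being designed at the time, so there is no settled client API to show here. As a stand-in, the sketch below only checks whether two files' first-block replicas already share DataNodes, using nothing but the standard getFileBlockLocations call; the paths are placeholders.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckColocation {
  // DataNode hosts that hold a replica of the file's first block.
  static Set<String> firstBlockHosts(FileSystem fs, Path path) throws Exception {
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, 1);
    return new HashSet<>(Arrays.asList(blocks[0].getHosts()));
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Set<String> a = firstBlockHosts(fs, new Path("/warehouse/orders/part-0"));    // placeholder
    Set<String> b = firstBlockHosts(fs, new Path("/warehouse/lineitem/part-0"));  // placeholder

    // If the replicas are co-located, the intersection is non-empty and a
    // join between the two files can be executed with purely local reads.
    a.retainAll(b);
    System.out.println("Shared replica hosts: " + a);
  }
}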
✛ The problem: Impala queries are often bottlenecked at maximum disk throughput
✛ Memory throughput is much higher ✛ Memory is getting cheaper/denser
> Routinely seeing DNs with 48GB-96GB of RAM
✛ We’ve observed substantial Impala speedups when file data ends up in OS buffer cache
✛ The solution: add a facility to HDFS to explicitly cache specific HDFS files in main memory (see the sketch below)
✛ Allows Impala to read data at full memory bandwidth (several GB/s)
✛ Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory > Don’t want an MR job to inadvertently evict that data from memory via the OS buffer cache
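This facility later landed in HDFS as centralized cache management (HDFS-4949). Below is a minimal sketch assuming a Hadoop 2.3+ client; the pool name and table path are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CacheHotTable {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // A cache pool lets the operator bound how much DataNode memory
    // a group of directives may pin.
    dfs.addCachePool(new CachePoolInfo("impala-hot"));  // placeholder pool name

    // Pin one replica of every block under this table's directory in memory.
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/user/hive/warehouse/lineitem"))  // placeholder table path
            .setPool("impala-hot")
            .setReplication((short) 1)
            .build());
    System.out.println("Added cache directive " + directiveId);
  }
}

The same directives can be managed from the command line with "hdfs cacheadmin -addPool" and "hdfs cacheadmin -addDirective".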
✛ The problem: for a typical HDFS read, the data must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
✛ All of these extraneous copies use unnecessary memory and CPU resources
✛ The solution: allow reads to be performed directly on local block files, using direct buffers
✛ Added a facility to HDFS that lets reads completely bypass the DataNode when the client is co-located with the block replica files
✛ Added an API in libhdfs to supply direct byte buffers to HDFS read operations, reducing the number of copies to the bare minimum (see the sketch below)
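On the Java client side this shows up as the short-circuit read configuration plus the zero-copy ByteBuffer read API. Below is a minimal sketch assuming a Hadoop 2.x client and a cluster with short-circuit reads enabled; the domain socket path and file path are placeholders.

import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Let a co-located client read block files directly, bypassing the DataNode.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");  // placeholder socket path
    FileSystem fs = FileSystem.get(conf);

    ElasticByteBufferPool pool = new ElasticByteBufferPool();
    try (FSDataInputStream in = fs.open(new Path("/user/example/data.parquet"))) {  // placeholder
      // Zero-copy read: the returned buffer may be backed by an mmap of the
      // local block file rather than a copy into client memory.
      ByteBuffer buf = in.read(pool, 1024 * 1024, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
      if (buf != null) {
        System.out.println("Read " + buf.remaining() + " bytes without extra copies");
        in.releaseBuffer(buf);
      }
    }
  }
}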
✛ For simpler queries (no joins, tpch-q*) on large datasets (1TB) > 5-10x faster than Hive
✛ For complex queries on large datasets (1TB) > 20-50x faster than Hive
✛ For complex queries out of buffer cache (300GB) > 25-150x faster than Hive
✛ Due to Impala’s improved execution engine, low startup time, improved I/O, etc.