HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS
How HDFS is evolving to meet new needs
✛ Aaron T. Myers > Email: [email protected], [email protected] > Twitter: @atm
✛ Hadoop PMC Member / Committer at ASF ✛ Software Engineer at Cloudera ✛ Primarily work on HDFS and Hadoop Security
✛ HDFS introduction/architecture ✛ Impala introduction/architecture ✛ New requirements for HDFS
> Block replica / disk placement info > Correlated file/block replica placement > In-memory caching for hot files > Short-circuit reads, reduced copy overhead
HDFS INTRODUCTION
✛ HDFS is the Hadoop Distributed File System ✛ Append-only distributed file system ✛ Intended to store many very large files
> Block sizes usually 64MB – 512MB > Files composed of several blocks
✛ Write a file once during ingest ✛ Read a file many times for analysis
✛ HDFS originally designed specifically for Map/Reduce > Each MR task typically operates on one HDFS block > MR tasks run co-located on HDFS nodes > Data locality: move the code to the data
✛ Each block of each file is replicated 3 times > For reliability in the face of machine and drive failures > Provides a few options for data locality during processing
HDFS ARCHITECTURE
✛ Each cluster has… > A single NameNode
∗ Stores file system metadata ∗ Stores the “Block ID” -> DataNode mapping
> Many DataNodes ∗ Store the actual file data
✛ Clients of HDFS… ∗ Communicate with the NameNode to browse the file system and get block locations for files ∗ Communicate directly with DataNodes to read/write file data (see the sketch below)
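As a concrete illustration of that split, here is a minimal Java sketch using the standard Hadoop FileSystem API: the metadata call below is answered by the NameNode, while the actual bytes would later be streamed directly from DataNodes. The file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Metadata request: answered by the NameNode.
    Path file = new Path("/user/example/data.csv");  // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

    // Each block reports which DataNodes hold a replica of it;
    // actual reads then go directly to one of those DataNodes.
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
  }
}

MapReduce and Impala use exactly this kind of location information to schedule work next to the data.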
IMPALA INTRODUCTION
✛ General-purpose SQL query engine: > Should work for both analytical and transactional workloads > Will support queries that take from milliseconds to hours
✛ Runs directly within Hadoop: > Reads widely used Hadoop file formats > Talks directly to HDFS (or HBase) > Runs on the same nodes that run Hadoop processes
✛ Uses HQL as its query language > Hive Query Language – what Apache Hive uses > Very close to complete SQL-92 compliance
✛ Extremely high performance > C++ instead of Java > Runtime code generation > Completely new execution engine that doesn't build on MapReduce
✛ Runs as a distributed service in the cluster > One Impala daemon on each node with data > Doesn’t use Hadoop Map/Reduce at all
✛ User submits a query via ODBC/JDBC to any of the daemons (see the JDBC sketch below)
✛ The query is distributed to all nodes with relevant data
✛ If any node fails, the query fails and is re-executed
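For example, a client can submit a query over JDBC to any impalad. Below is a minimal sketch, assuming the Hive JDBC driver is on the classpath, Impala's default HiveServer2-compatible port 21050, and an unsecured cluster; the host name and query are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
  public static void main(String[] args) throws Exception {
    // Any impalad in the cluster can accept the query and act as its coordinator.
    String url = "jdbc:hive2://impalad-host.example.com:21050/default;auth=noSasl";

    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT count(*) FROM lineitem")) {  // placeholder query
      while (rs.next()) {
        System.out.println(rs.getLong(1));
      }
    }
  }
}

Whichever impalad receives the query acts as the coordinator for that query.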
IMPALA ARCHITECTURE
✛ Two daemons: impalad and statestored
✛ Impala daemon (impalad) > Handles client requests > Handles all internal requests related to query execution
✛ State store daemon (statestored) > Provides the name service for cluster membership > Distributes Hive table metadata
✛ Query execution phases > Request arrives at an impalad via ODBC/JDBC > Planner turns the request into a collection of plan fragments
∗ Plan fragments may be executed in parallel
> Coordinator impalad initiates execution of plan fragments on remote impalad daemons
✛ During execution > Intermediate results are streamed between executors > Query results are streamed back to client
✛ During execution, impalad daemons connect directly to HDFS/HBase to read/write data
HDFS IMPROVEMENTS MOTIVATED BY IMPALA
✛ Impala is concerned with very low latency queries > Needs to make the best use of available aggregate disk throughput
✛ Impala’s more efficient execution engine is far more likely to be I/O bound compared to Hive > Implies that for many queries the best performance improvement will come from improved I/O
✛ Impala query execution has no shuffle phase > Implies that joins between tables do not necessitate all-to-all communication
✛ Expose HDFS block replica disk location information
✛ Allow for explicitly co-located block replicas across files
✛ In-memory caching of hot tables/files ✛ Reduced copies during reading, short-circuit reads
✛ The problem: the NameNode knows which DataNodes block replicas are on, but not which disks > Only the DNs are aware of the block replica -> disk mapping
✛ Impala wants to make sure that separate plan fragments operate on data on separate disks > Maximize aggregate available disk throughput
✛ The solution: add a new RPC call to the DataNodes to expose which volumes (disks) replicas are stored on (see the sketch below)
✛ During the query planning phase, impalad… > Determines all DNs the query’s data is stored on > Queries those DNs to get volume information
✛ During the query execution phase, impalad… > Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
✛ With this additional info, Impala is able to ensure disk reads are large and seeks are minimized
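In Hadoop 2.x clients this was exposed as an extension of the normal block-location call (HDFS-3672). Below is a minimal sketch assuming that era's DistributedFileSystem API and that "dfs.datanode.hdfs-blocks-metadata.enabled" is set to true on the DataNodes; the file path is a placeholder.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.BlockStorageLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.VolumeId;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class VolumeInfoExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes fs.defaultFS points at an HDFS cluster.
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

    Path file = new Path("/user/hive/warehouse/lineitem/part-00000");  // placeholder path
    FileStatus status = dfs.getFileStatus(file);
    BlockLocation[] blocks = dfs.getFileBlockLocations(status, 0, status.getLen());

    // Extra round trip to the DataNodes to learn which volume (disk)
    // each replica actually lives on.
    BlockStorageLocation[] storage =
        dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
    for (BlockStorageLocation loc : storage) {
      String[] hosts = loc.getHosts();
      VolumeId[] volumes = loc.getVolumeIds();
      for (int i = 0; i < hosts.length; i++) {
        System.out.println(hosts[i] + " -> volume " + volumes[i]);
      }
    }
  }
}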
✛ The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN
✛ Local reads at full disk throughput: ~800 MB/s ✛ Remote reads over a 1 gigabit network: ~128 MB/s ✛ Ideally all reads should be done from local disks
✛ The solution: add a feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
✛ Gives Impala more control over how data is laid out
✛ Can ensure that tables/files which are joined frequently have their data co-located (see the sketch below)
✛ Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
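The placement feature itself was still being designed at the time, so there is no settled client API to show here. As a stand-in, the sketch below only checks whether two files' first-block replicas already share DataNodes, using nothing but the standard getFileBlockLocations call; the paths are placeholders.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckColocation {
  // DataNode hosts that hold a replica of the file's first block.
  static Set<String> firstBlockHosts(FileSystem fs, Path path) throws Exception {
    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, 1);
    return new HashSet<>(Arrays.asList(blocks[0].getHosts()));
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Set<String> a = firstBlockHosts(fs, new Path("/warehouse/orders/part-0"));    // placeholder
    Set<String> b = firstBlockHosts(fs, new Path("/warehouse/lineitem/part-0"));  // placeholder

    // If the replicas are co-located, the intersection is non-empty and a
    // join between the two files can be executed with purely local reads.
    a.retainAll(b);
    System.out.println("Shared replica hosts: " + a);
  }
}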
✛ The problem: Impala queries are often bottlenecked at maximum disk throughput
✛ Memory throughput is much higher ✛ Memory is getting cheaper/denser
> Routinely seeing DNs with 48GB-96GB of RAM
✛ We’ve observed substantial Impala speedups when file data ends up in OS buffer cache
✛ The solution: add a facility to HDFS to explicitly cache specific HDFS files in main memory (see the sketch below)
✛ Allows Impala to read data at full memory bandwidth (several GB/s)
✛ Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory > Don’t want an MR job to inadvertently evict that data from memory via the OS buffer cache
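This facility later landed in HDFS as centralized cache management (HDFS-4949). Below is a minimal sketch assuming a Hadoop 2.3+ client; the pool name and table path are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class CacheHotTable {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // A cache pool lets the operator bound how much DataNode memory
    // a group of directives may pin.
    dfs.addCachePool(new CachePoolInfo("impala-hot"));  // placeholder pool name

    // Pin one replica of every block under this table's directory in memory.
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/user/hive/warehouse/lineitem"))  // placeholder table path
            .setPool("impala-hot")
            .setReplication((short) 1)
            .build());
    System.out.println("Added cache directive " + directiveId);
  }
}

The same directives can be managed from the command line with "hdfs cacheadmin -addPool" and "hdfs cacheadmin -addDirective".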
✛ The problem: for a typical HDFS read, the data must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
✛ All of these extraneous copies use unnecessary memory and CPU resources
✛ The solution: allow reads to be performed directly on local block files, using direct buffers
✛ Added a facility to HDFS that lets reads completely bypass the DataNode when the client is co-located with the block replica files
✛ Added an API in libhdfs to supply direct byte buffers to HDFS read operations, reducing the number of copies to the bare minimum (see the sketch below)
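On the Java client side this shows up as the short-circuit read configuration plus the zero-copy ByteBuffer read API. Below is a minimal sketch assuming a Hadoop 2.x client and a cluster with short-circuit reads enabled; the domain socket path and file path are placeholders.

import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Let a co-located client read block files directly, bypassing the DataNode.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");  // placeholder socket path
    FileSystem fs = FileSystem.get(conf);

    ElasticByteBufferPool pool = new ElasticByteBufferPool();
    try (FSDataInputStream in = fs.open(new Path("/user/example/data.parquet"))) {  // placeholder
      // Zero-copy read: the returned buffer may be backed by an mmap of the
      // local block file rather than a copy into client memory.
      ByteBuffer buf = in.read(pool, 1024 * 1024, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
      if (buf != null) {
        System.out.println("Read " + buf.remaining() + " bytes without extra copies");
        in.releaseBuffer(buf);
      }
    }
  }
}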
✛ For simpler queries (no joins, tpch-q*) on large datasets (1TB) > 5-10x faster than Hive
✛ For complex queries on large datasets (1TB) > 20-50x faster than Hive
✛ For complex queries out of buffer cache (300GB) > 25-150x faster than Hive
✛ Due to Impala’s improved execution engine, low startup time, improved I/O, etc.