hadoop world: low latency, random reads from hdfs

RadFS

Random Access DFS

DFS in a slide

Locates DataNode, opens socket, says hi

DataNode allocates a thread to stream block contents from opened position

Client reads in as it processes

Great for streaming entire blocks

Positioned reads cut across the grain

A Different Approach

Everything is a positioned read

All interactions with DN are stateless

Wrapped ByteService interfaces Caching, Network, Checksum

Each configurable, can be optimized for task

Connections pooled and re-used often

Fewer threads on server

DFS Anatomy – seek + read()

DFS seek+read con't

Locates preferred block, caches DN locations

Opens new Socket and BlockReader for each random read

Reads from the Socket

Object creation means GC debt

No optimizations for repeated reads of the same data

Threads will consume resources on server after client hangup – files aren't automatically closed

RadFS Overview

RadFS seek+read, con't

Transparently caches frequently read data

Automatically pools/manages file handles

Reduces network congestion (in theory)

Lower DataNode workload 3 threads total instead of 1 per Xceiver

Configurable on client side for the task at hand

Network latency penalty on long-running reads

Checksum implementation means 2 reads per random read if caching is disabled

Implementation Notes

Checksum is currently generated by wrapping CheckSumFileSystem around RadFileSystem

Inefficient, reads 2 files over dfs

Improper – what if checksum block is corrupt?

CachingByteService implements lookahead (good) by copying bytes twice (bad)

Permissions happen “by accident” at namenode Attackable by searching blockid space on DNs

Could exchange UserAccessToken on request

Benchmark Environment

EC2 “Medium” - 2x2GHz, 1.7GB, shared I/O

Operations against 20GB sequence file

All tests run singlethreaded from the lightly loaded namenode

Fast internal network, adequate memory but not enough to page entire file

All benchmarks in a given set were run on the same instance, middle value from 3 runs

Random Reads - 2k

10,000 random reads of 2k each over the length of a 20GB file

DFS averaged 7.8ms while Rad with no cache averaged 4.4ms

Caching added a full 2ms – hardcoded lookahead was no help and lots of unnecessary byte copying

Random Reads – 2kb (avg in ms)

DFS RadFS Rad - No Cache0

2

4

6

8

10

Column 2

SequenceFile Search

Binary search over 10gb sequence file

DFS, RadFS with various cache settings

Indicative of potential filesystem uses Lucene

Large numbers of secondary indices

Ease of development

Read-only RDBMS-like systems built from ETLs or other long-running process

Sequence File Binary Search5000 searches, avg ms per search

DFSRAD0

RAD16MBRAD128MB

0

20

40

60

80

100

120

Column 2

Streaming

DFS is inherently faster for streaming due to the dedicated server thread

Checksumming is expensive!

Early radfs builds beat dfs at 1-byte read()s because they didn't have checksumming

Require a PipeliningByteService for use in streaming jobs that would make requests to Datanode, stream in and checksum in a separate client-side thread

Streaming – 1GB1b reads, time in seconds

DFS RAD no checksum0

50

100

150

200

250

300

350

Column 2

Streaming 1GB2k reads, time in seconds

DFS RAD0

20

40

60

80

100

120

140

160

Column 2

Going forward – modular reader

Going forward - Applications

Could improve Hbase, solves file handle problem and improves latency

Could be used to create low-latency lookup formats accessible from scripting languages

Cache is automatic, simplifying development

“Table” directory with main store file and several secondary index files generated by ETL

Lucene indices? Can be built with MapReduce

Going forward - Development Copy existing HDFS method of interleaving

checksums directly from datanode – one read Audit checksumming code for CPU efficiency

– reading can be CPU bound

Implement as a ByteService instead of clumsy wrapper around FileSystem. Make configurable

Implement PipeliningByteService to improve streaming by pre-fetching pages

Exchange UserAccessToken at each read, could possibly use for encryption of blockid

Contribute!

Patch is at Apache JIRA issue HDFS-516

Will be on GitHub momentarily

Goals: Equivalent streaming performance to DFS

Faster random read, caching option

Lower resource consumption on server

3 doable tasks above

Large configuration space to explore

Email me: [email protected]

hadoop world: low latency, random reads from hdfs

Documents