anatomy of file read in hadoop
DESCRIPTION
This presentation explains anatomy of Hadoop file read. Few slides are animated. So its better you download the slide and read it.TRANSCRIPT
ANATOMY
OF FIL
E READ
IN H
ADOOP
RAJESH A
NANDA KUMAR
RAJESH_1
290K@YA
HOO.COM
A SAMPLE HADOOP CLUSTER
Name Node
R1N1 R1N2 R1N3 R1N4
Rack R1
Rack R2
R2N1 R2N2 R2N3 R2N4
1. This is a Hadoop cluster with one name node and two racks named R1 and R2 in a data center D1. Each rack has 4 nodes and they are uniquely identified as R1N1, R1N2 and so on.
2. Replication factor is 3.3. HDFS block size is 64 MB.4. This cluster is used as an example to explain the concepts.
Data center D1
FACTS TO BE KNOW
1. Name node saves part of HDFS metadata like file location, permission, etc. in files called namespace image and edit logs. Files are stored in HDFS as blocks. These block information are not saved in any file. Instead it is gathered every time the cluster is started. And this information is stored in name node’s memory.
2. Replica Placement : Assuming the replication factor is 3; When a file is written from a data node (say R1N1), Hadoop attempts to save the first replica in same data node (R1N1). Second replica is written into another node (R2N2) in a different rack (R2). Third replica is written into another node (R2N1) in the same rack (R2) where the second replica was saved.
3. Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. The levels can be like; “Data Center” > “Rack” > “Node”. Example; ‘/d1/r1/n1’ is a representation for a node named n1 on rack r1 in data center d1. Distance calculation has 4 possible scenarios as;1. distance(/d1/r1/n1, /d1/r1/n1) = 0 [Processes on
same node]2. distance(/d1/r1/n1, /d1/r1/n2) = 2 [different node is
same rack]3. distance(/d1/r1/n1, /d1/r2/n3) = 4 [node in different
rack in same data center]4. distance(/d1/r1/n1, /d2/r3/n4) = 6 [node in different
data center]
ANATOMY OF FILE READ – HAPPY PATH
Name Node
R1N1 R1N2 R1N3 R1N4
Rack R1
Rack R2
R2N1 R2N2 R2N3 R2N4
• Let’s assume a file named “sfo_crime.csv” of size 192 MB is saved in this cluster. • Also assume that the file was written from node R1N1.
sfo_crimes.csv
B1 B2 B3
B3 B1 B1 B3 B2 B2
B1
B2
B3
R2N1 R2N2R1N1
• The file is split into 3 blocks each of size 64 MB. And each block is copied 3 times in the cluster.
• Along with data, a checksum will be saved in each block. This is used to ensure the data read
from the block is read with out error.
R2N3 R2N4R1N1
R2N3 R2N1R1N1
Metadata
• When cluster is started, the metadata will look as shown on top right corner.
• Metadata is written in name node.
• When the cluster is up and running, the name node looks like how its shown here (right-side).• Let’s say we are trying to read the “sfo_crimes.csv” file from R1N2.• So a HDFS Client program will run on R1N2’s JVM.
RIN2 JVM
HDFS
Client
DistributedFileSystem Name Node
open()
B1
B2
B3
R2N1
R1N1
R2N2
R2N3
R1N1
R2N4
R2N3
R1N1
R2N1
Metadata
RPC call to get first few blocks of file
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• DFS makes a RPC call returns first few blocks on the file. NN returns the address of the
DN ORDERED with respect to the node from where the read is performed.
DFSInputStream
FSDataInputStream
• The block information is saved in DFSInputStream which is wrapped in FSDataInputStream.
• In response to ‘FileSystem.open()’, HDFS Client receives this FSDataInputStream.
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• First the HDFS client program calls the method open() on a Java class DistributedFileSystem (subclass of FileSystem).
HDFS
Client
RIN2 JVM
DFSInputStream
FSDataInputStream
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• From now on HDFS Client deals with FSDataInputStream (FSDIS).• HDFS Client invokes read() on the stream.
• Blocks are read in order. DFSIS connects to the closest node (R1N1) to read block B1. • DFSIS connects to data node and streams data to client, which calls read() repeatedly on the stream. DFSIS verifies checksums for the data transferred to client.
read()
R1N1
DFSIS connects to R1N1 to read block B1
Data streamed to client directly from data node.
• When the block is read completely, DFSIS closes the connection.
Name Node
HDFS
Client
RIN2 JVM
DFSInputStream
FSDataInputStream
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• Next DFSIS attempts to read block B2. As mentioned earlier, the previous connection is closed and a fresh connection is made to the closest node (R1N1) of block B2.
read()
R1N1
DFSIS connects to R1N1 to read block B2
Data streamed to client directly from data node.
Name Node
HDFS
Client
RIN2 JVM
DFSInputStream
FSDataInputStream
• Now DFSIS has read all blocks returned by the first RPC call (B1 & B2). But the file is not read completely. In our case there is one more block to read.• DFSIS calls name node to get data node locations for next batch of blocks as needed.
read()
R1N1
DFSIS connects to R1N1 to read block B3
Data streamed to client directly from data node.
Name NodeB3 (R1N1, R2N1, R2N3)
B3 (R1N1, R2N1, R2N3)
close()
• After the complete file is read for the HDFS client call close().
ANATOMY OF FILE READ – DATA NODE CONNECTION ERROR
HDFS
Client
RIN2 JVM
DFSInputStream
FSDataInputStream
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• Let’s say there is some error while connecting to R1N1.
read()
R1N1
DFSIS connects to R1N1 to read block B2
Data streamed to client directly from data node.
Name Node
R1N1
• DFSIS remembers this info, so it won’t try to read from R1N1 for future blocks. Then it tries to connect to next closest node (R2N3).
R2N3
ANATOMY OF FILE READ – DATA NODE CHECKSUM ERROR
HDFS
Client
RIN2 JVM
DFSInputStream
FSDataInputStream
B1 (R1N1, R2N1, R2N2)B2 (R1N1, R2N3, R2N4)
• Let’s say there is a checksum error. This means the block is corrupt.
read()
R1N1
DFSIS connects to R1N1 to read block B2
Data streamed to client directly from data node.
Name Node
• Information about this corrupt block is sent to name node. Then DFSIS tries to connect to next closest node (R2N3).
R2N3
Inform name node that the block in R1N1 is corrupt.