Introduction to HDFS

DESCRIPTION

A brief introduction to the Hadoop Distributed File System (HDFS): how a file is broken into blocks, written, and replicated on HDFS; how missing replicas are re-replicated; how a job is launched and its status is tracked; and some advantages and limitations of Hadoop 1.x.

TRANSCRIPT

Page 1: Introduction to HDFS


Introduction to HDFS

By: Siddharth Mathur

Instructor: Dr. Shiyong Lu

Page 2: Introduction to HDFS

Big Data

Wikipedia definition:

In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.

Page 3: Introduction to HDFS

How Big is Big Data?

2008: Google processed 20 PB a day

2009: Facebook had 2.5 PB user data + 15 TB/day

2009: eBay had 6.5 PB user data + 50 TB/day

2011: Yahoo! had 180-200 PB of data

2012: Facebook ingests 500 TB/day

Page 4: Introduction to HDFS


HOW TO ANALYZE THIS DATA?

Page 5: Introduction to HDFS


Divide and Conquer

Partition

Combine
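To make the partition/combine idea concrete, here is a minimal sketch in plain Java (illustrative only, not from the slides): the input is partitioned into independent chunks, each chunk is counted in parallel, and the per-chunk results are combined into one map. MapReduce automates exactly this pattern across many machines.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DivideAndConquer {
        public static void main(String[] args) {
            // Partition: the input is already split into independent chunks.
            List<String> chunks = Arrays.asList("big data", "big clusters", "data data");
            Map<String, Long> counts = new ConcurrentHashMap<String, Long>();
            chunks.parallelStream()
                  .flatMap(chunk -> Arrays.stream(chunk.split("\\s+")))
                  // Combine: merge per-word counts into a single result map.
                  .forEach(word -> counts.merge(word, 1L, Long::sum));
            System.out.println(counts); // e.g. {big=2, data=3, clusters=1}
        }
    }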

Page 6: Introduction to HDFS

But Parallel Processing Is Complicated

How do we assign tasks to workers?

What if we have more tasks than slots?

What happens when tasks fail?

How do you handle distributed synchronization?

Page 7: Introduction to HDFS


The Solution!

Google File System

MapReduce

BigTable

Page 8: Introduction to HDFS

GFS to HDFS

It started when Google researchers wrote a paper on a distributed file system to resolve the storage and analysis issues of Big Data.

The researchers proposed a file system named the Google File System (GFS), which in turn gave birth to the Hadoop Distributed File System (HDFS).

The paper on MapReduce resulted in the MapReduce programming model.

The paper on BigTable produced Hadoop HBase, a distributed data store layered over HDFS.

Page 9: Introduction to HDFS


HADOOP DISTRIBUTED FILE SYSTEM

Page 10: Introduction to HDFS

Key Features

Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).

Robust: Because Hadoop is intended to run on commodity hardware, it is architected with the assumption of frequent hardware malfunctions, and it can gracefully handle most such failures.

Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.

Simple: Hadoop allows users to quickly write efficient parallel code.

Page 11: Introduction to HDFS


HDFS Scaling Out

One node performs a task in 45 minutes; scaling out to four nodes performs the same task in roughly 45/4 minutes.

Page 12: Introduction to HDFS


Basic Hadoop Stack

Hadoop Distributed File System

MapReduce

HBase

Higher Level Languages

Page 13: Introduction to HDFS

Hadoop Platforms

Platforms: Unix and Windows.

Linux: the only supported production platform.

Other variants of Unix, like Mac OS X: can run Hadoop for development.

Windows + Cygwin: a development platform (requires openssh).

Java: Java 1.6.x (aka 6.0.x, aka Java 6) is recommended for running Hadoop.

Page 14: Introduction to HDFS

Hadoop Modes

• Standalone (or local) mode
– No daemons run and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

• Pseudo-distributed mode
– The Hadoop daemons run on the local machine, simulating a cluster on a small scale.

• Fully distributed mode
– The Hadoop daemons run on a cluster of machines.
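Which mode a client sees is driven by configuration rather than code. A minimal sketch (my example, not from the slides), assuming the Hadoop 1.x property name fs.default.name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ShowMode {
        public static void main(String[] args) throws Exception {
            // Configuration picks up core-site.xml from the classpath.
            Configuration conf = new Configuration();
            // "file:///" means standalone mode; an hdfs:// URI (for example
            // hdfs://localhost:9000 in pseudo-distributed mode) points at a NameNode.
            System.out.println("fs.default.name = " + conf.get("fs.default.name", "file:///"));
            System.out.println("FileSystem class = " + FileSystem.get(conf).getClass().getName());
        }
    }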

Page 15: Introduction to HDFS


Master-Slave Architecture

Namenode

Jobtracker

Datanode

Tasktracker

Secondary Namenode

Page 16: Introduction to HDFS

Master-Slave Architecture

HDFS has a master-slave architecture.

The master node, or name node, governs the cluster. It takes care of task and resource allocation, and stores all the metadata related to file splitting, block storage, block replication, and task execution status.

The slave nodes, or data nodes, are the ones that store the data blocks and perform task execution.

The tasktracker is the program that runs on each individual data node and monitors task execution on that node.

The jobtracker runs on the name node and monitors the complete job execution.
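As a hedged illustration of the name node's cluster view (roughly what the "hadoop dfsadmin -report" command prints), a client can ask the NameNode for the DataNodes it currently knows about:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ListDataNodes {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            if (fs instanceof DistributedFileSystem) {
                // Ask the NameNode for the DataNodes currently in the cluster.
                for (DatanodeInfo dn : ((DistributedFileSystem) fs).getDataNodeStats()) {
                    System.out.println(dn.getHostName() + "  capacity=" + dn.getCapacity()
                        + "  remaining=" + dn.getRemaining());
                }
            }
        }
    }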

Page 17: Introduction to HDFS


HDFS File Distribution

[Figure: file metadata: FILE-A -> blocks 1, 2, 3 (split into 3 blocks); FILE-B -> blocks 4, 5 (split into 2 blocks). Each block appears on three data nodes (replication factor = 3, set by the "dfs.replication" property in hdfs-site.xml).]

Page 18: Introduction to HDFS

HDFS File Distribution

The name node stores metadata related to:

File splits

Block allocation

Task allocation

Each file is split into data blocks; the default block size is 64 MB. Each data block is replicated on different data nodes. The replication factor is configurable; the default value is 3.
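Both knobs are set through configuration. A minimal sketch (the path is hypothetical; the property names are the Hadoop 1.x ones):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);                  // replication factor (default 3)
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB block size (default)
            FileSystem fs = FileSystem.get(conf);
            // Replication can also be changed per file after it is written:
            fs.setReplication(new Path("/data/file.txt"), (short) 2); // hypothetical path
        }
    }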

Page 19: Introduction to HDFS

Block Placement

Current strategy:

-- One replica on the local node

-- Second replica on a remote rack

-- Third replica on the same remote rack

-- Additional replicas are placed randomly

Clients read from the nearest replica.
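A client can inspect where the replicas actually landed. A hedged sketch using the public FileSystem API (the path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/data/file.txt")); // hypothetical path
            // One BlockLocation per block, listing the hosts holding its replicas.
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + loc.getOffset() + " -> "
                    + String.join(", ", loc.getHosts()));
            }
        }
    }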

Page 20: Introduction to HDFS


Rack awareness

[Figure: twelve data nodes (DN 1-12) across three racks, each rack behind its own switch. The NameNode records: File X = Blk A on DN 1, 5, 6; Blk B on DN 7, 10, 11. Rack 1 = DN 1-4, Rack 2 = DN 5-8, Rack 3 = DN 9-12.]

Page 21: Introduction to HDFS

Rack Awareness

HDFS is aware of the placement of each data node on the racks.

To prevent data loss due to a complete rack failure, Hadoop intelligently replicates each data block onto other racks as well. This helps HDFS recover the data even if a complete rack of data nodes shuts down.

This rack information is stored in the name node.
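The node-to-rack mapping itself comes from a user-supplied topology script, or from a Java class implementing DNSToSwitchMapping (configured in Hadoop 1.x via the topology.node.switch.mapping.impl property; later versions add a reloadCachedMappings() method to the interface). A hedged sketch with a made-up host naming convention:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class SimpleRackMapping implements DNSToSwitchMapping {
        // Map each host name to a rack path; the NameNode uses these
        // paths when choosing replica placement.
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<String>();
            for (String host : names) {
                // Hypothetical convention: hosts dn1..dn4 are in rack 1, etc.
                if (host.matches("dn[1-4]")) racks.add("/rack1");
                else if (host.matches("dn[5-8]")) racks.add("/rack2");
                else racks.add("/rack3");
            }
            return racks;
        }
    }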

Page 22: Introduction to HDFS


File Write in Hadoop

[Figure: the client breaks File.txt into blocks A, B, C using the Hadoop client API and asks the NameNode where to write them. The DataNodes (DN 1-12, in three racks behind switches) send heartbeats; the NameNode creates the metadata (File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in ...) and responds, and the client writes block A to the chosen nodes. The first replica lands in one rack and the next replicas in a different rack: intelligent storage of data.]

Page 23: Introduction to HDFS


File Write in Hadoop

The HDFS client requests that the name node write a file onto HDFS, and provides the file size and other metadata to the name node.

Meanwhile, each slave node sends a heartbeat signal to the name node reporting its status.

Page 24: Introduction to HDFS


File Write in Hadoop

The name node tells the client where to store the data blocks, and tells those data nodes to get ready for the data write.

After the data write completes, the data node sends a success message to both the client and the name node.
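From the client's side, this whole write protocol hides behind a few calls. A minimal sketch (the path and contents are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create() asks the NameNode where to place the blocks; the returned
            // stream pipelines the bytes to the chosen DataNodes.
            FSDataOutputStream out = fs.create(new Path("/user/demo/File.txt")); // hypothetical path
            out.writeUTF("hello HDFS");
            out.close(); // close flushes the last block and reports success
        }
    }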

Page 25: Introduction to HDFS


File Read in Hadoop

[Figure: the client asks the NameNode for File.txt; the NameNode consults its metadata (File.txt = Blk A in DN 1, 5, 6; Blk B in DN 7, 10, 11; Blk C in ...) and responds with an ordered list of nodes for each block. The client then reads each block directly from the nearest DataNode. The DataNodes (DN 1-12, in three racks behind switches) send heartbeats throughout.]
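On the client side, the read path is symmetric to the write path. A minimal sketch (the path is hypothetical): open() fetches the block list from the name node, and reads are then served by the nearest replica.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() fetches the block list from the NameNode; reads then go
            // straight to the nearest DataNode holding each block.
            FSDataInputStream in = fs.open(new Path("/user/demo/File.txt")); // hypothetical path
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }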

Page 26: Introduction to HDFS


Re-replicating missing replicas

Page 27: Introduction to HDFS

Re-replication

Missing heartbeats signify lost nodes.

The name node consults its metadata and finds the affected data.

The name node consults the rack awareness script.

The name node tells the data nodes to re-replicate the missing blocks.

Page 28: Introduction to HDFS

Three Main Configuration Files

core-site.xml

Contains configuration that overrides the default core Hadoop properties.

mapred-site.xml

Contains configuration that overrides the default MapReduce properties. Also defines the host and port that the MapReduce job tracker runs at.

hdfs-site.xml

Used mainly to set the block replication factor.
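All three files surface through one Configuration object. A hedged sketch reading one representative Hadoop 1.x property from each (core-site.xml is loaded automatically; the HDFS and MapReduce components register hdfs-site.xml and mapred-site.xml as additional default resources, so these may print null in a bare JVM):

    import org.apache.hadoop.conf.Configuration;

    public class ShowConfig {
        public static void main(String[] args) {
            // Configuration layers the site files over the built-in defaults.
            Configuration conf = new Configuration();
            System.out.println("fs.default.name    = " + conf.get("fs.default.name"));    // core-site.xml
            System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker")); // mapred-site.xml
            System.out.println("dfs.replication    = " + conf.get("dfs.replication"));    // hdfs-site.xml
        }
    }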

Page 29: Introduction to HDFS


Anatomy of a Job Launch
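The figure for this slide walks through the launch sequence; in code, with the classic Hadoop 1.x org.apache.hadoop.mapred API, the same steps look roughly like the sketch below (job name and paths are hypothetical): the client builds a JobConf, and JobClient ships the job resources to HDFS and submits the job to the jobtracker.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class LaunchJob {
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
                StringTokenizer tok = new StringTokenizer(value.toString());
                while (tok.hasMoreTokens()) out.collect(new Text(tok.nextToken()), ONE);
            }
        }
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> vals,
                               OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
                int sum = 0;
                while (vals.hasNext()) sum += vals.next().get();
                out.collect(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(LaunchJob.class);
            conf.setJobName("wordcount");            // hypothetical job
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);
            FileInputFormat.setInputPaths(conf, new Path("/user/demo/in"));   // hypothetical paths
            FileOutputFormat.setOutputPath(conf, new Path("/user/demo/out"));
            // runJob() ships the job jar and config to HDFS, submits the job
            // to the JobTracker, and polls until it completes.
            JobClient.runJob(conf);
        }
    }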

Page 30: Introduction to HDFS


Job Status updates
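The figure for this slide shows the client polling the jobtracker for progress. A hedged sketch of the same loop with the Hadoop 1.x API, assuming the LaunchJob class from the previous sketch:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class PollStatus {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(LaunchJob.class);
            // ... same job setup as in the LaunchJob sketch above ...
            JobClient client = new JobClient(conf);
            RunningJob job = client.submitJob(conf); // returns without waiting
            while (!job.isComplete()) {
                System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);                  // poll the JobTracker periodically
            }
            System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
        }
    }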

Page 31: Introduction to HDFS

Limitations of Hadoop 1

Scalability

Maximum cluster size: 4,000 nodes for best performance

Maximum concurrent tasks: 40,000

Name node as a single point of failure

A failure kills all running and queued jobs

Jobs need to be re-submitted by the user

Restartability

Restart is very tricky due to complex state

Page 32: Introduction to HDFS

Who Has the Biggest Cluster Setups?

Facebook: 400

Microsoft: 400

LinkedIn: 4,100

Yahoo!: 42,000

Page 34: Introduction to HDFS


THANK YOU