Day7.HDFS & Architecture

Upload: riteshaladdin

Post on 17-Jul-2016

Page 1: Day7.HDFS & Architecture

http://www.excelonlineclasses.co.nr/

Page 2: Day7.HDFS & Architecture

Excel Online Classes offers the following services:

- Online Training
- Development
- Testing
- Job support
- Technical Guidance
- Job Consultancy
- Any needs of IT Sector

Page 3: Day7.HDFS & Architecture

HDFS

- Nagarjuna K

Page 4: Day7.HDFS & Architecture

HDFS

- Distributed file system designed to run on commodity hardware
- Provides high-throughput access to application data; suitable for applications with large datasets

Page 5: Day7.HDFS & Architecture

Assumptions & Goals

- Hardware failure
- Streaming data access
- Large datasets
- Simple coherency model
- Moving computation is cheaper than moving data

Page 6: Day7.HDFS & Architecture

Assumptions & Goals: Hardware Failure

- An HDFS instance consists of many machines, each storing part of the data
- The chance that some machine goes down can’t be avoided
- Detection of faults and automatic recovery is a core architectural goal of HDFS
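
To make the point above concrete, here is a rough sketch of how the chance of at least one machine being down grows with cluster size. It assumes machines fail independently; the per-machine failure probability and the cluster sizes are hypothetical illustration values, not figures from the slides.

```python
def prob_any_failure(num_machines: int, p_single: float) -> float:
    """Probability that at least one of num_machines is down,
    assuming each fails independently with probability p_single."""
    return 1.0 - (1.0 - p_single) ** num_machines


if __name__ == "__main__":
    # 0.001 is a hypothetical per-machine failure probability for some period.
    for n in (1, 100, 1000):
        print(n, round(prob_any_failure(n, 0.001), 3))
```

Under these assumptions, a failure that is rare for one machine becomes more likely than not once the cluster reaches about a thousand machines, which is why HDFS treats recovery as a design goal rather than an exception.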

Page 7: Day7.HDFS & Architecture

Assumptions & Goals: Streaming Data Access

- HDFS is designed for batch processing rather than interactive use
- The emphasis is on data throughput, not on low-latency data access

Page 8: Day7.HDFS & Architecture

Assumptions & Goals: Streaming Data Access

- HDFS is built on the idea of a “write once, read many times” pattern
- Over time, a dataset is generated and placed in HDFS
- Analysis is done on a large part of the data, rather than on the first few records
- The time to read the whole dataset matters more than the latency of retrieving the first or the last record

Page 9: Day7.HDFS & Architecture

Assumptions & Goals: Large Datasets

- A typical file ranges from gigabytes to terabytes in size

Page 10: Day7.HDFS & Architecture

Assumptions & Goals: Simple Coherency Model

- HDFS is built on the idea of a “write once, read many times” pattern
- This assumption enables high-throughput access

Page 11: Day7.HDFS & Architecture

Assumptions & Goals: Moving Computation or Data?

- Computation-intensive programming
- Data-intensive programming

Page 12: Day7.HDFS & Architecture

Where HDFS doesn’t fit

- Low-latency data access
- Lots of small files
- Multiple writers, arbitrary file modifications

Page 13: Day7.HDFS & Architecture

Where HDFS doesn’t fit

- Low-latency data access
- Lots of small files
  - High latency time
  - Each file (say, 10 KB in size) takes up a block in HDFS
  - Compress
  - All the metadata is stored in memory on the NameNode
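
The cost of small files can be estimated with a back-of-the-envelope calculation. The sketch below assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file and per block; that figure and the function name are assumptions for illustration and do not come from the slides.

```python
# Rule-of-thumb assumption (not from the slides): each file entry and each
# block entry costs roughly 150 bytes of NameNode memory.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode memory estimate: one object per file plus one per block."""
    if num_files < 0 or blocks_per_file < 0:
        raise ValueError("counts must be non-negative")
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for files in (1_000_000, 100_000_000):
        mb = namenode_memory_bytes(files) / 1_000_000
        print(f"{files} single-block files -> about {mb:.0f} MB of metadata")
```

The estimate depends on the number of files, not on how much data they hold, so many tiny files exhaust NameNode memory long before they fill the disks.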

Page 14: Day7.HDFS & Architecture

Where HDFS doesn’t fit

- Multiple writers, arbitrary file modifications
  - A file in HDFS is written by a single writer
  - Data can be appended only at the end of the file
  - Multiple sources writing into the same file, or writing at an arbitrary offset, is not supported (currently)

Page 15: Day7.HDFS & Architecture

Blocks

- A disk has a block size: the minimum amount of data that can be read or written
  - 512 bytes
- File system blocks are a few multiples of the disk block size
  - A few KB

Page 16: Day7.HDFS & Architecture

Blocks

- In a classical FS, a single block may contain data of only a single file
  - Leads to internal fragmentation
- Newer file systems solve this problem by
  - block suballocation
  - tail merging

Page 17: Day7.HDFS & Architecture

Blocks

- HDFS also has a block size: 64 MB
- Unlike a normal FS, a file smaller than 64 MB doesn’t occupy a full 64 MB of underlying storage
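
As an illustration of the point above, the following sketch splits a file size into block sizes using the 64 MB figure from the slide. The function name is hypothetical, and the code only models the arithmetic; it does not call any Hadoop API.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block a file of file_size bytes occupies.

    Only the last block may be smaller than block_size, and it takes up
    only as much storage as the data it actually holds.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, tail = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if tail:
        sizes.append(tail)
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    print(split_into_blocks(10 * 1024))   # one small block for a 10 KB file
    print(split_into_blocks(150 * mb))    # two full blocks plus a 22 MB tail
```

In this model a 10 KB file yields a single 10 KB block, so the storage used always sums to the file size rather than to a multiple of 64 MB.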

Page 18: Day7.HDFS & Architecture

Why BIG BLOCK size?

- Throughput vs latency
  - Time to seek to the start of the block
  - Time to read the whole block

Page 19: Day7.HDFS & Architecture

Why BIG BLOCK size?

- Seek time = 10 ms
- Transfer rate (throughput) = 100 MB/s
- To make the seek time 1% of the transfer time, block size = 100 MB
- Default is 64 MB
- As the transfer rate increases, the block size can be increased
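
The arithmetic behind the 100 MB figure can be checked with a short sketch. It uses the seek time, transfer rate, and 1% target from the slide; the function and parameter names are hypothetical.

```python
def block_size_mb(seek_time_s: float, transfer_rate_mb_s: float,
                  seek_fraction: float) -> float:
    """Block size (in MB) at which the seek time equals seek_fraction
    of the time needed to transfer the block."""
    if seek_fraction <= 0:
        raise ValueError("seek_fraction must be positive")
    # Transfer time that makes the seek time the requested fraction of it.
    transfer_time_s = seek_time_s / seek_fraction
    return transfer_time_s * transfer_rate_mb_s


if __name__ == "__main__":
    # Slide values: 10 ms seek, 100 MB/s transfer rate, seek = 1% of transfer.
    print(block_size_mb(0.010, 100.0, 0.01))
```

With the slide's numbers the transfer must last one second, which at 100 MB/s corresponds to a block of about 100 MB; doubling the transfer rate doubles the block size needed to keep the same ratio.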

Page 20: Day7.HDFS & Architecture

hadoop fsck / -files -blocks

- Gives information about all the files and blocks in the file system
- Replication
  - under-replicated
  - over-replicated, etc.
- Corrupt blocks, etc.

Page 21: Day7.HDFS & Architecture

File Permissions on HDFS

- The client’s identity is determined by the user name and groups of the process it runs as
- A shared file system shouldn’t be used in a hostile environment
- Going forward: Kerberos authentication

Page 22: Day7.HDFS & Architecture

Hadoop File Systems

- HDFS is just one implementation of a Hadoop file system
- org.apache.hadoop.fs.FileSystem represents a file system in Hadoop

Page 23: Day7.HDFS & Architecture

Hadoop File Systems

Page 24: Day7.HDFS & Architecture

Hadoop File Systems