Hadoop Distributed File System (HDFS) presentation, 27-5-2015

In The Name of Allah The Most Merciful The Most Gracious

• Name: Abdul Nasir Afridi
• Roll Number: 01
• Batch #10
• Subject: Advanced Database and Data Mining

Research Article 1: Performance Evaluation of Read and Write Operations in Hadoop Distributed File System

Published: 2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming

Conference Paper: IEEE Computer Society

Authors: Dr T Ragunathan et al.

Research Article
• High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing using Hadoop

• Published: 2014 International Conference on Intelligent Computing Applications

• © 2014 IEEE Conference Publishing Services

Research Article
• A Distributed Storage Model for EHR Based on HBase

• Published: © 2011 IEEE International Conference on Information Management, Innovation Management and Industrial Engineering

Research Article
H-Store: A High-Performance, Distributed Main Memory Transaction Processing System

Published: August 23-28, 2008, Auckland, New Zealand
Conference Paper: ACM 978-1-60558-306-8/08/08, Copyright 2008 VLDB Endowment

Keywords
• Hadoop Distributed File System (HDFS)
• HBase
• Electronic healthcare record (EHR)
• Distributed Storage
• Big Data
• MapReduce

What is Apache Hadoop?
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
• It is an open-source system developed by Apache in Java.
• It is designed to handle very large data sets.
• It is designed to scale to very large clusters.
• It is designed to run on commodity hardware.

Hadoop ecosystem

Hadoop History

Hadoop ecosystem
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system.
• It offers data replication.
• It offers automatic failover in the event of a crash.
• It automatically fragments storage over the cluster.
• It brings processing to the data.
• It supports very large numbers of files, into the millions.
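
The fragmentation and replication points above can be made concrete with a small sketch. It is not HDFS code: the round-robin placement, the node names (`dn1` to `dn4`) and the helper functions are assumptions made for illustration, and real HDFS placement also takes rack topology into account. The 128 MB block size and replication factor of 3 match common HDFS 2.x defaults but are configurable.

```python
def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that together cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy for illustration only; real HDFS also
    considers rack topology.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_copies(placement, failed_node):
    """Count the replicas of each block that remain after one node fails."""
    return {
        block_id: sum(1 for node in holders if node != failed_node)
        for block_id, holders in placement.items()
    }


if __name__ == "__main__":
    MB = 1024 * 1024
    blocks = split_into_blocks(300 * MB, 128 * MB)   # a 300 MB file
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(blocks)
    print(placement)
    print(surviving_copies(placement, "dn2"))        # copies left if dn2 crashes
```

With three replicas per block, losing any single node still leaves at least two copies of every block, which is what makes failover possible without data loss.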

Hadoop ecosystem
• MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop.
• MapReduce jobs are divided into two parts. The map function divides a query into multiple parts and processes data at the node level.
• The reduce function aggregates the results of the map function to determine the answer to the query.
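
The two parts described above can be sketched in plain Python as a word count. This is a single-process illustration, not the Hadoop API: the function names and the in-memory "splits" are assumptions, and the `shuffle` step stands in for the grouping that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split (one split per node), then shuffle and reduce."""
    mapped = [
        pair
        for split in splits
        for line in split
        for pair in map_phase(line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two input splits, as if stored on two different nodes.
    splits = [["big data big"], ["data hadoop"]]
    print(run_job(splits))
```

On a real cluster each split would be mapped on the node that stores it, which is the sense in which Hadoop brings processing to the data.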

Hadoop ecosystem
• Hive: Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in a SQL-like language (HiveQL), which are then converted to MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Hadoop ecosystem
• Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
• Pig, originally developed at Yahoo Research, is a high-level language for building MapReduce programs for Hadoop, thus simplifying the use of MapReduce. It is a data-flow language that provides high-level commands.

Hadoop ecosystem
• HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop.
• It adds transactional capabilities to Hadoop, allowing users to perform updates, inserts, and deletes.
• eBay and Facebook use HBase heavily.
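
The lookup, insert, update and delete operations mentioned above can be pictured with a tiny in-memory table keyed by row key. This is only a sketch of the data model, not the HBase client API: the class `TinyTable`, its methods, and the `patient#...` row keys and `info:name` column are hypothetical names chosen for illustration.

```python
import bisect


class TinyTable:
    """Hypothetical in-memory stand-in for an HBase-style table."""

    def __init__(self):
        self._keys = []   # row keys kept sorted, so range scans are cheap
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert a new cell or overwrite an existing one (insert/update)."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Look up one row by key; returns an empty dict if it is absent."""
        return dict(self._rows.get(row_key, {}))

    def delete(self, row_key):
        """Remove a whole row if it exists."""
        if row_key in self._rows:
            del self._rows[row_key]
            self._keys.pop(bisect.bisect_left(self._keys, row_key))

    def scan(self, start, stop):
        """Return rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    table = TinyTable()
    table.put("patient#002", "info:name", "B. Khan")
    table.put("patient#001", "info:name", "A. Ali")
    table.put("patient#001", "info:name", "A. Ali (updated)")   # update
    print(table.get("patient#001"))
    print(table.scan("patient#000", "patient#999"))
    table.delete("patient#001")
    print(table.get("patient#001"))
```

Keeping row keys sorted is the design choice that makes both single-key lookups and key-range scans fast, which suits record stores such as the EHR model listed among the keywords.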

Hadoop ecosystem
• Flume: Flume is a framework for populating Hadoop with data.
• Agents are deployed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.

Hadoop ecosystem
• Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as MapReduce, Pig and Hive), then intelligently links them to one another.
• Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
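
The dependency rule described above amounts to ordering jobs in a directed acyclic graph. The sketch below shows that idea with a plain topological sort; it is not Oozie's workflow format or API, and the job names (`sqoop_import`, `pig_clean`, and so on) are hypothetical.

```python
from collections import deque


def execution_order(depends_on):
    """Return a job order in which every job runs after its prerequisites.

    depends_on maps a job name to the list of jobs it must wait for.
    """
    jobs = set(depends_on)
    for prereqs in depends_on.values():
        jobs.update(prereqs)

    remaining = {job: set(depends_on.get(job, ())) for job in jobs}
    dependents = {job: [] for job in jobs}
    for job, prereqs in remaining.items():
        for prereq in prereqs:
            dependents[prereq].append(job)

    # Jobs with no unmet prerequisites can start immediately.
    ready = deque(sorted(job for job, prereqs in remaining.items() if not prereqs))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for dependent in sorted(dependents[job]):
            remaining[dependent].discard(job)
            if not remaining[dependent]:
                ready.append(dependent)

    if len(order) != len(jobs):
        raise ValueError("workflow contains a dependency cycle")
    return order


if __name__ == "__main__":
    workflow = {
        "pig_clean": ["sqoop_import"],
        "mr_sessionize": ["sqoop_import"],
        "hive_report": ["pig_clean", "mr_sessionize"],
    }
    print(execution_order(workflow))
```

Here `hive_report` is only started once both jobs it relies on have completed, and a cyclic definition is rejected instead of being scheduled.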

Hadoop ecosystem
• Whirr: Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• It supports all major virtualized infrastructure vendors on the market.

Hadoop ecosystem
• Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files.
• It is adept at parsing data and performing remote procedure calls.
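
Schema-driven serialization can be sketched as follows. The encoding of `long` values (zigzag plus variable-length bytes) and of strings (length prefix followed by UTF-8 bytes) follows the Avro binary encoding as understood here, but this is not the Avro library: representing the schema as a Python list of tuples is a simplification (real Avro schemas are JSON), only two types are handled, and the field names are hypothetical.

```python
def encode_long(n):
    """Encode a 64-bit signed integer as zigzag + variable-length bytes."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes -> small codes
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_long(buf, pos):
    """Decode one long starting at buf[pos]; return (value, next_pos)."""
    shift = 0
    z = 0
    while True:
        byte = buf[pos]
        pos += 1
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


# Hypothetical schema: ordered (field name, type) pairs.
SCHEMA = [("patient_id", "long"), ("name", "string")]


def encode_record(schema, record):
    """Write field values in schema order; field names are not stored."""
    out = bytearray()
    for field, kind in schema:
        value = record[field]
        if kind == "long":
            out += encode_long(value)
        elif kind == "string":
            data = value.encode("utf-8")
            out += encode_long(len(data)) + data
        else:
            raise ValueError(f"unsupported type: {kind}")
    return bytes(out)


def decode_record(schema, buf):
    """Read values back using the same schema that wrote them."""
    pos = 0
    record = {}
    for field, kind in schema:
        if kind == "long":
            record[field], pos = decode_long(buf, pos)
        elif kind == "string":
            length, pos = decode_long(buf, pos)
            record[field] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unsupported type: {kind}")
    return record


if __name__ == "__main__":
    payload = encode_record(SCHEMA, {"patient_id": 1, "name": "Ann"})
    print(payload)
    print(decode_record(SCHEMA, payload))
```

Because the bytes carry no field names, the reader needs the writer's schema to decode them, which is why the schema is stored alongside the data.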

Hadoop ecosystem
• Mahout: Mahout is a data-mining library.
• It takes the most popular data-mining algorithms for performing clustering, regression, and statistical modeling and implements them using the MapReduce model.

Hadoop ecosystem
• Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores such as relational databases and data warehouses into Hadoop.
• It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
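
The import described above (read a relational table, write delimited files into a target location) can be imitated on one machine. This sketch does not use Sqoop: an in-memory SQLite database stands in for the relational source, a local temporary directory stands in for the target location inside Hadoop, and the function `import_table`, the `patients` table and the stride-based split across "mappers" are assumptions for illustration.

```python
import csv
import os
import sqlite3
import tempfile


def import_table(conn, table, target_dir, num_mappers=2):
    """Copy the rows of `table` into delimited part files under target_dir.

    The table name is interpolated directly, so it must come from a trusted
    source. Rows are divided between mappers by simple striding, which is a
    simplification of how a real import tool partitions work.
    """
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY 1").fetchall()
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for mapper in range(num_mappers):
        path = os.path.join(target_dir, f"part-m-{mapper:05d}")
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows[mapper::num_mappers])
        paths.append(path)
    return paths


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)",
        [(1, "Ann"), (2, "Bo"), (3, "Cy")],
    )
    target = tempfile.mkdtemp()
    for path in import_table(conn, "patients", target):
        with open(path) as handle:
            print(path, handle.read().split())
```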

Hadoop Configuration File

Data Ingress And Egress

Joining Type Venn Diagram

Big data
Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills.

Big Data

A typical Hadoop cluster integrates MapReduce and HDFS
Master/slave architecture

Pictorial Representation of Hadoop

Physical Architecture of the Hadoop ecosystem

HDFS

MapReduce

HDFS NameNode

Scheduling
• By default
▫ Hadoop uses FIFO to schedule jobs.
▫ There is no preemption once a job is running.
• Hadoop version 2.x offers fair scheduling: assigning resources to applications such that all applications get, on average, an equal share of resources over time.
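
The difference between the two policies shows up in a toy simulation. This is not the Hadoop scheduler: the model (a fixed number of slots per tick, jobs measured in abstract work units, unique job names) and both functions are assumptions made to illustrate the idea.

```python
def _check(jobs, slots):
    """Validate the toy model's inputs."""
    if slots <= 0:
        raise ValueError("slots must be positive")
    if any(work <= 0 for _, work in jobs):
        raise ValueError("every job needs a positive amount of work")


def fifo_schedule(jobs, slots):
    """FIFO: earlier jobs take all the slots they can use before later ones.

    jobs is a list of (name, work_units); returns {name: finish_tick}.
    """
    _check(jobs, slots)
    remaining = [[name, work] for name, work in jobs]
    finish = {}
    tick = 0
    while len(finish) < len(jobs):
        tick += 1
        free = slots
        for job in remaining:            # strict submission order
            if job[1] > 0 and free > 0:
                used = min(free, job[1])
                job[1] -= used
                free -= used
                if job[1] == 0:
                    finish[job[0]] = tick
    return finish


def fair_schedule(jobs, slots):
    """Fair: each tick, slots are handed out one at a time, round-robin."""
    _check(jobs, slots)
    remaining = {name: work for name, work in jobs}
    finish = {}
    tick = 0
    while remaining:
        tick += 1
        free = slots
        while free > 0 and any(work > 0 for work in remaining.values()):
            for name in remaining:
                if free == 0:
                    break
                if remaining[name] > 0:
                    remaining[name] -= 1
                    free -= 1
        for name in [n for n, work in remaining.items() if work == 0]:
            finish[name] = tick
            del remaining[name]
    return finish


if __name__ == "__main__":
    jobs = [("long", 8), ("short", 2)]   # "long" was submitted first
    print("FIFO:", fifo_schedule(jobs, slots=2))
    print("Fair:", fair_schedule(jobs, slots=2))
```

Under FIFO the short job waits behind the long one and finishes at tick 5; under fair sharing it finishes at tick 2, while the long job is delayed by only one tick.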

Hadoop Implementation

