Hadoop Distributed File System (HDFS) presentation, 27-5-2015
TRANSCRIPT
In The Name of Allah The Most Merciful The Most Gracious
• Name: Abdul Nasir Afridi
• Roll Number: 01
• Batch: 10
• Subject: Advanced Database and Data Mining
Page-1
Research Article 1. Performance Evaluation of Read and Write Operations in Hadoop Distributed File System.
Published: 2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming
Conference Paper: IEEE Computer Society
Authors: Dr T Ragunathan et al.
7B-2
Research Article
• High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing using Hadoop
• Published: 2014 International Conference on Intelligent Computing Applications
• © 2014 IEEE Conference Publishing Services
7B-3
Research Article
• A Distributed Storage Model for EHR Based on HBase
• Published: © 2011 IEEE International Conference on Information Management, Innovation Management and Industrial Engineering
7B-4
Research Article
7B-5
H-Store: A High-Performance, Distributed Main Memory Transaction Processing System
Published: August 23-28, 2008, Auckland, New Zealand
Conference Paper: ACM 978-1-60558-306-8/08/08
Copyright 2008 VLDB Endowment
Keywords:
• Hadoop Distributed File System (HDFS)
• HBase
• Electronic healthcare record (EHR)
• Distributed Storage
• Big Data
• MapReduce
7B-6
What is Apache Hadoop?
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
• It is an open-source system developed by Apache in Java.
• It is designed to handle very large data sets.
• It is designed to scale to very large clusters.
• It is designed to run on commodity hardware.
7B-7
Hadoop ecosystem
7B-8
Hadoop History
7B-9
Hadoop Ecosystem
7B-10
Hadoop ecosystem
7B-11
Hadoop ecosystem
• Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system.
• It offers data replication.
• It offers automatic failover in the event of a crash.
• It automatically fragments storage over the cluster.
• It brings processing to the data.
• It supports large numbers of files, into the millions.
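The fragmentation and replication bullets above can be sketched in a few lines. This is a toy model, not HDFS code: the function names, the node names and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware). A 128 MB block size and a replication factor of 3 are used as illustrative values.

```python
def place_blocks(file_size_mb, block_size_mb, nodes, replication=3):
    """Split a file into fixed-size blocks and pick `replication` distinct
    nodes for each block (simple round-robin, an illustrative assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def replicas_after_failure(placement, failed_node):
    """Return the replicas of every block that survive one node crashing."""
    return {
        block: [n for n in replicas if n != failed_node]
        for block, replicas in placement.items()
    }


placement = place_blocks(300, 128, ["n1", "n2", "n3", "n4"])
survivors = replicas_after_failure(placement, "n2")
```

With four nodes and three replicas per block, losing any single node still leaves at least two copies of every block, which is the property that makes failover possible.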
7B-12
Hadoop ecosystem
• MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop.
• MapReduce jobs are divided into two parts. The map function divides a query into multiple parts and processes data at the node level.
• The reduce function aggregates the results of the map function to determine the answer to the query.
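As a minimal sketch of the map and reduce split described above, the classic word count can be written in plain Python. Nothing here uses the Hadoop API; `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names, and each input string stands in for the data held on one node.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big", "data cluster"])
```

In a real cluster each split would be mapped on the node that stores it, and only the grouped intermediate pairs would travel over the network to the reducers.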
7B-13
Hadoop ecosystem
• Hive: Hive is a Hadoop-based data warehouse developed by Facebook. It allows users to write queries in an SQL-like language, which are then converted to map-reduce jobs. This allows SQL programmers with no map-reduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
7B-14
Hadoop ecosystem
• Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
• Pig, originally developed at Yahoo Research, is a high-level language for building map-reduce programs for Hadoop, thus simplifying the use of map-reduce. It is a data flow language that provides high-level commands.
7B-15
Hadoop ecosystem
7B-16
Hadoop ecosystem
• HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop.
• It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes.
• eBay and Facebook use HBase heavily.
7B-17
Hadoop ecosystem
• Flume: Flume is a framework for populating Hadoop with data.
• Agents are deployed throughout one's IT infrastructure (inside web servers, application servers, and mobile devices, for example) to collect data and integrate it into Hadoop.
7B-18
Hadoop ecosystem
• Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages (such as MapReduce, Pig and Hive) and then intelligently links them to one another.
• Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs on which it relies for data are completed.
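The dependency rule above (a job starts only after the jobs it relies on have finished) amounts to a topological ordering of the workflow. The sketch below uses Python's standard `graphlib` module (Python 3.9+) rather than Oozie itself, whose workflows are written as XML definitions; the job names and the `run_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return one valid execution order for a workflow.

    `dependencies` maps each job to the set of jobs that must finish first.
    A cycle in the workflow raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical workflow: the report query needs both upstream jobs.
workflow = {
    "ingest_logs": set(),
    "clean_logs": {"ingest_logs"},
    "load_users": set(),
    "report_query": {"clean_logs", "load_users"},
}
order = run_order(workflow)
```

Several orders can be valid for the same workflow; the only guarantee is that every job appears after all of its prerequisites.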
7B-19
Hadoop ecosystem
• Whirr: Whirr is a set of libraries that allows users to easily spin up Hadoop clusters on top of Amazon EC2, Rackspace, or any virtual infrastructure.
• It supports all major virtualized infrastructure vendors on the market.
7B-20
Hadoop ecosystem
• Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files.
• It is adept at parsing data and performing remote procedure calls.
7B-21
Hadoop ecosystem
• Mahout: Mahout is a data-mining library.
• It takes the most popular data-mining algorithms for performing clustering, regression, and statistical modeling and implements them using the map-reduce model.
7B-22
7B-23
Hadoop ecosystem
• Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores, such as relational databases and data warehouses, into Hadoop.
• It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata, or other relational databases to the target.
7B-24
Hadoop Configuration File
7B-25
Data Ingress And Egress
7B-26
Joining Type Venn Diagram
7B-27
Big data
• Big data is being generated by everything around us at all times.
• Every digital process and social media exchange produces it.
• Systems, sensors and mobile devices transmit it.
• Big data is arriving from multiple sources at an alarming velocity, volume and variety.
• To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills.
7B-28
Big Data
7B-29
A typical Hadoop cluster integrates MapReduce and HDFS
Master/slave architecture
7B-30
Pictorial Representation Hadoop
7B-31
Physical Architecture of Hadoop ecosystem
7B-32
HDFS
7B-33
MapReduce
7B-34
HDFS Namenode
7B-35
Scheduling
• By default:
▫ Hadoop uses FIFO to schedule jobs.
▫ No preemption once a job is running.
• In Hadoop version 2.x, fair scheduling is available: it assigns resources to applications such that all applications get, on average, an equal share of resources over time.
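To see what the two policies mean for job completion times, here is a small simulation. It is a simplified model and not Hadoop scheduler code: work is measured in abstract slot-time units, the cluster has a fixed number of slots, and the function and job names are hypothetical.

```python
def fifo_finish_times(jobs, slots):
    """FIFO without preemption: each job gets every slot until it is done.

    `jobs` maps job name to work units; dict order is arrival order.
    """
    clock = 0.0
    finish = {}
    for name, work in jobs.items():
        clock += work / slots
        finish[name] = clock
    return finish


def fair_finish_times(jobs, slots):
    """Fair sharing: slots are split equally among all unfinished jobs."""
    remaining = dict(jobs)
    clock = 0.0
    finish = {}
    while remaining:
        share = slots / len(remaining)
        # Advance to the moment the smallest remaining job completes.
        step = min(remaining.values()) / share
        clock += step
        for name in list(remaining):
            remaining[name] -= step * share
            if remaining[name] <= 1e-9:
                finish[name] = clock
                del remaining[name]
    return finish


jobs = {"long": 100, "short": 10}
fifo = fifo_finish_times(jobs, slots=10)
fair = fair_finish_times(jobs, slots=10)
```

In this example the short job waits behind the long one under FIFO and finishes at time 11, while under fair sharing it finishes at time 2 and the long job still completes at time 11.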
7B-36
Hadoop Implementation
7B-37
References
• The Ministry of Health of P. R. China. Health records infrastructure and data standards. [CP/OL]. [2009-05]. http://www.moh.gov.cn/publicfiles/business/cmsresources/mohbgt/cmsrsdocument/doc4359.doc
• Jonathan R. Owens. Hadoop Real-World Solutions Cookbook. Copyright © 2013 Packt Publishing.
7B-38
References
• HDFS: Architecture [OL]. http://hadoop.apache.org/
• Terabyte sort [OL]. http://sortbenchmark.org/
• T. White. Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, June 5, 2009.
• Mahesh, Bharath, Keerthivasan. "Review of Distributed File Systems: Concepts and Case Studies". ECE 677 Distributed Computing Systems, Fall 2010.
• Jeff Markham. Apache Hadoop™ YARN. Addison-Wesley Press, 2014.
7B-39
References
• Eric Sammer. Hadoop Operations. Copyright © 2012, published by O'Reilly Media.
• Kevin Sitto and Marshall Presser. Field Guide to Hadoop. Copyright © 2015, published by O'Reilly Media.
• John Wiley & Sons. NoSQL For Dummies®. New Jersey. Media and software compilation copyright © 2015.
7B-40