
Seminar Report
titled
HADOOP & HDFS

Submitted by
Mr. Indrajit Gohokar (7th/B)
Roll No: 132
Oct, 2013-14

Department of Computer Technology
YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, Nagpur
(An Autonomous Institution Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University)


YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, NAGPUR
(An Autonomous Institution Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University)
Department of Computer Technology
(2013-14)

Certificate

This is to certify that the Seminar Report titled "Hadoop & HDFS", submitted towards the partial fulfillment of the requirement of the seminar course in VII Semester, B.E. (Computer Technology), degree awarded by Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur,

Submitted by:
Mr. Indrajit Dilip Gohokar (Roll No: 132)

is approved.

Seminar Guide: Mr. P. DHAVAN
Seminar Coordinator: Mrs. P. DESHKAR
Head, Department of Computer Technology: Mr. A. R. PATIL BHAGAT

Date: Oct 2013
Place: Nagpur

Abstract

Nowadays we encounter huge amounts of data, be it from Facebook or Twitter. This huge amount of data that is generated every day is known as Big Data. Due to the vastness of the data, we have to find a way to analyse it in order to make any sense out of it. This analysis can be done using Hadoop. Apache Hadoop is an open-source software project that enables the distributed processing of large data sets across clusters of commodity servers.

This report focuses on the understanding of Apache Hadoop and HDFS. The Hadoop Distributed File System (HDFS), a paradigm of Hadoop, is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.


Table of Contents

1.0 Introduction
2.0 Background Knowledge
    2.0.1 Big Data around the world
    2.0.2 Use of Cluster Architecture for parallel processing
3.0 Everything to know about Big Data
    3.0.1 What is Big Data?
    3.0.2 What does Big Data look like?
    3.0.3 The value of Big Data
3.1 Apache Hadoop with HDFS & some implementations
    3.1.1 Hadoop Distributed File System (HDFS)
        3.1.1.1 Features of HDFS
        3.1.1.2 Architecture of HDFS
        3.1.1.3 Filesystem Namespace
        3.1.1.4 Data Organization and Replication
    3.1.2 Some Implementations of Hadoop
        3.1.2.1 The Oracle Implementation
        3.1.2.2 The Dell Implementation
4.0 Advantages and Limitations
    4.0.1 Advantages
    4.0.2 Limitations
5.0 Applications of Hadoop
6.0 Future Scope
7.0 Conclusion
References


List of Figures

Fig 1: Overview of Cluster Architecture
Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
Fig 3: Value of Big Data – Wipro infographic
Fig 4: HDFS Architecture
Fig 5: Feeding Hadoop Data to the Oracle Database
Fig 6: Oracle Grid Engine
Fig 7: Dell Hadoop design
Fig 8: Dell network implementation


1.0 Introduction

Big Data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. Assuming that the volumes of data are larger than those conventional relational database infrastructures can cope with, processing options break down broadly into massively parallel processing architectures: data warehouses or databases such as Apache Hadoop-based solutions.

Hadoop provides a distributed file system (HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.

So Hadoop definitely plays an important role in managing and making sense of Big Data.


2.0 Background Knowledge

2.0.1 Big Data around the world

The rise of the internet and Web 2.0 has resulted not only in an enormous increase in the amount of data created, but also in the type of data. On the other hand, data is also collected, organized, stored, managed, and, most importantly, analyzed to enable and to accelerate discoveries in science.

Examples of scientific Big Data include nuclear research data, where CERN, the European Organization for Nuclear Research, is a major contributor, and all of the data reported on the generation and consumption of all forms of energy on a global scale, where Smart Grids are a tremendous source of data obtained from 350 billion annual meter readings [2]. The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, will generate 60 terabytes of data per day, or 15 petabytes (15 million gigabytes) annually [1].

In the age of Web 2.0, 12 terabytes of Tweets are created each day [2] and 100 terabytes of data is uploaded daily to Facebook [3].

Examples of Big Data in the private sector include the generation of 2.5 petabytes of data in an hour by 1 million customer transactions at Walmart [3].

Thus Big Data is everywhere and the age of Big Data is upon us. Successfully exploiting the value in Big Data requires experimentation and exploration.

2.0.2 Use of Cluster Architecture for parallel processing


To effectively harness the power of Big Data, we need an architecture that is distributed and supports parallel processing. It should cater to three needs, namely:

1. Volume – it should be able to handle the extensive volume of Big Data.

2. Speed – it should be able to process and analyze data as fast as possible.

3. Cost – all this should be achieved at minimum cost.

A computer cluster (or cluster architecture) consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. Each node has its own cores, memory and disks.

Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.

The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as, by and large, one cohesive computing unit.


Fig 1: Overview of Cluster Architecture

Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability. [4]

Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany

The advantages of cluster architecture are:

• Modular and Scalable – it is easier to expand the system without bringing down the application that runs on top of the cluster.


• Data Locality – data can be processed by the cores collocated in the same node or rack, minimizing any transfer over the network.

• Parallelization – a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors.

• Less cost – built on the principle of commodity hardware, which is to have more low-performance, low-cost hardware working in parallel (scalar computing) rather than fewer pieces of high-performance, high-cost hardware.

However, managing a cluster has a few overheads, which include:

• Complexity – the cost of administering a cluster of N machines significantly increases the complexity of using the cluster.

• More Storage – as data is replicated to protect against failure, cluster architecture requires more storage capacity.

• Data Distribution and Task Scheduling – when a large multi-user cluster needs to access very large amounts of data, task scheduling and data distribution become a challenge. Moreover, given that in a complex application environment the performance of each job depends on the characteristics of the underlying cluster, mapping tasks onto CPU cores and GPU devices poses significant challenges. [5]

• Careful Management and Need for Massively Parallel Processing Design – automatic parallelization of programs continues to remain a technical challenge. The development and debugging of parallel programs on a cluster requires parallel language primitives as well as suitable tools.

3.0 Everything to know about Big Data

3.0.1 What is Big Data?

In information technology, "Big Data refers to the datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze [8]."

O'Reilly defines Big Data the following way: "Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures." [6]

3.0.2 What does Big Data look like?

As a catch-all term, "Big Data" can be pretty nebulous. Input data to Big Data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data; the list goes on. Are these all really the same thing?

To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of Big Data. They are a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit it. Most probably you will contend with each of the Vs to one degree or another.

1. Volume: Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. This volume presents the most immediate challenge to conventional IT structures. It calls for scalable storage and a distributed approach to querying. Many companies already have large amounts of archived data, perhaps in the form of logs, but not the capacity to process it.

2. Velocity: The importance of data's velocity, the increasing rate at which data flows into an organization, has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain a competitive advantage. It is not just the velocity of the incoming data that is the issue: it is possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision. [6]

3. Variety: Rarely does data present itself in a form perfectly ordered and ready for processing. A common theme in Big Data systems is that the source data is diverse and doesn't fall into neat relational structures. It could be text from social networks, image data, or a raw feed directly from a sensor source. None of these things come ready for integration into an application. Even on the web, where computer-to-computer communication ought to bring some guarantees, the reality of data is messy. A common use of Big Data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application.

3.0.3 The value of Big Data

Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. The value of Big Data is increasing day by day. This infographic by Wipro shows the value of Big Data in some sectors [7].


Fig 3: Value of Big Data – Wipro infographic

Until now, there was no practical way to harvest this opportunity. Today, managing Big Data can be done effectively and with ease thanks to the emergence of state-of-the-art solutions like Apache Hadoop.


3.1 Apache Hadoop with HDFS & some implementations

Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. [8]

It is an open-source Apache project initiated and led by Yahoo!. It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and HDFS, as well as a number of related projects, including Apache Hive, Apache HBase, and others.

Hadoop was invented by Doug Cutting and funded by Yahoo! in 2006, and reached its "web scale capacity" in 2008 [9].

The Hadoop Distributed File System (HDFS) is one of the paradigms of the Hadoop framework, the other being MapReduce, which is illustrated by the word-count sketch below.
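To make the MapReduce paradigm concrete, a minimal word-count job is sketched below in Java. This is an illustrative sketch only, not code from the cited sources: it assumes the standard org.apache.hadoop.mapreduce API and takes its HDFS input and output directories from the command line. The mapper emits a (word, 1) pair for every token, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: for every input line, emit (word, 1) for each token.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sum the counts emitted for each distinct word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Once packaged into a jar, such a job would typically be launched with the hadoop jar command against directories that already exist in HDFS; the framework schedules the map tasks close to the data blocks they read.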

3.1.1 Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.


3.1.1.1 Features of HDFS [10]

• Highly fault-tolerant: Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

• Suitable for applications with large data sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scales to thousands of nodes in a single cluster. It should support tens of millions of files in a single instance.

• Streaming access to file system data: Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access.

• Portability across heterogeneous hardware and software platforms: HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.

3.1.1.2 Architecture of HDFS


HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by HDFS clients. There are a number of DataNodes, usually one per node in the cluster. The DataNodes manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored on DataNodes.

NameNode: Keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, it gets the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the file system as a checkpoint. Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash [11].

DataNodes: A DataNode stores data in files in its local file system. The DataNode has no knowledge about the HDFS file system. It stores each block of HDFS data in a separate file. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. When the file system starts up, each DataNode generates a list of all its HDFS blocks and sends this report to the NameNode: the Blockreport [11].
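The division of labour between the NameNode (metadata) and the DataNodes (block data) is hidden behind Hadoop's FileSystem API. The short Java sketch below is illustrative only; the NameNode address, property name and file path are assumptions rather than values from this report. It writes a small file into HDFS and reads it back: the create and open calls consult the NameNode, while the bytes themselves are streamed to and from DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally this is picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks, the bytes go to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
          out.writeUTF("Hello HDFS");
        }

        // Read: block locations come from the NameNode, data from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
          IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
      }
    }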


Fig 4: HDFS Architecture

3.1.1.3 Filesystem Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file. The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode [11].
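A sketch of these namespace operations through the FileSystem API is given below (illustrative only; the directory and file names are invented for the example). Creating a directory, renaming a file and changing a file's replication factor are all metadata changes recorded by the NameNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/reports"));                  // create a directory
        fs.rename(new Path("/user/demo/raw.log"),                   // move/rename a file
                  new Path("/user/demo/reports/raw.log"));
        fs.setReplication(new Path("/user/demo/reports/raw.log"),   // per-file replication factor
                          (short) 3);

        fs.close();
      }
    }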

3.1.1.4 Data Organization and Replication [10]

HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and, if possible, each chunk will reside on a different DataNode.

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly [11].
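Because block size and replication factor are per-file settings, they can also be supplied when a file is created. The sketch below is illustrative; the path, the replication factor of 2 and the 128 MB block size are arbitrary assumptions. It uses the FileSystem.create overload that takes both values explicitly; where each block's replicas are placed is still decided by the NameNode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileBlockSettings {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user/demo/large-output.dat");   // hypothetical file
        short replication = 2;                                  // replicas for this file only
        long blockSize = 128L * 1024 * 1024;                    // 128 MB blocks instead of the default

        try (FSDataOutputStream out =
                 fs.create(file, true /* overwrite */, 4096 /* buffer */, replication, blockSize)) {
          out.write(new byte[]{1, 2, 3});   // replica placement is handled by the NameNode
        }

        fs.close();
      }
    }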

3.1.2 Some Implementations of Hadoop

Some of the big corporations around the world have implemented Hadoop for their own use. These corporations include Yahoo!, IBM, Facebook, Google, Dell, Oracle, Cloudera and so on. These major corporations have clearly understood the might of Hadoop.

3.1.2.1 The Oracle Implementation

1) Feeding Hadoop Data to the Database for Further Analysis [12]

External tables present data stored in a file system in a table format, and can be used in SQL queries transparently. Hadoop data stored in HDFS can be accessed from inside the Oracle Database by using external tables through the use of FUSE (Filesystem in Userspace). Using external tables makes it easier for non-programmers to work with Hadoop data from inside an Oracle Database.


Fig 5: Feeding Hadoop Data to the Oracle Database

2) Drive Enterprise-Level Operational Efficiency of Hadoop Infrastructure with Oracle Grid Engine [12]

All the computing resources allocated to a Hadoop cluster are used exclusively by Hadoop, which can result in underutilized resources when Hadoop is not running. Oracle Grid Engine enables a cluster of computers to run Hadoop alongside other data-oriented compute application models. The benefit of this approach is that you don't have to maintain a dedicated cluster for running only Hadoop applications.

Fig 6: Oracle Grid Engine


3.1.2.2 The Dell Implementation

1) Hadoop design implementation [13]

The representation is broken down into the Hadoop use cases such as Compute, Storage, and Database workloads. Each workload has specific characteristics for operations, deployment, architecture, and management.

Fig 7: Dell Hadoop design

2) Network implementation [13]

Top-of-rack (ToR) switches in a network architecture connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment. Hadoop networks should utilize inexpensive components that are employed in a way that maximizes performance for DataNode communication.


Fig 8: Dell network implementation

3) Performance Benchmarks [13]

Within the Hadoop software ecosystem, several benchmark tools are included that can be used for these comparisons (a driver sketch follows the list below).

1. TeraGen: Utilizes the parallel framework within Hadoop to quickly create large data sets that can be manipulated.

2. TeraSort: Reads the data created by TeraGen into the system's physical memory, then sorts it and writes it back out to HDFS.

3. TeraValidate: Ensures that the data produced by TeraSort is accurate, without any errors.
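These benchmarks are normally launched from the Hadoop examples jar on the command line; the Java sketch below drives the same three steps through ToolRunner instead. It is a hedged illustration only: it assumes the TeraGen, TeraSort and TeraValidate classes from the org.apache.hadoop.examples.terasort package are available on the classpath, and the row count and HDFS paths are arbitrary.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.examples.terasort.TeraGen;
    import org.apache.hadoop.examples.terasort.TeraSort;
    import org.apache.hadoop.examples.terasort.TeraValidate;
    import org.apache.hadoop.util.ToolRunner;

    public class TeraBenchmarkDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // 1. TeraGen: generate 10 million rows of synthetic data in parallel.
        ToolRunner.run(conf, new TeraGen(), new String[]{"10000000", "/bench/teragen"});

        // 2. TeraSort: sort the generated data and write it back to HDFS.
        ToolRunner.run(conf, new TeraSort(), new String[]{"/bench/teragen", "/bench/terasort"});

        // 3. TeraValidate: check that the sorted output is globally ordered and complete.
        ToolRunner.run(conf, new TeraValidate(), new String[]{"/bench/terasort", "/bench/teravalidate"});
      }
    }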


4.0 Advantages and Limitations

4.0.1 Advantages:

• Distributes data and computation; the data locality principle avoids network overload.

• Tasks are independent [14].

• It is therefore easy to handle partial failures: entire nodes can fail and restart without shutting down the entire system.

• This avoids the crawling horrors of failure-tolerant synchronous distributed systems.

• Linear scaling in the ideal case.

• Designed for cheap, commodity hardware.

• Simple programming model: the "end-user" programmer only writes MapReduce tasks, and the overhead of programming other tasks is reduced.

• Flat scalability: One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve. A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. After a Hadoop program is written and functioning on ten nodes, very little, if any, work is required for that same program to run on a much larger amount of hardware. [14]

4.0.2 Limitations:

• The programming model is very restrictive; the lack of central data can be frustrating.

• Still rough: the software is under active development; for example, HDFS only recently added support for append operations. [14]

• "Joins" of multiple datasets are tricky and slow; often, the entire dataset gets copied in the process.

• Cluster management is hard (debugging, distributing software, collecting logs and so on).

• There is still a single master, which requires care and may limit scaling.

• Managing job flow isn't trivial when intermediate data should be kept.

• Multiple copies of already big data are created. [15]

• Limited SQL support. [15]

• Inefficient execution, as HDFS has no notion of a query optimizer. [15]

• Lack of the skills required for handling Hadoop. [15]


5.0 Applications of Hadoop

Hadoop is the leading choice of companies when it comes to managing Big Data. Some of the major applications and companies that use Hadoop are [16]:

• IBM offers InfoSphere BigInsights, based on Hadoop.

• Facebook uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning. Currently Facebook has two major clusters.

• eBay makes heavy use of Java MapReduce for search optimization and research.

• Yahoo! has more than 100,000 CPUs in over 40,000 computers running Hadoop; its biggest cluster has 4,500 nodes (2 x 4-CPU boxes with 4 x 1 TB disks and 16 GB RAM).

• Other companies include Adobe, Amazon, The New York Times, Hewlett-Packard, and the list keeps on increasing.

6.0 Future Scope

Hadoop is growing really fast, and so is its use in industries that are gearing up for the Big Data challenge. The future of Hadoop definitely involves [9]:

• Better scheduling among the nodes for better allocation and control of resources.

• Splitting the core into sub-projects like Apache Hive, Apache Cassandra, Pig, etc.

• An improved API for developers in Java.

• HDFS security.

• Upgrading to Hadoop 1.0.


7.0 Conclusion

Apache Hadoop is 100% open source, and it pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big.

This report focuses on the various properties, challenges and opportunities of Big Data. It also discusses how we can use Apache Hadoop to manage Big Data and extract value out of it. A detailed discussion has been presented on Hadoop and its underlying technology. So, in today's hyper-connected world where more and more data is being created every day, the need for Hadoop is no longer a question. The only question now is how best to take advantage of it.

References

[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8: December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf

[2] What is Big Data, Bringing Big Data to the enterprise [Online]. Available: http://www-01.ibm.com/software/data/bigdata/

[3] A Comprehensive List of Big Data Statistics [Online]. Available: http://wikibon.org/blog/big-data-statistics/

[4] D. A. Bader and R. Pennington, "Cluster Computing: Applications", The International Journal of High Performance Computing, 15(2):181-185, May 2001. Available: http://www.cc.gatech.edu/~bader/papers/ijhpca.pdf

[5] K. Shirahata, "Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters", in: Cloud Computing Technology and Science (CloudCom), Nov. 30 - Dec. 3, 2010, pages 733-740 [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5708524

[6] What Is Big Data?, O'Reilly Radar, January 11, 2012 [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html

[7] Big Data, Wipro [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data

[8] Hadoop at Yahoo!, Yahoo! Developer Network [Online]. Available: http://developer.yahoo.com/hadoop/

[9] Owen O'Malley, "Introduction to Hadoop" [Online]. Available: http://wiki.apache.org/hadoop/

[10] HDFS Architecture, Hadoop 0.20 Documentation [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html

[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". Available: http://storageconference.org/2010/Papers/MSST/Shvachko.pdf

[12] Oracle and Hadoop Overview [Online]. Available: http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf

[13] Introduction to Hadoop - Dell [Online]. Available: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf

[14] Yahoo! Hadoop Tutorial, Yahoo! Developer Network (YDN) [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module1.html#comparison

[15] Hadoop's Limitations for Big Data Analytics, ParAccel Inc. [Online]. Available: http://www.paraccel.com/resources/Whitepapers/Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf

[16] Hadoop Wiki, Powered By [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy#I