7/27/2019 Hadoop & HDFS Final
http://slidepdf.com/reader/full/hadoop-hdfs-final 1/31
Seminar Report titled
HADOOP & HDFS
Submitted by
Mr. Indrajit Gohokar, 7th Semester / B
Roll No: 132
Oct, 2013-14
Department of Computer Technology
YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, Nagpur
(An Autonomous Institution Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University)
YESHWANTRAO CHAVAN COLLEGE OF ENGINEERING, NAGPUR
(An Autonomous Institution Affiliated to Rashtrasant Tukadoji Maharaj Nagpur University)
Department of Computer Technology
(2013-14)
Certificate
This is to certify that the Seminar Report titled "Hadoop & HDFS", submitted by Mr. Indrajit Dilip Gohokar (Roll No: 132) towards the partial fulfillment of the requirement of the seminar course in VII Semester, B.E. (Computer Technology), degree awarded by Rashtrasant Tukadoji Maharaj Nagpur University, Nagpur, is approved.
Seminar Guide
Mr. P. DHAVAN
Seminar Coordinator
Mrs. P. DESHKAR
Head, Department of Computer Technology
Mr. A. R. PATIL BHAGAT
Date: Oct 2013    Place: Nagpur
Abstract
Nowadays we encounter huge amounts of data, be it from Facebook or Twitter. The huge amount of data that is generated every day is known as Big Data. Due to the vastness of this data, we have to find a way to analyse it in order to make any sense out of it. This analysis can be done using Hadoop. Apache Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers.
This report focuses on the understanding of Apache Hadoop and the HDFS. The Hadoop Distributed File System (HDFS), a core component of Hadoop, is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size.
Table of Contents

1.0 Introduction
2.0 Background Knowledge
    2.0.1 Big Data around the world
    2.0.2 Use of Cluster Architecture for parallel processing
3.0 Everything to know about Big Data
    3.0.1 What is Big Data?
    3.0.2 What does Big Data look like?
    3.0.3 The value of Big Data
3.1 Apache Hadoop with HDFS & some implementations
    3.1.1 Hadoop Distributed File System (HDFS)
        3.1.1.1 Features of HDFS
        3.1.1.2 Architecture of HDFS
        3.1.1.3 Filesystem Namespace
        3.1.1.4 Data Organization and Replication
    3.1.2 Some implementations of Hadoop
        3.1.2.1 The Oracle implementation
        3.1.2.2 The Dell implementation
4.0 Advantages and Limitations
    4.0.1 Advantages
    4.0.2 Limitations
5.0 Applications
6.0 Future Scope
7.0 Conclusion
References
List of Figures

Fig 1: Overview of Cluster Architecture
Fig 2: Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
Fig 3: Value of Big Data – Wipro infographic
Fig 4: HDFS Architecture
Fig 5: Feeding Hadoop Data to the Oracle Database
Fig 6: Oracle Grid Engine
Fig 7: Dell Hadoop design
Fig 8: Dell network implementation
1.0 Introduction
Big Data has become viable as cost-effective approaches have emerged to tame the volume, velocity and variability of massive data. When the volumes of data are larger than conventional relational database infrastructures can cope with, processing options broadly fall into massively parallel processing architectures, such as data warehouses or Apache Hadoop-based solutions.
Hadoop provides a distributed file system (HDFS) and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth simply by adding commodity servers.
So Hadoop definitely plays an important role in managing and making sense of Big Data.
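As an illustration of the MapReduce paradigm mentioned above, here is a minimal Python sketch of a word count expressed as map and reduce phases. This simulates the model on one machine; in a real Hadoop job these phases would run in parallel across the cluster, and the documents below are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle/reduce: group pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data needs big tools", "hadoop processes big data"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(all_pairs)
print(word_counts["big"])  # "big" appears three times across the documents
```

Because each map call depends only on its own input split, the map phase can run on thousands of hosts at once, which is exactly the partitioning of computation described above.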
2.0 Background Knowledge
2.0.1 Big Data around the world
The rise of the internet and Web 2.0 has resulted not only in an enormous increase in the amount of data created, but also in the variety of that data. At the same time, data is also collected, organized, stored, managed, and, most importantly, analyzed to enable and accelerate discoveries in science.
Examples of scientific Big Data include nuclear research data, where CERN, the European Organization for Nuclear Research, is a major contributor, and the data reported on the generation and consumption of all forms of energy on a global scale, where Smart Grids are a tremendous source, obtained from 350 billion annual meter readings [2]. The Large Hadron Collider (LHC), a particle accelerator that will revolutionize our understanding of the workings of the Universe, generates 60 terabytes of data per day, or 15 petabytes (15 million gigabytes) annually [1].
In the age of Web 2.0, 12 terabytes of Tweets are created each day [2] and 100 terabytes of data are uploaded daily to Facebook [3]. Examples of Big Data in the private sector include the generation of 2.5 petabytes of data in an hour from 1 million customer transactions by Walmart [3].
Thus Big Data is everywhere and the age of Big Data is upon us. Successfully exploiting the value in Big Data requires experimentation and exploration.
2.0.2 Use of Cluster Architecture for parallel
processing:
To effectively harness the power of Big Data, we need an architecture that is distributed and supports parallel processing. It should cater to three needs, namely:
1. Volume – It should be able to handle the extensive volume of Big Data.
2. Speed – It should be able to process and analyze data as fast as possible.
3. Cost – All this should be achieved at minimum cost.
A computer cluster (or Cluster Architecture) consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system.
The components of a cluster are usually connected to each other
through fast local area networks, each node (computer used as a
server) running its own instance of an operating system. Each
node consists of its own cores, memory and disks.
Computer clusters emerged as a result of convergence of a
number of computing trends including the availability of low
cost microprocessors, high speed networks, and software for
high performance distributed computing.
The activities of the computing nodes are orchestrated by
"clustering middleware", a software layer that sits atop the
nodes and allows the users to treat the cluster as by and large
one cohesive computing unit.
Fig 1: Overview of Cluster Architecture
Clusters are usually deployed to improve performance and
availability over that of a single computer, while typically being
much more cost-effective than single computers of comparable
speed or availability.[4]
Fig 2: Technicians working on a large Linux cluster at the Chemnitz
University of Technology, Germany
The advantages of cluster architecture are –
• Modular and Scalable - easier to expand the system
without bringing down the application that runs on top
of the cluster.
• Data Locality – data can be processed by the cores collocated in the same node or rack, minimizing transfer over the network.
• Parallelization - higher degree of parallelism via the
simultaneous execution of separate portions of a
program on different processors.
• Less cost – built on the principle of commodity hardware, which is to have more low-performance, low-cost hardware working in parallel (scale-out computing) rather than fewer high-performance, high-cost machines.
However, managing a cluster has a few overheads, which include –
• Complexity – the cost of administering a cluster of N machines significantly increases the complexity of using the cluster.
• More Storage – as data is replicated to protect against failures, the cluster architecture requires more storage capacity.
• Data Distribution and Task Scheduling – When a
large multi-user cluster needs to access very large
amounts of data, task scheduling and Data Distribution
becomes a challenge. However, given that in a complex
application environment the performance of each job
depends on the characteristics of the underlying cluster,
mapping tasks onto CPU cores and GPU devices
provides significant challenges.[5]
• Careful Management and Need of massive parallel
processing Design - Automatic parallelization of
programs continues to remain a technical challenge. The
development and debugging of parallel programs on a
cluster requires parallel language primitives as well as
suitable tools.
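The scatter/gather pattern that clustering middleware coordinates can be sketched in a few lines of Python. This hypothetical example partitions a dataset across simulated nodes, has each node process only its own chunk (the data-locality idea), and combines the partial results; on a real cluster each chunk would live on a different machine and the node tasks would run simultaneously.

```python
def partition(data, num_nodes):
    # Split the dataset into roughly equal chunks, one per node.
    return [data[i::num_nodes] for i in range(num_nodes)]

def node_task(chunk):
    # Work each node performs locally on its own chunk (here: a partial sum).
    return sum(chunk)

def run_on_cluster(data, num_nodes):
    # The "middleware" role: scatter chunks, run node tasks, gather results.
    chunks = partition(data, num_nodes)
    partial_results = [node_task(chunk) for chunk in chunks]
    return sum(partial_results)

data = list(range(1, 101))
total = run_on_cluster(data, num_nodes=4)
print(total)  # 5050 - the same answer regardless of how many nodes share the work
```

The point of the sketch is that the final result does not depend on the number of nodes, which is what allows a cluster to grow simply by adding machines.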
3.0 Everything to know about Big Data
3.0.1 What is Big Data?
In information technology, "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze [8]."
O’Reilly defines big data the following way: “Big data is data
that exceeds the processing capacity of conventional database
systems. The data is too big, moves too fast, or doesn't fit the
strictures of your database architectures.”[6]
3.0.2 What does Big Data look like?
As a catch-all term, "Big Data" can be pretty nebulous. Input data to big data systems could be chatter from social networks, web server logs, traffic flow sensors, satellite imagery, broadcast audio streams, banking transactions, MP3s of rock music, the content of web pages, scans of government documents, GPS trails, telemetry from automobiles, financial market data; the list goes on. Are these all really the same thing?
To clarify matters, the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data. They are a helpful lens through which to view and understand the nature of the data and the software platforms available to exploit it. Most likely you will contend with each of the Vs to one degree or another.
1. Volume: Many factors contribute to the increase in data
volume – transaction-based data stored through the
years, text data constantly streaming in from social
media, increasing amounts of sensor data being
collected, etc. This volume presents the most immediate
challenge to conventional IT structures. It calls for
scalable storage, and a distributed approach to querying.
Many companies already have large amounts of
archived data, perhaps in the form of logs, but not the
capacity to process it.
2. Velocity: The importance of data's velocity, the increasing rate at which data flows into an organization, has followed a similar pattern to that of volume. Problems previously restricted to segments of industry are now presenting themselves in a much broader setting. Specialized companies such as financial traders have long turned systems that cope with fast-moving data to their advantage. Why is that so? The Internet and mobile era means that the way we deliver and consume products and services is increasingly instrumented, generating a data flow back to the provider. Those who are able to quickly utilize that information, by recommending additional purchases, for instance, gain competitive advantage. It's not just the velocity of the incoming data that's the issue: it's possible to stream fast-moving data into bulk storage for later batch processing, for example. The importance lies in the speed of the feedback loop, taking data from input through to decision.[6]
3. Variety: Rarely does data present itself in a form
perfectly ordered and ready for processing. A common
theme in big data systems is that the source data is
diverse, and doesn't fall into neat relational structures. It
could be text from social networks, image data, a raw
feed directly from a sensor source. None of these things
come ready for integration into an application. Even on
the web, where computer-to-computer communication
ought to bring some guarantees, the reality of data is
messy. A common use of big data processing is to take
unstructured data and extract ordered meaning, for
consumption either by humans or as a structured input to
an application.
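As a small illustration of "extracting ordered meaning" from unstructured input, the following Python sketch pulls structured fields out of free-form web-server-style text; the log lines and their format are invented for this example.

```python
import re

# Unstructured input: free-form log lines, as might stream in from a web server.
raw_lines = [
    '10.0.0.1 - [12/Oct/2013] "GET /index.html" 200',
    'garbage line that matches no known format',
    '10.0.0.2 - [12/Oct/2013] "GET /about.html" 404',
]

# Extract ordered (ip, date, path, status) records from lines that fit the pattern.
pattern = re.compile(r'(\S+) - \[([^\]]+)\] "GET (\S+)" (\d{3})')

records = []
for line in raw_lines:
    match = pattern.match(line)
    if match:
        ip, date, path, status = match.groups()
        records.append({"ip": ip, "date": date, "path": path, "status": int(status)})

print(len(records))  # 2 structured records recovered; the malformed line is skipped
```

The structured records can then feed an application or a report, while input that does not fit any known shape is set aside, which is the messy reality of variety in practice.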
3.0.3 The value of Big Data
Big data is more than simply a matter of size; it is an
opportunity to find insights in new and emerging types of data
and content, to make your business more agile, and to answer
questions that were previously considered beyond your reach.
The value of Big Data is increasing day by day. This infographic by Wipro shows the value of Big Data in some sectors [7].
Fig 3. Value of Big Data – Wipro infographic
Until now, there was no practical way to harvest this opportunity. Today, managing Big Data can be done effectively and with ease thanks to the emergence of state-of-the-art solutions like Apache Hadoop.
3.1 Apache Hadoop with HDFS & some implementations
Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware.[8]
It is an open source Apache project initiated and led by Yahoo!. It enables applications to work with thousands of computation-independent computers and petabytes of data. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers. The entire Apache Hadoop "platform" is now commonly considered to consist of the Hadoop kernel, MapReduce and HDFS, as well as a number of related projects, including Apache Hive, Apache HBase, and others.
Hadoop was created by Doug Cutting and funded by Yahoo! in 2006, and reached its "web scale capacity" in 2008 [9].
The HDFS, or Hadoop Distributed File System, is one of the core components of the Hadoop framework, the other being MapReduce.
3.1.1 Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
3.1.1.1 Features of HDFS [10]
• Highly fault-tolerant: Hardware failure is the norm
rather than the exception. An HDFS instance may
consist of hundreds or thousands of server machines,
each storing part of the file system’s data. The fact that
there are a huge number of components and that each
component has a non-trivial probability of failure means
that some component of HDFS is always non-functional.
Therefore, detection of faults and quick, automatic
recovery from them is a core architectural goal of
HDFS.
• Suitable for applications with large data sets:
Applications that run on HDFS have large data sets. A
typical file in HDFS is gigabytes to terabytes in size.
Thus, HDFS is tuned to support large files. It provides high aggregate data bandwidth and scales to thousands of nodes in a single cluster. It should support tens of millions of files in a single instance.
• Streaming access to file system data: Applications that
run on HDFS need streaming access to their data sets.
They are not general purpose applications that typically
run on general purpose file systems. HDFS is designed
more for batch processing rather than interactive use by
users. The emphasis is on high throughput of data access
rather than low latency of data access.
• Portability across Heterogeneous Hardware and
Software Platforms: HDFS has been designed to
be easily portable from one platform to another.
This facilitates widespread adoption of HDFS as a
platform of choice for a large set of applications.
3.1.1.2 Architecture of HDFS
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by HDFS clients. There are a number of DataNodes, usually one per node in a cluster. The DataNodes manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
NameNode: Keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up it reads the FsImage and EditLog from its local file system, updates the FsImage with the EditLog information, and then stores a copy of the FsImage on the file system as a checkpoint. Periodic checkpointing is done so that the system can recover back to the last checkpointed state in case of a crash [11].
DataNodes: A DataNode stores data in files in its local file system. The DataNode has no knowledge of the HDFS file system; it stores each block of HDFS data in a separate file. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. When the DataNode starts up it generates a list of all its HDFS blocks and sends this report, the Blockreport, to the NameNode [11].
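To make the Blockreport mechanism concrete, here is a hedged Python sketch of the bookkeeping the NameNode performs: it folds each DataNode's block report into a block-to-locations map and flags blocks that have fewer replicas than the target replication factor. The node names and block IDs are invented for illustration; the real NameNode does this in memory at much larger scale.

```python
from collections import defaultdict

REPLICATION_FACTOR = 3

# Each DataNode reports the set of block IDs it currently stores.
block_reports = {
    "datanode-1": {"blk_001", "blk_002"},
    "datanode-2": {"blk_001", "blk_003"},
    "datanode-3": {"blk_001", "blk_002", "blk_003"},
}

# NameNode-side bookkeeping: invert the reports into block -> locations.
block_map = defaultdict(set)
for node, blocks in block_reports.items():
    for block in blocks:
        block_map[block].add(node)

# Blocks with fewer replicas than the target would be re-replicated.
under_replicated = sorted(
    block for block, nodes in block_map.items()
    if len(nodes) < REPLICATION_FACTOR
)
print(under_replicated)  # blk_002 and blk_003 each have only two replicas
```

This inverted map is also what lets the NameNode answer a client's read request with the list of DataNodes that hold each block of a file.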
Fig 4: HDFS Architecture
3.1.1.3 Filesystem Namespace
HDFS supports a traditional hierarchical file organization. A
user or an application can create directories and store files
inside these directories. The file system namespace hierarchy is
similar to most other existing file systems; one can create and
remove files, move a file from one directory to another, or
rename a file. The NameNode maintains the file system
namespace. Any change to the file system namespace or its
properties is recorded by the NameNode. An application can
specify the number of replicas of a file that should be
maintained by HDFS. The number of copies of a file is called
the replication factor of that file. This information is stored by
the NameNode[11].
3.1.1.4 Data Organization and replication [10]
HDFS supports write-once-read-many semantics on files. A
typical block size used by HDFS is 64 MB. Thus, an HDFS file
is chopped up into 64 MB chunks, and if possible, each chunk
will reside on a different DataNode.
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly [11].
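The arithmetic implied above can be sketched directly: a file is chopped into fixed-size blocks (64 MB here, as described above) and each block is stored replication-factor times, so the raw storage consumed is roughly the file size times the replication factor. The 200 MB file is a hypothetical example.

```python
import math

BLOCK_SIZE_MB = 64  # default HDFS block size described above

def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    # Number of blocks the file is chopped into (the last block may be smaller).
    return math.ceil(file_size_mb / block_size_mb)

def raw_storage_mb(file_size_mb, replication_factor=3):
    # Total space consumed across DataNodes once every block is replicated.
    return file_size_mb * replication_factor

size_mb = 200  # a hypothetical 200 MB file
print(block_count(size_mb))        # 4 blocks: three 64 MB blocks plus one 8 MB block
print(raw_storage_mb(size_mb, 3))  # 600 MB of raw cluster storage
```

This is why the "More Storage" overhead noted earlier for clusters applies to HDFS: fault tolerance is bought with a multiple of the logical data size.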
3.1.2 Some Implementations of Hadoop
Some of the big corporations around the world have implemented Hadoop for their own use. These corporations include Yahoo!, IBM, Facebook, Google, Dell, Oracle, Cloudera and so on. These major corporations have clearly understood the might of Hadoop.
3.1.2.1 The Oracle Implementation
1) Feeding Hadoop Data to the Database for Further Analysis [12]
External tables present data stored in a file system in a table format, and can be used in SQL queries transparently. Hadoop data stored in HDFS can be accessed from inside the Oracle Database by using external tables through FUSE (Filesystem in Userspace). Using external tables makes it easier for non-programmers to work with Hadoop data from inside an Oracle Database.
Fig 5. Feeding Hadoop Data to the Oracle Database
2) Drive Enterprise-Level Operational Efficiency of Hadoop Infrastructure with Oracle Grid Engine [12]
All the computing resources allocated to a Hadoop cluster are used exclusively by Hadoop, which can result in underutilized resources when Hadoop is not running. Oracle Grid Engine enables a cluster of computers to run Hadoop alongside other data-oriented compute application models. The benefit of this approach is that you don't have to maintain a dedicated cluster for running only Hadoop applications.
Fig 6. Oracle Grid Engine
3.1.2.2 The Dell Implementation
1) Hadoop design implementation [13]
The representation is broken down into the Hadoop use cases such as Compute, Storage, and Database workloads. Each workload has specific characteristics for operations, deployment, architecture, and management.
Fig 7. Dell Hadoop design
2) Network implementation [13]
Top-of-rack (ToR) switches in a network architecture connect
directly to the DataNodes and allow for all inter-node
communication within the Hadoop environment. Hadoop
networks should utilize inexpensive components that are
employed in a way that maximizes performance for DataNode
communication.
Fig 8. Dell network implementation
3) Performance Benchmarks [13]
Within the Hadoop software ecosystem, there are several
benchmark tools included that can be used for these
comparisons.
1. Teragen
Utilizes the parallel framework within Hadoop to quickly create large data sets that can be manipulated.
2. Terasort
Reads the data created by Teragen into the system's physical memory, then sorts it and writes it back out to HDFS.
3. Teravalidate
Ensures that the data produced by Terasort is accurate, without any errors.
4.0 Advantages and Limitations
4.0.1 Advantages:
• Distributed data and computation. The data locality principle avoids network overload.
• Tasks are independent [14]:
• So it's easy to handle partial failures – entire nodes can fail and restart without shutting down the entire system.
• It avoids the crawling horrors of failure-tolerant synchronous distributed systems.
• Linear scaling in the ideal case.
• Designed for cheap, commodity hardware.
• Simple programming model. The “end-user” programmer
only writes map-reduce tasks and the overhead of
programming other tasks is reduced.
• Flat Scalability: One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve. A program written in distributed frameworks other than Hadoop may require large amounts of refactoring when scaling from ten to one hundred or one thousand machines. After a Hadoop program is written and functioning on ten nodes, very little, if any, work is required for that same program to run on a much larger amount of hardware.[14]
4.0.2 Limitations:
• The programming model is very restrictive; the lack of central data can be frustrating.
• Still rough – the software is under active development; e.g. HDFS only recently added support for append operations [14].
• "Joins" of multiple datasets are tricky and slow; often, the entire dataset gets copied in the process.
• Cluster management is hard (debugging, distributing software, collecting logs, etc.).
• Still a single master, which requires care and may limit scaling.
• Managing job flow isn't trivial when intermediate data should be kept.
• Multiple copies of already big data are created. [15]
• Limited SQL support. [15]
• Inefficient execution, as HDFS has no notion of a query optimizer. [15]
• Lack of the skills necessary for handling Hadoop. [15]
5.0 Applications of Hadoop
Hadoop is the leading choice of companies when it comes to managing Big Data. Some of the major applications and companies that use Hadoop are [16]:
• IBM offers InfoSphere BigInsights, based on Hadoop.
• Facebook uses Hadoop to store copies of internal log and dimension data sources, and uses it as a source for reporting/analytics and machine learning. Currently Facebook has two major clusters.
• eBay uses:
o Heavy usage of Java MapReduce for search optimization and research.
• Yahoo! has:
o More than 100,000 CPUs in >40,000 computers running Hadoop
o Its biggest cluster: 4,500 nodes (2×4-CPU boxes with 4×1 TB disks and 16 GB RAM)
• Other companies include Adobe, Amazon, The New York Times and Hewlett-Packard, and the list keeps growing.
6.0 Future Scope
Hadoop is growing really fast, and so is its use in industries that are gearing up for the Big Data challenge. The future of Hadoop involves [9]:
• Better Scheduling among the nodes for better resource
allocation and control of resources.
• Splitting the core into sub-projects like Apache Hive, Apache Cassandra, Apache Pig, etc.
• Improved API for developers in Java.
• HDFS security.
• Upgrading to Hadoop 1.0.
7.0 Conclusion
Apache Hadoop is 100% open source, and pioneered a
fundamentally new way of storing and processing data. Instead
of relying on expensive, proprietary hardware and different
systems to store and process data, Hadoop enables distributed
parallel processing of huge amounts of data across inexpensive,
industry-standard servers that both store and process the data,
and can scale without limits. With Hadoop, no data is too big.
This report focuses on the various properties, challenges and opportunities of Big Data. It also discusses how we can use Apache Hadoop to manage Big Data and extract value out of it. A detailed discussion has been done on Hadoop and its underlying technology. In today's hyper-connected world, where more and more data is being created every day, the need for Hadoop is no longer a question. The only question now is how best to take advantage of it.
References
[1] Randal E. Bryant, Randy H. Katz, Edward D. Lazowska, "Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society", Version 8, December 22, 2008. Available: http://www.cra.org/ccc/docs/init/Big_Data.pdf
[2] What is Big Data: Bringing Big Data to the enterprise. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[3] A Comprehensive List of Big Data Statistics. [Online]. Available: http://wikibon.org/blog/big-data-statistics/
[4] D. A. Bader and R. Pennington, "Cluster Computing: Applications", The International Journal of High Performance Computing, 15(2):181-185, May 2001. Available: http://www.cc.gatech.edu/~bader/papers/ijhpca.pdf
[5] K. Shirahata, "Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters", in Cloud Computing Technology and Science (CloudCom), Nov. 30 - Dec. 3, 2010, pages 733-740. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5708524
[6] What Is Big Data? O'Reilly Radar, January 11, 2012. [Online]. Available: http://radar.oreilly.com/2012/01/what-is-big-data.html
[7] Big Data, Wipro. [Online]. Available: http://www.slideshare.net/wiprotechnologies/wipro-infographicbig-data
[8] Hadoop at Yahoo!, Yahoo! Developer Network. [Online]. Available: http://developer.yahoo.com/hadoop/
[9] Owen O'Malley, "Introduction to Hadoop". [Online]. Available: http://wiki.apache.org/hadoop/
[10] HDFS Architecture, Hadoop 0.20 Documentation. [Online]. Available: http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". Available: http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
[12] Oracle and Hadoop Overview. [Online]. Available: http://www.oracle.com/technetwork/database/bi-datawarehousing/twp-hadoop-oracle-194542.pdf
[13] Introduction to Hadoop, Dell. [Online]. Available: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Documents/hadoop-introduction.pdf
[14] Yahoo! Hadoop Tutorial, Yahoo! Developer Network (YDN). [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module1.html#comparison
[15] Hadoop's Limitations for Big Data Analytics, ParAccel Inc. [Online]. Available: http://www.paraccel.com/resources/Whitepapers/Hadoop-Limitations-for-Big-Data-ParAccel-Whitepaper.pdf
[16] Hadoop Wiki, Powered By. [Online]. Available: http://wiki.apache.org/hadoop/PoweredBy#I