Everything You Should Know About Hadoop
TRANSCRIPT
The Necessity

Have you ever imagined the amount of data that an organization like the Indian Railways would generate from its various transactions per day? And how much it would accumulate by the end of a month? And, further still, the massive amount of data that would amass by the end of a fiscal year?
Or consider the multinational telecommunication carriers that maintain data for their huge number of clients and the transactions each makes per day, which may include daily logs of calls received and made, SMSs received and sent, billing details per transaction, registration details of each customer, and so on and so forth. (And do not forget that in a country like India there are more mobile phones and mobile users than dedicated toilets, and with tariffs so cheap, the customer count is not going to dwindle in the near future!)
Quite obviously, all such enterprise endeavours generate gargantuan volumes of data, and handling it is certainly a hard nut to crack. As if this were not enough, the challenge of processing such huge chunks of data is further compounded by the fact that the hardware might fail without giving any indication whatsoever. Quite conceivably, any hardware malfunction has the potential to stymie the entire process, bringing everything to a grinding halt.
Thus, to handle and process massive data in an environment where hardware reliability is a serious question mark, the Apache Software Foundation came out with a solution addressing both issues - handling and processing massive data even in the face of hardware malfunction. They named the solution Apache Hadoop!
Introduction to Hadoop
As with most other software products that the Apache Software Foundation (ASF) has released, Apache Hadoop is an open-source software framework. It has the ingredients to handle and process large-scale data. To do so, Hadoop does not mandate supercomputers; rather, it advocates the use of clusters of everyday, commodity hardware.
Furthermore, since hardware malfunction is an accepted norm, Hadoop readily assumes at the architectural level that, while computing data, hardware might go wrong; nevertheless, all such irregularities must be detected and handled in software by the framework itself.
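The chief mechanism by which HDFS puts this assumption into practice is block replication: every block of a file is copied to several machines, so the loss of a single disk or node does not lose data. Below is a minimal, hypothetical Java sketch of a client raising the replication factor of a file it writes to HDFS; the path and the factor of 3 are illustrative assumptions, while the org.apache.hadoop.fs.FileSystem API shown is the standard HDFS client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // assumes a reachable HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/important.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("data worth keeping despite node failures");
        }

        // Ask HDFS to keep 3 copies of every block of this file, so a
        // single hardware failure cannot destroy the data.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}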
It is therefore noticeable that Hadoop does not merely take distributed computing to the next level; it also incorporates the intelligence to handle bizarre hardware malfunctions. And all of this is done on the fly!
Hadoop Framework Composition

In order to materialize these objectives, at its core the Hadoop framework is composed of four synergistically working modules. They are:
1. Hadoop Common: Officially, this module is the library of common utility support that the other modules need in order to run. Thus, Hadoop Common constitutes the very basic foundation used by all the other modules. Hadoop Common can be likened to the JRE library features of Java.
2. Hadoop Distributed File System (HDFS): HDFS is the key module for distributing the workload across a cluster of commodity hardware while maintaining very high aggregate bandwidth through the cluster of machines. The conceptualization of HDFS was derived from the specification of the Google File System (GFS). (A short sketch of reading a file through the HDFS client API follows this list.)
3. Hadoop YARN: To understand Hadoop YARN, another module (or sub-framework) in the Apache Hadoop project, we can compare it with that part of an operating system which schedules different tasks for the microprocessor and, at the same time, manages the resources the microprocessor requires to discharge the job efficiently. Thus, Hadoop YARN schedules tasks and manages resources for the various commodity machines in the cluster.
4. Hadoop MapReduce: Remember we said that Hadoop can very efficiently manage large data sets by virtue of its prudent usage of commodity hardware; well, MapReduce, a YARN-based programming model for parallel processing, is exactly what materializes this! Like HDFS, MapReduce is also derived from Google's MapReduce specification!
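As promised in item 2 above, here is a minimal, hypothetical sketch of what a client reading a file from HDFS looks like; the path is an illustrative assumption. Note that the client sees one logical file, while HDFS transparently fetches the underlying blocks from whichever machines in the cluster hold them.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml on the classpath points fs.defaultFS
        // at the HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // print the file line by line
            }
        }
        fs.close();
    }
}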
Hadoop Architecture

At its core, the Hadoop architecture comprises three main components, viz.

1. The Hadoop Common package
2. The MapReduce Engine
3. The Hadoop Distributed File System (HDFS)

In the passages that follow, we describe each component separately.
The Hadoop Common package

As we indicated above, Hadoop distributes its load across a cluster of several commodity machines running simultaneously. Consequently, for effective delegation of work, every Hadoop-compatible file system keeps track of where each worker node is running in the cluster of machines. Using these whereabouts, Hadoop applications can run the work on the node where the data resides.
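As an illustration of this location awareness, the standard HDFS client API can report, for any file, which hosts hold each of its blocks; a scheduler uses exactly this information to place work next to the data. A minimal, hypothetical sketch (the path is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Ask the file system which hosts hold each block of the file;
        // work can then be shipped to those very machines.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " hosted on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}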
Thus, the Hadoop Common package offers support for dealing with the underlying operating systems and their file system structures on the various machines in the cluster. Besides this, the package also contains the necessary Java Archive (JAR) files needed to start the Hadoop software! And you would have to visit the same package in case you need the documentation and source code of Hadoop itself!
The MapReduce Engine

The MapReduce Engine could be either MapReduce/MR1 or YARN/MR2; it is what empowers the software to process massive data sets efficiently. Thus, the MapReduce Engine could be reckoned as one of the main workhorses behind the Hadoop software. The engine consists of a JobTracker, to which client applications submit the tasks to be performed. Upon successful receipt of a task, the JobTracker forwards it to TaskTracker-based worker nodes in the cluster. However, if a TaskTracker fails or times out, its part of the job is rescheduled.
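To give a feel for how a client application hands a job to this engine, here is the classic word-count program written against the modern org.apache.hadoop.mapreduce (MR2) API. It is a minimal sketch, closely following the stock Hadoop tutorial example; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // add up all the 1s emitted for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into wordcount.jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output.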
While delegating a task to a worker node, the TaskTracker spawns a new JVM (Java Virtual Machine) process for the task on that node. Thus, if the running task crashes its JVM, only that dedicated JVM crashes, not the JVM in which the TaskTracker itself is running! This is one of the most fascinating aspects of the MapReduce Engine.
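Because every task gets its own child JVM, the resources given to those JVMs can be tuned independently of the TaskTracker's own process. A hypothetical sketch of setting this from client code follows; mapred.child.java.opts is the classic MR1 property for the child JVM's options, and the 512 MB heap value is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;

public class ChildJvmSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Options passed to each per-task child JVM that the TaskTracker spawns.
        // A crash under this heap limit kills only the child, not the TaskTracker.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}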
Even while a task is in flight on a worker node, the TaskTracker and the JobTracker continuously interact with each other. Through this interaction, the running status at the worker node can be observed in a web browser via the embedded Jetty server.
And, as we alluded to earlier, the Hadoop Distributed File System (HDFS) provides the platform, or software infrastructure, required for distributing the workload across the several commodity machines in the dedicated cluster.