Everything You Should Know About Hadoop
TRANSCRIPT
The Necessity

Have you ever imagined the amount of data that an organization like the Indian Railways would generate from its various transactions per day? And how much it would accumulate by the end of a month? And, further still, the massive amount of data that would amass by the end of a fiscal year?
Or consider the multinational telecommunication carriers that maintain data for their huge number of clients and the transactions each makes per day, which may include daily logs of calls received and made, SMSs received and sent, billing details per transaction, registration details of each customer, and so on and so forth. (And do not forget that in a country like India there are more mobile phones and mobile users than dedicated toilets, and with tariffs so cheap, the customer count is not going to dwindle in the near future!)
Quite obviously, all such enterprise endeavours generate gargantuan volumes of data, and handling it is certainly a hard nut to crack. As if this were not enough, the challenge of processing such huge chunks of data is further compounded by the fact that the hardware might fail without giving any indication whatsoever. Quite conceivably, any hardware malfunction has the potential to stymie the entire process, bringing everything to a grinding halt.
Thus, to handle and process massive data in an environment where hardware reliability is a serious question mark, the Apache Software Foundation came out with a solution addressing both issues - handling and processing massive data even in the face of hardware malfunction. They named the solution Apache Hadoop!
Introduction to Hadoop
As with most other software products that the Apache Software Foundation (ASF) has released, Apache Hadoop is an open-source software framework. It has the ingredients to handle and process large-scale data. To do so, Hadoop does not mandate supercomputers; rather, it advocates the use of clusters of everyday, commodity hardware.
Furthermore, since hardware malfunction is an accepted norm, Hadoop readily assumes at the architectural level that, while computing data, hardware might go wrong; nevertheless, all such irregularities must be detected and handled in software by the framework itself.
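The chief mechanism by which HDFS puts this assumption into practice is block replication: every block of a file is copied to several machines, so the loss of a single disk or node does not lose data. Below is a minimal, hypothetical Java sketch of a client raising the replication factor of a file it writes to HDFS; the path and the factor of 3 are illustrative assumptions, while the org.apache.hadoop.fs.FileSystem API shown is the standard HDFS client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // assumes a reachable HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/important.log"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("data worth keeping despite node failures");
        }

        // Ask HDFS to keep 3 copies of every block of this file, so a
        // single hardware failure cannot destroy the data.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}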
It is therefore noticeable that Hadoop does not merely take distributed computing to the next level; it also incorporates the intelligence to handle bizarre hardware malfunctions. And all of this is done on the fly!
Hadoop Framework Composition

In order to materialize these objectives, at its core the Hadoop framework is composed of four synergistically working modules. They are:
1. Hadoop Common: Officially, this module is the library of common utility support that the other modules need in order to run. Thus, Hadoop Common constitutes the very basic foundation used by all the other modules. Hadoop Common can be likened to the JRE library features of Java.
2. Hadoop Distributed File System (HDFS): HDFS is the key module for distributing the workload across a cluster of commodity hardware while maintaining very high aggregate bandwidth through the cluster of machines. The conceptualization of HDFS was derived from the specification of the Google File System (GFS). (A short sketch of reading a file through the HDFS client API follows this list.)
3. Hadoop YARN: To understand Hadoop YARN, another module (or sub-framework) in the Apache Hadoop project, we can compare it with that part of an operating system which schedules different tasks for the microprocessor and, at the same time, manages the resources the microprocessor requires to discharge the job efficiently. Thus, Hadoop YARN schedules tasks and manages resources for the various commodity machines in the cluster.
4. Hadoop MapReduce: Remember we said that Hadoop can very efficiently manage large data sets by virtue of its prudent usage of commodity hardware; well, MapReduce, a YARN-based programming model for parallel processing, is exactly what materializes this! Like HDFS, MapReduce is also derived from Google's MapReduce specification!
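As promised in item 2 above, here is a minimal, hypothetical sketch of what a client reading a file from HDFS looks like; the path is an illustrative assumption. Note that the client sees one logical file, while HDFS transparently fetches the underlying blocks from whichever machines in the cluster hold them.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml on the classpath points fs.defaultFS
        // at the HDFS cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/input.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // print the file line by line
            }
        }
        fs.close();
    }
}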
Hadoop Architecture

At its core, the Hadoop architecture comprises three main components, viz.

1. The Hadoop Common package
2. The MapReduce Engine
3. The Hadoop Distributed File System (HDFS)

In the passages that follow, we describe each component separately.
The Hadoop Common package

As we indicated above, Hadoop distributes its load across a cluster of several commodity machines running simultaneously. Consequently, for effective delegation of work, every Hadoop-compatible file system keeps track of where each worker node is running in the cluster of machines. Using these whereabouts, Hadoop applications can run the work on the node where the data resides.
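As an illustration of this location awareness, the standard HDFS client API can report, for any file, which hosts hold each of its blocks; a scheduler uses exactly this information to place work next to the data. A minimal, hypothetical sketch (the path is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Ask the file system which hosts hold each block of the file;
        // work can then be shipped to those very machines.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " hosted on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}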
Thus, the Hadoop Common package offers support for dealing with the underlying operating systems and their file system structures on the various machines in the cluster. Besides this, the package also contains the necessary Java Archive (JAR) files needed to start the Hadoop software! And you would have to visit the same package in case you need the documentation and source code of Hadoop itself!
The MapReduce Engine

The MapReduce Engine could be either MapReduce/MR1 or YARN/MR2; it is what empowers the software to process massive data sets efficiently. Thus, the MapReduce Engine could be reckoned as one of the main workhorses behind the Hadoop software. The engine consists of a JobTracker, to which client applications submit the tasks to be performed. Upon successful receipt of a task, the JobTracker forwards it to TaskTracker-based worker nodes in the cluster. However, if a TaskTracker fails or times out, its part of the job is rescheduled.
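To give a feel for how a client application hands a job to this engine, here is the classic word-count program written against the modern org.apache.hadoop.mapreduce (MR2) API. It is a minimal sketch, closely following the stock Hadoop tutorial example; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get(); // add up all the 1s emitted for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into wordcount.jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output.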
While delegating a task to a worker node, the TaskTracker spawns a new JVM (Java Virtual Machine) process for the task on that node. Thus, if the running task crashes its JVM, only that dedicated JVM crashes, not the JVM in which the TaskTracker itself is running! This is one of the most fascinating aspects of the MapReduce Engine.
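Because every task gets its own child JVM, the resources given to those JVMs can be tuned independently of the TaskTracker's own process. A hypothetical sketch of setting this from client code follows; mapred.child.java.opts is the classic MR1 property for the child JVM's options, and the 512 MB heap value is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;

public class ChildJvmSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Options passed to each per-task child JVM that the TaskTracker spawns.
        // A crash under this heap limit kills only the child, not the TaskTracker.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}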
Even while a task is in flight on a worker node, the TaskTracker and the JobTracker continuously interact with each other. Through this interaction, the running status at the worker node can be observed in a web browser via the embedded Jetty server.
And, as we alluded to earlier, the Hadoop Distributed File System (HDFS) provides the platform, or software infrastructure, required for distributing the workload across the several commodity machines in the dedicated cluster.