Everything You Should Know About Hadoop


The Necessity: Have you ever imagined the amount of data that an organization like the Indian Railways generates from its various transactions per day? How much it accumulates by the end of a month? And, further still, the massive amount of data that amasses by the end of a fiscal year?

Or consider the multinational telecommunication carriers who maintain data for their huge number of clients and their transactions per day, which may include keeping daily logs of calls received and made, SMSs received and sent, billing details per transaction, registration details of each customer, and so on. (And do not forget that in a country like India there are more mobile phones and mobile users than dedicated toilets, and with cheap tariffs the customer count is not going to dwindle in the near future!)

Quite obviously, all such enterprise endeavours generate gargantuan amounts of data, and handling it is one hard nut to crack. As if this were not enough, to further compound the challenge of processing such huge chunks of data, the hardware might also fail without giving any indication whatsoever. Quite conceivably, any hardware malfunction has the potential to stymie the entire process, bringing everything to a grinding halt.

Thus, to handle and process massive data in an environment where hardware reliability is a serious question mark, the Apache Software Foundation came out with a solution addressing both issues - handling and processing massive data even in the face of hardware malfunction - and they named the solution Apache Hadoop!

    Introduction to Hadoop


As with most other software products that the Apache Software Foundation (ASF) has released, Apache Hadoop is an open source software framework. It has the ingredients to handle and process large-scale data. To do so, Hadoop does not mandate supercomputers; rather, it advocates the use of clusters of everyday, commodity hardware.

Furthermore, since hardware malfunction is an accepted norm, Hadoop assumes at the architectural level that hardware might go wrong while computing data, and that all such irregularities must be detected and handled in software by the framework itself.

Therefore, it is noticeable that Hadoop does not merely take distributed computing to the next level but also incorporates the intelligence to handle even bizarre hardware malfunctions. And all of this is done on the fly!

Hadoop Framework Composition

In order to materialize these objectives, at its core the Hadoop framework is composed of four synergistically working modules. They are:

1. Hadoop Common: Officially, this module is the library of common utility support that the other modules need in order to run. Thus, Hadoop Common constitutes the very basic foundation used by the other modules. Hadoop Common can be compared to the JRE library of Java.

2. Hadoop Distributed File System (HDFS): HDFS is the key module for distributing the workload across a cluster of commodity hardware while maintaining very high overall bandwidth through the cluster of machines. Notably, the conceptualization of HDFS was derived from the specification of the Google File System (GFS).

3. Hadoop YARN: To understand Hadoop YARN, another module or sub-framework in the Apache Hadoop project, we can compare it with the part of the operating system that schedules different tasks for the processor and, at the same time, manages the resources the processor requires to discharge the job efficiently.

Thus, Hadoop YARN schedules tasks and manages resources for the various commodity machines in the cluster.


4. Hadoop MapReduce: Remember we said that Hadoop can very efficiently manage large data sets by virtue of its prudent use of commodity hardware? Well, MapReduce, a YARN-based programming model for parallel processing, is what helps materialize exactly that! Like HDFS, MapReduce is also derived from Google's MapReduce specification.

Hadoop Architecture

At its core, the Hadoop architecture comprises three main components, viz. the Hadoop Common package, the MapReduce Engine, and the Hadoop Distributed File System. In the passages that follow we describe each component separately.

The Hadoop Common package

As we indicated above, Hadoop distributes its load across a cluster of several commodity machines running simultaneously. Consequently, for effective delegation of work, every Hadoop-compatible file system keeps track of where each worker node is running in the cluster of machines. Using these whereabouts, Hadoop applications can run the work on the node where the data resides.

Thus, the Hadoop Common package offers support for dealing with the underlying operating systems and their file system structures on the various machines in the cluster. Besides, this package also contains the necessary Java Archive (JAR) files needed to start the Hadoop software! And you would turn to the same package if you need the documentation and source code of Hadoop itself!
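As a minimal sketch of how the utilities in Hadoop Common are typically used (assuming the standard org.apache.hadoop.conf.Configuration class; the property values shown are only placeholders), a client program usually begins by building a Configuration, which the other modules then read their settings from:

import org.apache.hadoop.conf.Configuration;

public class CommonConfigSketch {
    public static void main(String[] args) {
        // Configuration is part of Hadoop Common; by default it loads
        // core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Values can also be set or overridden programmatically.
        // "hdfs://namenode:9000" is just a placeholder address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        conf.setInt("dfs.replication", 3);

        // Other Hadoop modules (HDFS, MapReduce, YARN) read their
        // settings through this same Configuration object.
        System.out.println("Default FS : " + conf.get("fs.defaultFS"));
        System.out.println("Replication: " + conf.getInt("dfs.replication", 1));
    }
}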

The MapReduce Engine

The MapReduce Engine can be either MapReduce/MR1 or YARN/MR2, and it is what empowers the software to process massive data sets efficiently. Thus, the MapReduce Engine can be reckoned as one of the main workhorses behind the Hadoop software. The MapReduce Engine consists of a JobTracker, to which client applications submit the tasks to be performed. Upon successful receipt of a task, it is forwarded to the TaskTracker-based worker nodes in the cluster. However, if a TaskTracker fails or times out, the job is rescheduled.
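To make the client-side flow concrete, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API (class names and input/output paths are illustrative): the driver builds a Job, and submitting it is what hands the work to the JobTracker (or, under YARN/MR2, the ResourceManager) for distribution to the worker nodes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // The reduce phase: sum the counts emitted for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are placeholders passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion() submits the job to the cluster and blocks
        // until the distributed map and reduce tasks finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A driver like this is usually packaged into a jar and launched with something along the lines of hadoop jar wordcount.jar WordCount /input /output (the jar name and paths here are hypothetical), after which the framework distributes the map and reduce tasks across the cluster.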

  • 7/27/2019 Everything You Should Know About Hadoop

    4/4

While delegating a task to a worker node, the TaskTracker launches a new JVM (Java Virtual Machine) process for the task on that worker node. Thus, if the running task crashes its JVM, only the JVM dedicated to that task crashes, not the JVM in which the TaskTracker itself is running! This is one of the most fascinating aspects of the MapReduce Engine.

Even while a task is running on a worker node, the TaskTracker and the JobTracker continuously interact with each other. The running status at the worker node can be observed through this interaction in a web browser via the embedded Jetty server.

The Hadoop Distributed File System

And as we alluded to earlier, the Hadoop Distributed File System (HDFS) provides the platform, or the software infrastructure, required for distributing the workload over several commodity machines in the dedicated cluster.
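As a closing illustration, here is a hedged sketch of how an application typically writes and reads an HDFS file through the org.apache.hadoop.fs.FileSystem abstraction (the NameNode address and file path are placeholders); the framework itself takes care of splitting, replicating, and placing the blocks across the commodity machines in the cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this usually comes
        // from core-site.xml rather than being hard-coded.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");

        // Write: HDFS splits the data into blocks and replicates them
        // across DataNodes; the client API hides all of that.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back line by line.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}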