Hadoop

Presented by NIKHIL P L

Apache Hadoop

• Developer(s): Apache Software Foundation
• Type: Distributed file system
• License: Apache License 2.0
• Written in: Java
• Operating system: Cross-platform
• Created by: Doug Cutting (2005)
• Inspired by: Google's MapReduce and GFS (Google File System)

Sub projects

• HDFS
– Distributed, scalable, and portable file system
– Stores large data sets
– Copes with hardware failure
– Runs on top of the existing file system

HDFS - Replication

• Data blocks are replicated to multiple nodes

• Replication allows a node to fail without data loss
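
The idea can be illustrated with a toy sketch in Python. This is not HDFS code and not HDFS's real placement policy (which is rack-aware); the function names `place_blocks` and `readable_blocks`, the node names, and the round-robin placement are all assumptions made for illustration. The replication factor of 3 matches the usual HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Start at a different node for each block so replicas spread out.
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


# Hypothetical four-node cluster holding six blocks.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)

# Losing one node leaves every block readable from another replica.
survivors = readable_blocks(placement, failed_nodes={"node2"})
print(len(survivors) == len(placement))  # True
```

With three replicas, a block is lost only if all three nodes holding it fail at the same time.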

Sub projects (continued)

• MapReduce
– Programming model that originated at Google
– Hadoop's core model for processing data in parallel
– Built from Map and Reduce functions
– Useful in a wide range of applications: distributed pattern-based searching, distributed sorting, web link-graph reversal, machine learning, statistical machine translation
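
The Map and Reduce functions can be sketched with a word count, the customary introductory example. This is a single-process Python illustration of the model, not Hadoop's Java API; the names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions chosen for this sketch. In a real cluster the map and reduce calls would run on different nodes and the framework would perform the grouping step.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call sees only one document and each reduce call sees only one key, both phases can be spread across many machines without coordination between calls.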

MapReduce - Workflow

Hadoop cluster (Terminology)

Types of Nodes

• HDFS nodes
– NameNode (Master)
– DataNodes (Slaves)

• MapReduce nodes
– JobTracker (Master)
– TaskTrackers (Slaves)

Types of Nodes (continued)

Sub projects (continued)

• Hive
– Provides data summarization, query, and analysis
– Initially developed by Facebook

• HBase
– Open-source, non-relational, distributed database
– Provides Bigtable-like capabilities, modeled on Google's Bigtable

Sub projects (continued)

• ZooKeeper
– Distributed configuration service, synchronization service, notification system, and naming registry for large distributed systems

• Pig
– A language and compiler that generate Hadoop programs
– Originally developed at Yahoo!

How does Hadoop work?

• How HDFS works

How does Hadoop work? (continued)

• How MapReduce works

How does Hadoop work? (continued)

• How MapReduce works (continued)

How does Hadoop work? (continued)

• Managing Hadoop jobs

Applications

• Marketing analytics
• Machine learning (e.g. spam filters)
• Image processing
• Processing of XML messages

• World's largest Hadoop production application
• ~20,000 machines running Hadoop

• The largest Hadoop cluster in the world, with 100 PB of storage
• 1,200 machines with 8 cores each + 800 machines with 16 cores each
• 32 GB of RAM per machine
• 65 million files in HDFS
• 12 TB of compressed data added per day
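
To get a feel for the scale, the machine counts above can be added up. The short Python calculation below uses only the figures on this slide and assumes they all describe the same cluster at the same point in time; the variable names are illustrative.

```python
# (machine count, cores per machine), taken from the figures above.
machine_groups = [(1200, 8), (800, 16)]
ram_per_machine_gb = 32

total_machines = sum(count for count, _ in machine_groups)
total_cores = sum(count * cores for count, cores in machine_groups)
total_ram_gb = total_machines * ram_per_machine_gb

print(total_machines)  # 2000
print(total_cores)     # 22400
print(total_ram_gb)    # 64000
```

That is 2,000 machines, 22,400 cores, and 64,000 GB of RAM in total.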

Other Users

Thanks
