Single Node Cluster Using Hadoop
Post on 14-Apr-2018
TRANSCRIPT
-
7/30/2019 Single Node cluster Using Hadoop
Cloud Computing Using Hadoop
Rahul Poddar 11500110119
Santosh Kumar 11500110006
Shubham Raj 11500110054
Vinayak Raj 11500110019
6th Semester, CSE-B, BPPIMT
-
Outline
Brief introduction to cloud computing
Requirements for this project
What is Hadoop and its properties
What led to the development of Hadoop?
MapReduce
HDFS
An example application on Hadoop
-
What is cloud computing
Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet).
The Cloud aims to cut costs and help users focus on their core business instead of being impeded by IT obstacles.
The main enabling technologies for cloud computing are virtualization and autonomic computing.
-
With cloud computing, other companies host your computers.
-
Cloud Computing Architecture
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
These three services encapsulate the basic components of cloud computing.
-
Software requirements for a Hadoop project
Java requirements: Hadoop is a Java-based system. Recent versions of Hadoop require Sun Java 1.6.
Operating system: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also run on Windows, but Windows requires Cygwin to be installed.
Installing Hadoop: Hadoop 1.0.3 or above, installed as either a single-node or a multi-node cluster.
-
Hardware requirements for Hadoop (small cluster, 5-50 nodes)
Hadoop and HBase require two types of machines:
1) Masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)
2) Slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)
Two quad-core CPUs
12 GB to 24 GB of RAM
-
Here comes Hadoop
Hadoop is a scalable, fault-tolerant grid operating system for data storage and processing.
Its scalability comes from the combination of:
HDFS: self-healing, high-bandwidth clustered storage
MapReduce: fault-tolerant distributed processing
Operates on structured and unstructured data
-
Here comes Hadoop
A large and active ecosystem (many developers and additions like HBase, Pig, and Hive)
Open source under the Apache License
http://wiki.apache.org/hadoop/
-
Characteristics of Hadoop
Commodity hardware: add inexpensive servers
Uses replication across servers to deal with unreliable storage/servers
Support for moving computation close to the data
Servers have two purposes: data storage and computation
-
Need for Hadoop: Big Data
We live in the age of very large and complex data, called BIG DATA.
IDC estimates that the total size of the digital universe is 1.8 zettabytes, where one zettabyte equals 10^21 bytes.
That is roughly equivalent to each person in the world having one hard disk drive.
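A quick back-of-the-envelope check of that claim (the ~7 billion population figure is an assumption for the period, not from the slides):

```python
# 1.8 zettabytes spread over ~7 billion people (assumed population)
total_bytes = 1.8e21      # 1.8 ZB; 1 ZB = 10**21 bytes
people = 7e9              # rough world population at the time
per_person = total_bytes / people
print("%.0f GB per person" % (per_person / 1e9))  # -> 257 GB per person
```

A few hundred gigabytes per person is indeed about the size of a consumer hard disk of that era.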
-
Need for Hadoop: Big Data
Every day, 2.5 quintillion (2.5 x 10^18) bytes of data are generated.
90% of the world's data has been generated in just the last two years.
Such a large amount of ever-increasing data is becoming difficult for traditional RDBMS and grid computing systems to manage.
-
Sources of Big Data
The New York Stock Exchange generates about one terabyte of new trade data per day.
Facebook hosts approximately 10 billion photos, taking up 1 petabyte of storage.
The Large Hadron Collider at CERN, Geneva, produces about 15 petabytes of data per year.
The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
-
Inefficiency and high expenses
The high cost of high-end servers and other proprietary hardware and software for processing and storing large amounts of data, along with their maintenance cost, is unbearable for many industrial organisations. Upgrading and maintaining these servers to scale up their capacity also requires a huge cost.
-
Not Robust
The traditional single-server architecture is not robust, because one large computer takes care of all the computing. If it fails or shuts down, the whole system breaks down and enterprises incur huge losses. During repairs or upgrades the computer has to be switched off, and in the meantime no useful tasks are executed, so computation lags.
-
MapReduce algorithm
MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers.
MapReduce gives regular programmers the ability to produce parallel distributed programs much more easily.
MapReduce consists of two simple functions:
map()
reduce()
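The idea behind the two functions can be sketched with a toy single-process word count; this is an illustration of the model, not Hadoop itself, and the names map_fn, reduce_fn, and run_mapreduce are made up for the example:

```python
from collections import defaultdict

def map_fn(line):
    # The "map" function: emit a (key, value) pair per word
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # The "reduce" function: combine all values seen for one key
    return (key, sum(values))

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                     # map phase
        for key, value in map_fn(line):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

print(run_mapreduce(["a b a", "b a"]))     # -> {'a': 3, 'b': 2}
```

Hadoop runs the same two phases, but distributes the map and reduce calls across the cluster and handles the grouping (the "shuffle") between them.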
-
MapReduce algorithm
"Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes.
A worker node may do this again in turn, leading to a multi-level tree structure.
Each worker node processes its smaller problem and passes the answer back to its master node.
-
MapReduce algorithm
"Reduce" step: the master node collects the answers to all the sub-problems from the slaves.
The master then combines the answers in some way to form the output: the answer to the problem it was originally trying to solve.
-
MapReduce: High Level
A MapReduce job is submitted by a client computer to the JobTracker, which runs on the master node. Each slave node runs a TaskTracker, which executes task instances on that node.
-
Some MapReduce Terminology
Job: a full program, i.e. an execution of a Mapper and Reducer across a data set
Task: an execution of a Mapper or a Reducer on a slice of data, a.k.a. a Task-In-Progress (TIP)
Task Attempt: a particular instance of an attempt to execute a task on a machine
-
Terminology Example
Running WordCount across 20 files is one job.
20 files to be mapped imply 20 map tasks, plus some number of reduce tasks.
At least 20 map task attempts will be performed, and more if a machine crashes, etc.
-
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS is part of the Apache Hadoop project, which originated as part of the Apache Lucene project.
-
HDFS Architecture
Master-slave architecture.
DFS master (Namenode):
Manages the file system namespace
Maintains the mapping from file name to list of blocks and their locations
Manages block allocation/replication
Checkpoints the namespace and journals namespace changes for reliability
Controls access to the namespace
DFS slaves (Datanodes) handle block storage:
Store blocks using the underlying OS's files
Clients access the blocks directly from the datanodes
Periodically send block reports to the Namenode
Periodically check block integrity
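The Namenode's two key mappings can be pictured as a pair of dictionaries; this is a toy sketch only, and every path, block ID, and node name below is invented for illustration:

```python
# file name -> ordered list of block IDs (the namespace the Namenode manages)
namespace = {
    "/logs/app.log": ["blk_1", "blk_2"],
}

# block ID -> datanodes holding a replica (3x replication in this sketch)
block_locations = {
    "blk_1": ["datanode1", "datanode2", "datanode3"],
    "blk_2": ["datanode2", "datanode3", "datanode4"],
}

def locate(path):
    """Return (block, replica locations) pairs; a client would then
    read each block directly from one of its datanodes."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

print(locate("/logs/app.log"))
```

This mirrors the read path above: the Namenode answers only the metadata lookup, while the block bytes flow directly between the client and the datanodes.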
-
An Example: Weather Data Mining
Weather sensors all across the globe are collecting climatic data.
The data can be obtained from the National Climatic Data Centre (http://www.ncdc.noaa.gov/).
We will focus only on temperature, for simplicity.
The input will be data from the NCDC, which is given as key-value pairs to map().
The output given by reduce() will be the maximum temperature of each year.
-
Weather Data Mining
Mapper.py:
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print "%s\t%s" % (year, temp)
-
Weather Data Mining
Reduce.py:
#!/usr/bin/env python
import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
    print "%s\t%s" % (last_key, max_val)
-
Running the program
To run a test:
% cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \
    sort | src/main/ch02/python/max_temperature_reduce.py
Output:
1949    111
1950    22
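The reducer's logic in that pipeline can also be checked in plain Python without Hadoop or the shell; the sample year/temperature lines below are invented, standing in for the mapper's tab-separated output after the sort stage:

```python
# Tab-separated "year\ttemp" lines, already sorted by key
# (the `sort` stage in the pipeline guarantees this ordering)
lines = ["1949\t78", "1949\t111", "1950\t22", "1950\t0"]

max_by_year = {}
for line in lines:
    year, temp = line.split("\t")
    max_by_year[year] = max(max_by_year.get(year, int(temp)), int(temp))

print(max_by_year)  # -> {'1949': 111, '1950': 22}
```

The result matches the shape of the output above: one maximum temperature per year.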
-
References
Hadoop Wiki: http://hadoop.apache.org/core/
http://wiki.apache.org/hadoop/GettingStartedWithHadoop
http://wiki.apache.org/hadoop/HadoopMapReduce
http://hadoop.apache.org/core/docs/current/hdfs_design.html