Big Data Processing Using Hadoop: Poster Presentation



Hadoop: Cloud versus Commodity Hardware

Presenter: Amrut Patil
Advisor: Dr. Rajendra K. Raj
Rochester Institute of Technology

Contact
Amrut Patil
Rochester Institute of Technology
Email: [email protected]

References
1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (OSDI'04), pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.
2. Lam, Chuck (2011). Hadoop in Action. Stamford, CT: Manning Publications Co.
3. Hadoop 1.1.2 Documentation, http://hadoop.apache.org/docs/stable/cluster_setup.html#Purpose

Overview
• Big Data is becoming more commonplace, both in scientific research and industrial settings.
• Hadoop, an open-source framework for parallelized and distributed storage and processing, is gaining popularity for processing vast amounts of data.
• This project investigates the use of Hadoop for Big Data processing.
• We compare the design and implementation of Hadoop infrastructure in a cloud setting and on commodity hardware.

Hadoop on the Cloud

• Set up an AWS account and obtain the AWS authentication credentials, namely the Access Key ID, Secret Access Key, X.509 certificate file, X.509 private key file, and AWS account ID.
• Set up the command line tools to start and stop EC2 instances.
• Prepare an SSH key pair: the public key is embedded in the EC2 instance and the private key stays on the local machine; together they establish a secure communication channel.
• Set up Hadoop on EC2 by configuring the security parameters (AWS Account ID, AWS Access Key ID, and AWS Secret Access Key) in the single initialization script at src/contrib/ec2/bin/hadoop-ec2-env.sh, as sketched below.
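A minimal sketch of the credential section of hadoop-ec2-env.sh, assuming the Hadoop 1.x contrib/ec2 scripts; the values are placeholders and the exact variable names may differ slightly between releases:

  # AWS credentials used by the contrib/ec2 launch scripts
  AWS_ACCOUNT_ID=1234-5678-9012               # your AWS account ID
  AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
  # SSH key pair used to log in to the launched instances
  KEY_NAME=gsg-keypair
  PRIVATE_KEY_PATH=$HOME/.ssh/id_rsa-gsg-keypair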

• To launch a Hadoop cluster on EC2, use:
  hadoop-ec2 launch-cluster <cluster-name> <number-of-slaves>
• To log in to the master node of the cluster, use:
  hadoop-ec2 login <cluster-name>
• To test the functionality of the Hadoop cluster, run:
  bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• To shut down the cluster, use:
  bin/hadoop-ec2 terminate-cluster <cluster-name>
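As a usage sketch, an end-to-end test session might look like the following; the cluster name test-cluster and the two-slave size are illustrative:

  # launch a cluster named test-cluster with 2 slave nodes
  bin/hadoop-ec2 launch-cluster test-cluster 2
  # log in to the master node of the cluster
  bin/hadoop-ec2 login test-cluster
  # on the master: estimate pi with 10 maps of 10,000,000 samples each
  bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  # back on the local machine: terminate the cluster to stop incurring EC2 charges
  bin/hadoop-ec2 terminate-cluster test-cluster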

Conclusions

• Verified the functionality of the Hadoop cluster by installing and running Hive, a data warehousing package (a minimal check of this kind is sketched after this list).
• Accessible: the infrastructure can be set up on commodity hardware as well as in a cloud setting.
• Scalable: cluster capacity can easily be increased by adding more machines.
• Fault tolerant: in case of failure, Hadoop automatically restarts failed tasks.
• Low cost: one can quickly and cheaply create a cluster from a set of machines.
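A minimal Hive check of the kind referred to above, assuming the hive command line client is installed on the cluster; the table name is a placeholder, and the SELECT COUNT(*) query exercises the cluster because Hive compiles it into a MapReduce job:

  hive -e "CREATE TABLE poster_test (id INT, name STRING);"
  hive -e "SHOW TABLES;"
  hive -e "SELECT COUNT(*) FROM poster_test;"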

Hadoop Background

• Hadoop employs a master/slave architecture for distributed storage and computation.
• The distributed storage system is called the Hadoop Distributed File System (HDFS).
• Building blocks of Hadoop for data processing:
  • NameNode: the master of HDFS. Tracks how files are broken down into blocks and which nodes store those blocks, and directs the slave DataNodes to perform I/O tasks.
  • DataNode: reads and writes HDFS blocks to and from the local file system.
  • Secondary NameNode: takes snapshots of HDFS metadata at predefined intervals. Useful for fault tolerance.
  • JobTracker: determines which tasks to process, assigns nodes to tasks, and monitors tasks while they are running.
  • TaskTracker: manages the execution of individual tasks on each slave node.
• Hadoop uses the MapReduce framework to easily scale data processing over multiple computing nodes, as illustrated below.
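As an illustration of this scaling, the bundled wordcount example distributes its map tasks across the DataNodes that hold the input blocks; the file and directory paths below are placeholders:

  # copy a local file into HDFS
  bin/hadoop fs -put /tmp/sample.txt input/sample.txt
  # run the bundled wordcount example over the input directory
  bin/hadoop jar hadoop-*-examples.jar wordcount input output
  # inspect the reduced word counts
  bin/hadoop fs -cat output/part-*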

Approaches for Implementing Hadoop

• In a cloud setting: utilized Amazon Web Services (AWS), namely Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• On commodity hardware: utilized several old PCs that were being retired, running Ubuntu 12.04 LTS.

Hadoop on Commodity Hardware
• Choose one specific node to host the NameNode and JobTracker daemons. This machine also starts the DataNode and TaskTracker daemons on all slave nodes, driven by the host lists sketched below.
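In Hadoop 1.x the set of machines started from the master is driven by two plain-text host lists in the conf/ directory; the hostnames below are placeholders. Note that conf/masters names the host for the Secondary NameNode, while conf/slaves lists the machines that the start scripts contact over SSH to run DataNode and TaskTracker:

  # conf/masters - host that runs the Secondary NameNode
  master
  # conf/slaves - hosts that run the DataNode and TaskTracker daemons
  slave1
  slave2
  slave3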

• Set up passphraseless SSH so the master can remotely access every node in the cluster: the master's public key is stored on every node, while the private key remains on the master.
• User accounts should have the same name on all nodes.
• Generate an RSA key pair on the master node using:
  ssh-keygen -t rsa
• Copy the public key to every slave node as well as the master node using:
  scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
• Log in to the target node from the master:
  ssh target
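A consolidated sketch of the passphraseless SSH setup, including the step of authorizing the copied key on each node; the hostname slave1 is a placeholder, and hadoop-user follows the account name used in the scp command above:

  # on the master: generate a key pair with an empty passphrase
  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  # copy the public key to each node (master included)
  scp ~/.ssh/id_rsa.pub hadoop-user@slave1:~/master_key
  # on each node: authorize the master's key and tighten permissions
  ssh hadoop-user@slave1 'mkdir -p ~/.ssh && cat ~/master_key >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
  # verify that the master can now log in without a password prompt
  ssh hadoop-user@slave1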

• Hadoop configuration settings are contained in three XML files: core-site.xml, hdfs-site.xml, and mapred-site.xml (a minimal fully distributed example is sketched below).
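A minimal fully distributed configuration for Hadoop 1.x, assuming a master host named master; the hostname, ports, and replication factor are illustrative:

  core-site.xml:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

  hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

  mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>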

• Hadoop can be run in three operational modes:
  • Local (standalone) mode: Hadoop runs completely on the local machine. HDFS is not used and no Hadoop daemons are launched.
  • Pseudo-distributed mode: all daemons run on a single machine. Mainly used for development work.
  • Fully distributed mode: the actual Hadoop cluster runs in this mode.
• To start the Hadoop daemons: bin/start-all.sh
• To stop the Hadoop daemons: bin/stop-all.sh
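A short sketch of bringing the cluster up and checking that the daemons are running; formatting the NameNode is a standard first-time step that erases any existing HDFS data, so it is shown here only for a fresh cluster:

  # format HDFS once, before the very first start
  bin/hadoop namenode -format
  # start all daemons listed in conf/masters and conf/slaves
  bin/start-all.sh
  # on the master, jps should show NameNode, JobTracker and SecondaryNameNode;
  # on each slave it should show DataNode and TaskTracker
  jps
  # summarize HDFS capacity and the live DataNodes
  bin/hadoop dfsadmin -report
  # stop all daemons when finished
  bin/stop-all.sh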

Common Architecture of Hadoop Cluster

Figure 1: Typical Hadoop cluster. Master/slave configuration with the NameNode and JobTracker as masters (only one of each per cluster) and a DataNode/TaskTracker pair on each of the slave nodes (Slave 1 through Slave N); a Secondary NameNode also runs alongside the masters.