Big Data Processing Using Hadoop: Poster Presentation



Hadoop: Cloud versus Commodity Hardware

Presenter: Amrut Patil
Advisor: Dr. Rajendra K. Raj
Rochester Institute of Technology

Contact
Amrut Patil
Rochester Institute of Technology
Email: [email protected]

References
1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation (OSDI'04), pages 10-10, Berkeley, CA, USA, 2004. USENIX Association.
2. Lam, Chuck (2011). Hadoop in Action. Stamford, CT: Manning Publications Co.
3. Hadoop 1.1.2 Documentation, http://hadoop.apache.org/docs/stable/cluster_setup.html#Purpose

Overview
• Big Data is becoming more commonplace, both in scientific research and industrial settings.
• Hadoop, an open-source framework for parallelized and distributed storage and processing, is gaining popularity for processing vast amounts of data.
• This project investigates the use of Hadoop for Big Data processing.
• We compare the design and implementation of Hadoop infrastructure in a cloud setting and on commodity hardware.

Hadoop on the Cloud

• Set up an AWS account and obtain the AWS authentication credentials, namely the Access Key ID, Secret Access Key, X.509 certificate file, X.509 private key file, and AWS account ID.
• Set up the command line tools to start and stop EC2 instances.
• Prepare an SSH key pair: the public key is embedded in the EC2 instance and the private key stays on the local machine; together they establish a secure communication channel.
• Set up Hadoop on EC2 by configuring the security parameters (AWS Account ID, AWS Access Key ID, and AWS Secret Access Key) in the single initialization script at src/contrib/ec2/bin/hadoop-ec2-env.sh, as sketched below.
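A minimal sketch of the credential section of hadoop-ec2-env.sh, assuming the Hadoop 1.x contrib/ec2 scripts; the values are placeholders and the exact variable names may differ slightly between releases:

  # AWS credentials used by the contrib/ec2 launch scripts
  AWS_ACCOUNT_ID=1234-5678-9012               # your AWS account ID
  AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
  # SSH key pair used to log in to the launched instances
  KEY_NAME=gsg-keypair
  PRIVATE_KEY_PATH=$HOME/.ssh/id_rsa-gsg-keypair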

• To launch a Hadoop cluster on EC2, use:
  hadoop-ec2 launch-cluster <cluster-name> <number-of-slaves>
• To log in to the master node of the cluster, use:
  hadoop-ec2 login <cluster-name>
• To test the functionality of the Hadoop cluster, run:
  bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• To shut down the cluster, use:
  bin/hadoop-ec2 terminate-cluster <cluster-name>
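As a usage sketch, an end-to-end test session might look like the following; the cluster name test-cluster and the two-slave size are illustrative:

  # launch a cluster named test-cluster with 2 slave nodes
  bin/hadoop-ec2 launch-cluster test-cluster 2
  # log in to the master node of the cluster
  bin/hadoop-ec2 login test-cluster
  # on the master: estimate pi with 10 maps of 10,000,000 samples each
  bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
  # back on the local machine: terminate the cluster to stop incurring EC2 charges
  bin/hadoop-ec2 terminate-cluster test-cluster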

Conclusions

• Verified the functionality of the Hadoop cluster by installing and running Hive, a data warehousing package (a minimal check of this kind is sketched after this list).
• Accessible: the infrastructure can be set up on commodity hardware as well as in a cloud setting.
• Scalable: cluster capacity can easily be increased by adding more machines.
• Fault tolerant: in case of failure, Hadoop automatically restarts failed tasks.
• Low cost: one can quickly and cheaply create a cluster from a set of machines.
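A minimal Hive check of the kind referred to above, assuming the hive command line client is installed on the cluster; the table name is a placeholder, and the SELECT COUNT(*) query exercises the cluster because Hive compiles it into a MapReduce job:

  hive -e "CREATE TABLE poster_test (id INT, name STRING);"
  hive -e "SHOW TABLES;"
  hive -e "SELECT COUNT(*) FROM poster_test;"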

Hadoop Background

• Hadoop employs a master/slave architecture for distributed storage and computation.
• The distributed storage system is called the Hadoop Distributed File System (HDFS).
• Building blocks of Hadoop for data processing:
  • NameNode: the master of HDFS. Tracks how files are broken down into blocks and which nodes store those blocks, and directs the slave DataNodes to perform I/O tasks.
  • DataNode: reads and writes HDFS blocks to and from the local file system.
  • Secondary NameNode: takes snapshots of HDFS metadata at predefined intervals. Useful for fault tolerance.
  • JobTracker: determines which tasks to process, assigns nodes to tasks, and monitors tasks while they are running.
  • TaskTracker: manages the execution of individual tasks on each slave node.
• Hadoop uses the MapReduce framework to easily scale data processing over multiple computing nodes, as illustrated below.
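As an illustration of this scaling, the bundled wordcount example distributes its map tasks across the DataNodes that hold the input blocks; the file and directory paths below are placeholders:

  # copy a local file into HDFS
  bin/hadoop fs -put /tmp/sample.txt input/sample.txt
  # run the bundled wordcount example over the input directory
  bin/hadoop jar hadoop-*-examples.jar wordcount input output
  # inspect the reduced word counts
  bin/hadoop fs -cat output/part-*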

Approaches for Implementing Hadoop

• In a cloud setting: utilized Amazon Web Services (AWS), namely Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• On commodity hardware: utilized several old PCs that were being retired, running Ubuntu 12.04 LTS.

Hadoop on Commodity Hardware
• Choose one specific node to host the NameNode and JobTracker daemons. This machine also starts the DataNode and TaskTracker daemons on all slave nodes, driven by the host lists sketched below.
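In Hadoop 1.x the set of machines started from the master is driven by two plain-text host lists in the conf/ directory; the hostnames below are placeholders. Note that conf/masters names the host for the Secondary NameNode, while conf/slaves lists the machines that the start scripts contact over SSH to run DataNode and TaskTracker:

  # conf/masters - host that runs the Secondary NameNode
  master
  # conf/slaves - hosts that run the DataNode and TaskTracker daemons
  slave1
  slave2
  slave3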

• Set up passphraseless SSH so the master can remotely access every node in the cluster: the master's public key is stored on every node, while the private key remains on the master.
• User accounts should have the same name on all nodes.
• Generate an RSA key pair on the master node using:
  ssh-keygen -t rsa
• Copy the public key to every slave node as well as the master node using:
  scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
• Log in to the target node from the master:
  ssh target
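A consolidated sketch of the passphraseless SSH setup, including the step of authorizing the copied key on each node; the hostname slave1 is a placeholder, and hadoop-user follows the account name used in the scp command above:

  # on the master: generate a key pair with an empty passphrase
  ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  # copy the public key to each node (master included)
  scp ~/.ssh/id_rsa.pub hadoop-user@slave1:~/master_key
  # on each node: authorize the master's key and tighten permissions
  ssh hadoop-user@slave1 'mkdir -p ~/.ssh && cat ~/master_key >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
  # verify that the master can now log in without a password prompt
  ssh hadoop-user@slave1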

• Hadoop configuration settings are contained in three XML files: core-site.xml, hdfs-site.xml, and mapred-site.xml (a minimal fully distributed example is sketched below).
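A minimal fully distributed configuration for Hadoop 1.x, assuming a master host named master; the hostname, ports, and replication factor are illustrative:

  core-site.xml:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

  hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>

  mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>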

• Hadoop can be run in three operational modes:
  • Local (standalone) mode: Hadoop runs completely on the local machine. HDFS is not used and no Hadoop daemons are launched.
  • Pseudo-distributed mode: all daemons run on a single machine. Mainly used for development work.
  • Fully distributed mode: the actual Hadoop cluster runs in this mode.
• To start the Hadoop daemons: bin/start-all.sh
• To stop the Hadoop daemons: bin/stop-all.sh
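A short sketch of bringing the cluster up and checking that the daemons are running; formatting the NameNode is a standard first-time step that erases any existing HDFS data, so it is shown here only for a fresh cluster:

  # format HDFS once, before the very first start
  bin/hadoop namenode -format
  # start all daemons listed in conf/masters and conf/slaves
  bin/start-all.sh
  # on the master, jps should show NameNode, JobTracker and SecondaryNameNode;
  # on each slave it should show DataNode and TaskTracker
  jps
  # summarize HDFS capacity and the live DataNodes
  bin/hadoop dfsadmin -report
  # stop all daemons when finished
  bin/stop-all.sh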

Common Architecture of Hadoop Cluster

Figure 1: Typical Hadoop cluster. Master/slave configuration with the NameNode and JobTracker as masters (only one of each per cluster) and a DataNode/TaskTracker pair on each of the slave nodes (Slave 1 through Slave N); a Secondary NameNode also runs alongside the masters.