Capital One Hadoop Class

Search the AWS Marketplace (https://aws.amazon.com/marketplace) for a CentOS 6.4 x86_64 with updates instance. The same search from the AWS Instances link does not return the same results.


Login as root:

[dc@localhost Downloads]$ ssh -i bigtop2.pem [email protected]
The authenticity of host 'ec2-50-19-133-93.compute-1.amazonaws.com (50.19.133.93)' can't be established.
RSA key fingerprint is ee:90:19:6d:67:44:1e:a8:85:0d:7f:03:35:21:42:8c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-50-19-133-93.compute-1.amazonaws.com,50.19.133.93' (RSA) to the list of known hosts.
[root@ip-10-147-220-207 ~]#

Create a user and password:

[root@ip-10-147-220-207 ~]# useradd dc
[root@ip-10-147-220-207 ~]# passwd dc
Changing password for user dc.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.

Install an editor as root:

yum install nano

Edit the sudoers file:

## Allow root to run any commands anywhere
root ALL=(ALL) ALL
dc ALL=(ALL) ALL

Add the line that begins with dc (dc ALL=(ALL) ALL) below the existing root line, replacing dc with your own user name.
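
As a hedged sketch of the sudoers edit above: the helper below only prints the rule for a given user and does not touch /etc/sudoers. The name sudoers_rule is made up for illustration. The commented usage assumes your /etc/sudoers contains an "#includedir /etc/sudoers.d" line, which you should check first; validating with visudo before relying on the file is one way to avoid a syntax error that locks you out of sudo.

```shell
# Print a sudoers rule granting full sudo rights to one user.
# sudoers_rule is a hypothetical helper name, not part of sudo.
sudoers_rule() {
    printf '%s ALL=(ALL) ALL\n' "$1"
}

# Show the rule for the user created above.
sudoers_rule dc

# Possible use as root (assumes "#includedir /etc/sudoers.d" is present):
#   sudoers_rule dc > /etc/sudoers.d/dc
#   chmod 440 /etc/sudoers.d/dc
#   visudo -c -f /etc/sudoers.d/dc   # syntax check only
```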

Change to the user and go to the user home directory.

[root@ip-10-147-220-207 ~]# su dc
[dc@ip-10-147-220-207 root]$ cd
[dc@ip-10-147-220-207 ~]$ pwd
/home/dc
[dc@ip-10-147-220-207 ~]$

Apache Bigtop is the project that packages the Hadoop components together so they can be installed as one stack. Hortonworks and Cloudera use it as a starting point for building their distributions.

Follow the instructions on the Apache Bigtop install page here:

https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop+0.6.0

In the bigtop repo install command, replace 0.6.0 with 0.7.0 and select the CentOS 6 repository (for CentOS 6.4):

sudo wget -O /etc/yum.repos.d/bigtop.repo http://www.apache.org/dist/bigtop/bigtop-0.7.0/repos/centos6/bigtop.repo

This downloads the bigtop.repo file and saves it as /etc/yum.repos.d/bigtop.repo.
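
The version and distribution substitution described above can be sketched as a small helper. This is only an illustration: bigtop_repo_url is a made-up name, and the URL layout is copied from the wget command above rather than checked against other releases.

```shell
# Build the Bigtop repo-file URL for a release and a distro directory.
# bigtop_repo_url is a hypothetical helper; the URL pattern is taken
# from the wget command in this walkthrough.
bigtop_repo_url() {
    version=$1
    distro=$2
    printf 'http://www.apache.org/dist/bigtop/bigtop-%s/repos/%s/bigtop.repo\n' "$version" "$distro"
}

# Print the URL used in this walkthrough.
bigtop_repo_url 0.7.0 centos6

# Possible use:
#   sudo wget -O /etc/yum.repos.d/bigtop.repo "$(bigtop_repo_url 0.7.0 centos6)"
```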

Then run:

sudo yum update

[dc@ip-10-147-220-207 ~]$ sudo yum install hadoop\*

Install a JDK and set JAVA_HOME. Older versions of Hadoop required JDK 1.6.x. There are patches you can apply to move to JDK 1.7.x.

>sudo wget https://s3.amazonaws.com/victormongo/jdk-6u25-linux-x64.bin.1

[dc@ip-10-147-220-207 ~]$ ls
jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ sudo chmod 777 jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$ ls
jdk-6u25-linux-x64.bin.1
[dc@ip-10-147-220-207 ~]$

Run the executable, which unpacks the JDK into jdk1.6.0_25/ in the current directory:

./jdk-6u25-linux-x64.bin.1

[dc@ip-10-147-220-207 ~]$ sudo mkdir /usr/java
[dc@ip-10-147-220-207 ~]$ sudo mv jdk1.6.0_25/ /usr/java
[sudo] password for dc:
[dc@ip-10-147-220-207 ~]$ cd /usr/java
[dc@ip-10-147-220-207 java]$ ls
jdk1.6.0_25

[dc@ip-10-147-220-207 java]$ sudo ln -s /usr/java/jdk1.6.0_25/ /usr/java/latest
[dc@ip-10-147-220-207 java]$ ls -al
total 12
drwxr-xr-x.  3 root root 4096 Mar 18 19:04 .
drwxr-xr-x. 14 root root 4096 Mar 18 18:58 ..
drwxr-xr-x. 10 dc   dc   4096 Mar 18 19:03 jdk1.6.0_25
lrwxrwxrwx.  1 root root   22 Mar 18 19:04 latest -> /usr/java/jdk1.6.0_25/
[dc@ip-10-147-220-207 java]$

[dc@ip-10-147-220-207 java]$ sudo ln -s /usr/java/latest /usr/java/default
[dc@ip-10-147-220-207 java]$ ls -al
total 12
drwxr-xr-x.  3 root root 4096 Mar 18 19:05 .
drwxr-xr-x. 14 root root 4096 Mar 18 18:58 ..
lrwxrwxrwx.  1 root root   16 Mar 18 19:05 default -> /usr/java/latest
drwxr-xr-x. 10 dc   dc   4096 Mar 18 19:03 jdk1.6.0_25
lrwxrwxrwx.  1 root root   22 Mar 18 19:04 latest -> /usr/java/jdk1.6.0_25/
[dc@ip-10-147-220-207 java]$
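
The two links form a chain: default points to latest, and latest points to the real JDK directory, so switching JDKs later only means re-pointing latest. A minimal sketch of that chain, run in a scratch directory instead of /usr/java, follows. The helper name link_jdk is an assumption for illustration; the -n flag makes ln replace an existing link rather than create a new one inside the directory it points to.

```shell
# Create the latest -> <jdk dir> and default -> latest links under a base directory.
# link_jdk is a hypothetical helper name.
link_jdk() {
    base=$1
    jdk=$2
    ln -sfn "$base/$jdk" "$base/latest"
    ln -sfn "$base/latest" "$base/default"
}

# Demonstrate in a scratch directory rather than /usr/java.
demo=$(mktemp -d)
mkdir "$demo/jdk1.6.0_25"
link_jdk "$demo" jdk1.6.0_25
ls -l "$demo"

# Possible use on the instance: sudo sh -c with the same two ln commands
# and /usr/java as the base directory.
```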

Edit .bashrc and add JAVA_HOME:

GNU nano 2.0.9 File: .bashrc Modified

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# User specific aliases and functions

export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin

[dc@ip-10-147-220-207 ~]$ source .bashrc

Verify Java is installed:

[dc@ip-10-147-220-207 ~]$ java -version
java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
[dc@ip-10-147-220-207 ~]$
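
If you repeat this setup on several instances, the .bashrc edit can be scripted. The sketch below appends the two export lines only when no JAVA_HOME export is present, so running it twice does not duplicate them. The function name add_java_home is hypothetical, and the demonstration writes to a scratch file rather than a real .bashrc.

```shell
# Append the JAVA_HOME exports to a shell startup file unless already present.
# add_java_home is a hypothetical helper name.
add_java_home() {
    rcfile=$1
    if ! grep -q '^export JAVA_HOME=' "$rcfile" 2>/dev/null; then
        # Single quotes keep $PATH and $JAVA_HOME literal in the file.
        printf '%s\n' \
            'export JAVA_HOME=/usr/java/default' \
            'export PATH=$PATH:$JAVA_HOME/bin' >> "$rcfile"
    fi
}

# Demonstrate on a scratch file; the second call adds nothing.
rc=$(mktemp)
add_java_home "$rc"
add_java_home "$rc"
cat "$rc"

# Possible use: add_java_home "$HOME/.bashrc" && . "$HOME/.bashrc"
```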

Install Hadoop:

sudo yum install hadoop\*

Format the namenode:

sudo /etc/init.d/hadoop-hdfs-namenode init

Start the HDFS daemons (the namenode and datanode) in pseudo-distributed mode:

[dc@ip-10-147-220-207 ~]$ for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i start ; done
Starting Hadoop namenode: [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-10-147-220-207.out
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-10-147-220-207.out

Now create the file directories for the Hadoop components.

Execute the following commands:

hadoop fs -ls -R /

sudo -u hdfs hadoop fs -mkdir -p /user/$USER

[dc@ip-10-147-220-207 ~]$ hadoop fs -ls -R /

drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user

drwxr-xr-x - hdfs supergroup 0 2014-03-18 19:55 /user/dc

sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER

[dc@ip-10-147-220-207 ~]$ sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER




drwxr-xr-x - dc dc 0 2014-03-18 19:55 /user/dc

[dc@ip-10-147-220-207 ~]$

sudo -u hdfs hadoop fs -chmod 770 /user/$USER



drwxrwx--- - dc dc 0 2014-03-18 19:55 /user/dc

[dc@ip-10-147-220-207 ~]$

sudo -u hdfs hadoop fs -mkdir /tmp

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn

sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

sudo -u hdfs hadoop fs -mkdir -p /user/history

sudo -u hdfs hadoop fs -chown mapred:mapred /user/history

sudo -u hdfs hadoop fs -chmod 770 /user/history

sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging

sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate

sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
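
The commands above repeat one pattern: create a directory, optionally change its owner, optionally change its mode. As a rough sketch, that pattern can be driven from a small table. The helper below only prints the commands so they can be reviewed first; hdfs_dir_cmds and the "-" placeholder are assumptions made for this illustration, and the table covers just three of the directories listed above (without the -R flag, which makes no difference on a freshly created, empty directory).

```shell
# Print (not run) hadoop fs commands for a "path owner mode" table on stdin.
# hdfs_dir_cmds is a hypothetical helper; "-" means "leave unchanged".
hdfs_dir_cmds() {
    while read -r path owner mode; do
        [ -n "$path" ] || continue
        echo "sudo -u hdfs hadoop fs -mkdir -p $path"
        [ "$owner" = "-" ] || echo "sudo -u hdfs hadoop fs -chown $owner $path"
        [ "$mode" = "-" ] || echo "sudo -u hdfs hadoop fs -chmod $mode $path"
    done
}

# Paths, owners and modes copied from the commands above.
hdfs_dir_cmds <<'EOF'
/tmp - 1777
/var/log/hadoop-yarn yarn:mapred -
/user/history mapred:mapred 770
EOF

# To execute instead of print, pipe the output to sh after reviewing it.
```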

There are more subdirectories to create for the Hadoop components. Run the init-hdfs.sh script:

sudo /usr/lib/hadoop/libexec/init-hdfs.sh

Start the YARN daemons:

[dc@ip-10-147-220-207 ~]$ sudo service hadoop-yarn-resourcemanager start
Starting Hadoop resourcemanager: [ OK ]
starting resourcemanager, logging to /var/log/hadoop-yarn/yarn-yarn-resourcemanager-ip-10-147-220-207.out
[dc@ip-10-147-220-207 ~]$ sudo service hadoop-yarn-nodemanager start
Starting Hadoop nodemanager: [ OK ]
starting nodemanager, logging to /var/log/hadoop-yarn/yarn-yarn-nodemanager-ip-10-147-220-207.out
[dc@ip-10-147-220-207 ~]$

Run the bundled pi example to verify that MapReduce jobs run on YARN:

[dc@ip-10-147-220-207 ~]$ sudo -u hdfs hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 1000
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
14/03/18 20:03:04 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/03/18 20:03:04 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/03/18 20:03:04 INFO input.FileInputFormat: Total input paths to process : 10
14/03/18 20:03:04 INFO mapreduce.JobSubmitter: number of splits:10
14/03/18 20:03:04 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/18 20:03:04 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
14/03/18 20:03:04 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/18 20:03:04 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/18 20:03:04 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/18 20:03:04 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/18 20:03:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1395172880059_0001
14/03/18 20:03:05 INFO client.YarnClientImpl: Submitted application application_1395172880059_0001 to ResourceManager at /0.0.0.0:8032
14/03/18 20:03:05 INFO mapreduce.Job: The url to track the job: http://ip-10-147-220-207:8088/proxy/application_1395172880059_0001/
14/03/18 20:03:05 INFO mapreduce.Job: Running job: job_1395172880059_0001

14/03/18 20:04:41 INFO mapreduce.Job: Job job_1395172880059_0001 completed successfully
14/03/18 20:04:41 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=226
        FILE: Number of bytes written=814354
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2610
        HDFS: Number of bytes written=215
        HDFS: Number of read operations=43
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=3
    Job Counters
        Launched map tasks=10
        Launched reduce tasks=1
        Rack-local map tasks=10
        Total time spent by all maps in occupied slots (ms)=393538
        Total time spent by all reduces in occupied slots (ms)=36734
    Map-Reduce Framework
        Map input records=10
        Map output records=20
        Map output bytes=180
        Map output materialized bytes=280
        Input split bytes=1430
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=280
        Reduce input records=20
        Reduce output records=0
        Spilled Records=40
        Shuffled Maps =10
        Failed Shuffles=0
        Merged Map outputs=10
        GC time elapsed (ms)=6221
        CPU time spent (ms)=7220
        Physical memory (bytes) snapshot=2499035136
        Virtual memory (bytes) snapshot=6820196352
        Total committed heap usage (bytes)=1643577344
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=1180
    File Output Format Counters
        Bytes Written=97
Job Finished in 97.444 seconds
Estimated value of Pi is 3.14080000000000000000
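
To get a feel for what the job computes, here is a single-machine sketch of the same kind of estimate: scatter points over the unit square and count the fraction that falls inside the inscribed circle, which approximates pi/4. The bundled pi example is understood to use a quasi-Monte Carlo method of this general kind, but treat that, the Halton bases 2 and 3, and the helper name estimate_pi as assumptions of this sketch rather than a description of the Hadoop code. It is not expected to reproduce the exact value printed above.

```shell
# Estimate pi from n Halton-sequence points in the unit square.
# The share of points inside the circle of radius 0.5 approximates pi/4.
# estimate_pi is a hypothetical helper, not the Hadoop example's code.
estimate_pi() {
    awk -v n="$1" '
        # Radical inverse of i in the given base (one Halton coordinate).
        function halton(i, base,    f, r) {
            f = 1; r = 0
            while (i > 0) {
                f = f / base
                r = r + f * (i % base)
                i = int(i / base)
            }
            return r
        }
        BEGIN {
            inside = 0
            for (i = 1; i <= n; i++) {
                x = halton(i, 2) - 0.5
                y = halton(i, 3) - 0.5
                if (x * x + y * y <= 0.25) inside++
            }
            printf "%.4f\n", 4 * inside / n
        }'
}

# 10 maps x 1000 samples in the job above is 10000 points in total.
estimate_pi 10000
```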

The single node is now running both HDFS and MR/YARN. Start two more instances and verify that each of them also works in pseudo-distributed mode.

Make sure the two new instances launch in the same region and availability zone.

On each instance, create a user (do not install as root) and install Hadoop using the yum repo procedure listed above.

Stopping and starting AWS instances: start and stop the instances from the AWS management console. After a restart, ssh in, run su $USERNAME, then cd to return to the user's home directory.

To view the Namenode UI, disable SELinux and turn off iptables:

1) In /etc/selinux/config, change SELINUX=enforcing to SELINUX=disabled and reboot.
2) Turn off iptables:

sudo service iptables stop
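
Step 1 above can be sketched as a small script. This is an illustration only: disable_selinux is a made-up helper name, the demonstration runs on a scratch file rather than the real /etc/selinux/config, and the change still needs a reboot to take effect. The file is rewritten in place (not replaced), so a symlinked config keeps pointing at the same file.

```shell
# Switch SELINUX=enforcing to SELINUX=disabled in a config file, keeping a .bak copy.
# disable_selinux is a hypothetical helper name.
disable_selinux() {
    cfg=$1
    cp "$cfg" "$cfg.bak"
    sed 's/^SELINUX=enforcing$/SELINUX=disabled/' "$cfg.bak" > "$cfg"
}

# Demonstrate on a scratch file instead of /etc/selinux/config.
demo_cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo_cfg"
disable_selinux "$demo_cfg"
cat "$demo_cfg"

# Possible use as root: disable_selinux /etc/selinux/config, then reboot.
```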

To prevent the iptables service from starting at boot, use:

sudo chkconfig iptables off

Verify with sudo chkconfig --list iptables:

[dc@ip-10-85-91-254 ~]$ sudo chkconfig --list iptables
iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off

With iptables enabled, the output normally looks like:

[root@ip-10-85-31-193 ~]# sudo chkconfig --list iptables
iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
[root@ip-10-85-31-193 ~]#
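
When checking several instances, the comparison between the two outputs above can be automated. The sketch below succeeds only if no runlevel is marked on in a chkconfig --list line read from standard input. The name all_runlevels_off is hypothetical, and note that empty input also counts as "all off", so pipe real chkconfig output into it.

```shell
# Succeed only if the chkconfig --list line on stdin shows no runlevel as "on".
# all_runlevels_off is a hypothetical helper name.
all_runlevels_off() {
    ! grep -q ':on'
}

# Demonstrate with the two sample lines shown above.
if echo 'iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off' | all_runlevels_off; then
    echo "iptables is disabled at boot"
fi
if ! echo 'iptables 0:off 1:off 2:on 3:on 4:on 5:on 6:off' | all_runlevels_off; then
    echo "iptables still starts at boot"
fi

# Possible use: sudo chkconfig --list iptables | all_runlevels_off
```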

Converting to Distributed mode from Pseudo-distributed

Set up networking and ssh. Make sure you can ping each server and ssh into every other node. Label each instance nn, dn1, dn2, dn3. We are going to set up a cluster with a namenode (nn) and 3 datanodes. Record which service is started on which instance so you know which server is running which service.

ifconfig lists the private IP address of each instance:

[dc@ip-10-85-91-254 ~]$ ifconfig
eth0      Link encap:Ethernet  HWaddr 12:31:3D:15:4C:10
          inet addr:10.85.91.254  Bcast:10.85.91.255  Mask:255.255.254.0
          inet6 addr: fe80::1031:3dff:fe15:4c10/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2897 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2377 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:245482 (239.7 KiB)  TX bytes:260672 (254.5 KiB)
          Interrupt:247

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:48416 errors:0 dropped:0 overruns:0 frame:0
          TX packets:48416 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6738928 (6.4 MiB)  TX bytes:6738928 (6.4 MiB)

Add host names to /etc/hosts that correspond to the instance labels just created.

The /etc/hosts files on all 3 instances should be the same:

[dc@ip-10-62-122-8 ~]$ cat /etc/hosts
10.85.91.254 dn1
10.85.31.193 nn
10.62.122.8 dn3

127.0.0.1 localhost.localdomain localhost

::1 localhost6.localdomain6 localhost6
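
Because every instance needs the same entries, a quick check that none are missing can save debugging time later. The following is a minimal sketch: missing_hosts is a hypothetical helper that prints each expected name not found as a host name field in the given file, and the demonstration uses a scratch copy of the entries shown above.

```shell
# Print each expected host name that is missing from a hosts file.
# missing_hosts is a hypothetical helper: first argument is the file, the rest are names.
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        # Look for the name as a whole field after the address on a non-comment line.
        awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file" || echo "$name"
    done
}

# Demonstrate on a scratch copy of the entries above; dn2 is reported as missing.
hosts=$(mktemp)
printf '10.85.91.254 dn1\n10.85.31.193 nn\n10.62.122.8 dn3\n' > "$hosts"
missing_hosts "$hosts" nn dn1 dn2 dn3

# Possible use on each instance: missing_hosts /etc/hosts nn dn1 dn3
```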

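Ping all 3 nodes from each node to verify the setup is correct. Use -c to limit the number of packets so that each ping returns and the next one runs:

ping -c 3 nn; ping -c 3 dn1; ping -c 3 dn3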

Set the replication factor to 3.

Start the datanode services.

Start the namenode and datanodes.
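
The replication factor is controlled by the HDFS property dfs.replication. As a hedged sketch, the helper below prints the property element to place inside the <configuration> section of hdfs-site.xml. The helper name replication_property is made up, and the file location /etc/hadoop/conf/hdfs-site.xml mentioned in the comment is an assumption about the Bigtop package layout that you should confirm on your instance.

```shell
# Print an hdfs-site.xml property element that sets the default replication factor.
# replication_property is a hypothetical helper; dfs.replication is the HDFS property name.
replication_property() {
    cat <<EOF
<property>
  <name>dfs.replication</name>
  <value>$1</value>
</property>
EOF
}

# Show the element for a replication factor of 3.
replication_property 3

# Paste the output inside <configuration> in hdfs-site.xml
# (assumed to be /etc/hadoop/conf/hdfs-site.xml), then restart the HDFS services.
```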

Verify 3 nodes are up using the namenode UI:

Verify blocks replicated:

[dc@ip-10-85-31-193 hadoop-hdfs]$ sudo -u hdfs hadoop fsck /testdir/testfile -files -blocks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://nn:50070
FSCK started by hdfs (auth:SIMPLE) from /10.85.31.193 for path /testdir/testfile at Wed Mar 19 20:48:46 UTC 2014
/testdir/testfile 126 bytes, 1 block(s): OK
0. BP-1370670758-10.85.31.193-1395261360244:blk_2735064026059661308_1002 len=126 repl=3

Status: HEALTHY
 Total size: 126 B
 Total dirs: 0
 Total files: 1
 Total blocks (validated): 1 (avg. block size 126 B)
 Minimally replicated blocks: 1 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 3
 Number of racks: 1
FSCK ended at Wed Mar 19 20:48:46 UTC 2014 in 1 milliseconds

The filesystem under path '/testdir/testfile' is HEALTHY
[dc@ip-10-85-31-193 hadoop-hdfs]$ ls
hadoop-hdfs-datanode-ip-10-85-31-193.log
hadoop-hdfs-datanode-ip-10-85-31-193.out
hadoop-hdfs-namenode-ip-10-85-31-193.log
hadoop-hdfs-namenode-ip-10-85-31-193.out
SecurityAuth-hdfs.audit
testfile
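
For files with many blocks, reading every repl= value by eye gets tedious. The sketch below pulls the smallest repl= value out of fsck output on standard input, so a result of 3 means every listed block has three replicas. The helper name min_replication is an assumption for illustration, and the demonstration uses the block line copied from the fsck output above.

```shell
# Print the smallest repl= value found in fsck -files -blocks output on stdin.
# min_replication is a hypothetical helper name.
min_replication() {
    grep -o 'repl=[0-9][0-9]*' | cut -d= -f2 | sort -n | head -n 1
}

# Sample block line copied from the fsck output above.
echo '0. BP-1370670758-10.85.31.193-1395261360244:blk_2735064026059661308_1002 len=126 repl=3' | min_replication

# Possible use:
#   sudo -u hdfs hadoop fsck /testdir/testfile -files -blocks | min_replication
```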
