Installation and Setup of Hadoop

DIPENDRA KUSI 2/1/17

https://www.linkedin.com/in/er-dipendra-kusi-b3674193

HADOOP SETUP




Step 1: Go to the VirtualBox site and download VirtualBox:

https://www.virtualbox.org/wiki/Downloads

Step 2: Go to the Cloudera site and download the Cloudera QuickStart VM:

http://www.cloudera.com/downloads/quickstart_vms/5-8.html



Step 3: Run the Cloudera VM in VirtualBox.

Step 4:

Open a terminal in the VM and check that Hadoop is installed by printing its version:

$ hadoop version



Step 5:

Also check the Hadoop configuration through the browser.



Step 6:

Go to http://tiny.cloudera.com/hadoopTutorialSample, download the word count source code, and extract it.

Step 7:

Open a terminal in the directory that contains wordcount.jar.

Create a folder for the input data:

$ hadoop fs -mkdir /user/cloudera/Hadoop_data /user/cloudera/Hadoop_data/input



Step 8:

Put the file to be processed into the /user/cloudera/Hadoop_data/input folder:

$ hadoop fs -put file0 /user/cloudera/Hadoop_data/input



Step 9:

Run the word count jar in Hadoop to count the words in file0:

$ hadoop jar wordcount.jar /user/cloudera/Hadoop_data/input /user/cloudera/Hadoop_data/output



Running this command throws a ClassNotFoundException. The jar file does not explicitly define its main class, so Hadoop treats the first argument after the jar name as the class to run. Pass the main class, org.myorg.WordCount, explicitly:

$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/Hadoop_data/input /user/cloudera/Hadoop_data/output

Now wordcount.jar runs in Hadoop.
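
One way to tell in advance whether a jar needs the class name on the command line is to read its manifest. The following is a minimal sketch, not part of Hadoop or of the tutorial sample: the class name MainClassCheck and the method mainClassOf are hypothetical, and the default path assumes wordcount.jar is in the current directory. It uses only the standard java.util.jar API to look up the Main-Class attribute.

```java
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for illustration; not part of Hadoop or the sample jar.
class MainClassCheck {

    // Returns the Main-Class manifest attribute, or null if the jar has none.
    static String mainClassOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null; // jar was built without a manifest
            }
            return manifest.getMainAttributes().getValue("Main-Class");
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the jar sits in the current directory unless a path is given.
        String jarPath = args.length > 0 ? args[0] : "wordcount.jar";
        String mainClass = mainClassOf(jarPath);
        if (mainClass == null) {
            System.out.println(jarPath + ": no Main-Class, pass the class name to hadoop jar");
        } else {
            System.out.println(jarPath + ": Main-Class is " + mainClass);
        }
    }
}
```

If the sketch reports no Main-Class, the class name has to follow the jar name on the command line, as in the command above.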



Step 10:

Now check the contents of the output:

$ hadoop fs -cat /user/cloudera/Hadoop_data/output/*



The output contains the word counts, as expected.

Create a jar file and run it in Hadoop

Step 11: Now create a Java file in Eclipse, export it to a jar, and run it in Hadoop.

First create the project Hadoop_first_project in Eclipse.



Now create a WordCount class and paste in the code below:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;

public class WordCount extends Configured implements Tool {

    private static final Logger LOG = Logger.getLogger(WordCount.class);

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }

    // args[0] is the HDFS input path, args[1] is the HDFS output path
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "wordcount");
        job.setJarByClass(this.getClass());
        // Use TextInputFormat, the default unless job.setInputFormatClass is used
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Mapper: emits (word, 1) for every non-empty token in a line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

        @Override
        public void map(LongWritable offset, Text lineText, Context context)
                throws IOException, InterruptedException {
            String line = lineText.toString();
            Text currentWord = new Text();
            for (String word : WORD_BOUNDARY.split(line)) {
                if (word.isEmpty()) {
                    continue;
                }
                currentWord = new Text(word);
                context.write(currentWord, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
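
To see what the mapper and reducer compute without starting a Hadoop job, the same logic can be tried on a couple of strings in plain Java. This is only a local sketch under stated assumptions: the class LocalWordCount, its methods tokenize and count, and the two sample lines are hypothetical and not part of the tutorial. It reuses the WORD_BOUNDARY pattern from the code above and replaces the Hadoop shuffle with an in-memory sorted map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical local stand-in for the MapReduce job; uses no Hadoop classes.
class LocalWordCount {

    // Same pattern as WORD_BOUNDARY in the WordCount mapper.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    // Mirrors the map step: split a line and skip empty tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : WORD_BOUNDARY.split(line)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Mirrors the reduce step: add 1 per occurrence of each token.
    // A TreeMap keeps the keys sorted, similar to the job's sorted output.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : tokenize(line)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample input; the real job reads file0 from HDFS.
        List<String> lines = Arrays.asList("Hadoop is fun", "Hadoop is fast");
        for (Map.Entry<String, Integer> entry : count(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The counting is case-sensitive, as in the mapper above, so "Hadoop" and "hadoop" would be counted as different words.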

At this point Eclipse reports errors because the Hadoop libraries are missing from the build path, so load the required libraries.

Go to the project properties.



Go to Java Build Path and open the Libraries tab.

Click Add External JARs and add the jars from the following location:

File System -> usr -> lib -> hadoop

Add all the jar files in that folder.



Go to the client-0.20 folder and add all the jars from there as well.

Go to the lib folder and add all the jars from there as well.

Click OK. All the errors should disappear.

Now export the project to a jar file.

Right-click on the project -> Export



Click JAR file -> Next



Select the project, choose the export location for the jar file, and click Next twice.



Click Browse to select the main class.



Click OK -> Finish



Now go to the location of the exported mywordcount.jar.

Delete the output folder created previously, because the job fails if the output directory already exists:

$ hadoop fs -rm -r /user/cloudera/Hadoop_data/output

Then run the jar in Hadoop. There is no need to name the class this time, since the main class was already set as the entry point during the export:

$ hadoop jar mywordcount.jar /user/cloudera/Hadoop_data/input /user/cloudera/Hadoop_data/output
