Hands on Hadoop and Pig
More details at http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
BigData using Hadoop and Pig
Sudar Muthu
Research Engineer, Yahoo Labs
http://sudarmuthu.com
http://twitter.com/sudarmuthu
Who am I?
Research Engineer at Yahoo Labs
Mines useful information from huge datasets
Has worked on both structured and unstructured data
Builds robots as a hobby ;)
What will we see today?
What is BigData?
Get our hands dirty with Hadoop
See some code
Try out Pig
A glimpse of HBase and Hive
What is BigData?
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools."
http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1 GB today is not the same as 1 GB just 10 years ago
Anything that doesn't fit into the RAM of a single machine
Types of Big Data
Data in movement (streams)
Twitter/Facebook comments
Stock market data
Access logs of a busy web server
Sensors: vital signs of a newborn
Data at rest (oceans)
Collection of what has streamed
Emails or IM messages
Social media
Unstructured documents: forms, claims
We have all this data and need to find a way to process it
Traditional way of scaling (Scaling up)
Make the machine more powerful
Add more RAM
Add more cores to CPU
It is going to be very expensive
Will be limited by disk seek and read time
Single point of failure
New way of scaling (Scaling out)
Add more instances of the same machine
Cost is less compared to scaling up
Immune to failure of a single node or a set of nodes
Disk seek and write time is not going to be the bottleneck
Future-safe (to some extent)
Is it fit for ALL types of problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant grid operating system for data storage and processing
What is Hadoop?
Runs on commodity hardware
HDFS: fault-tolerant, high-bandwidth clustered storage
MapReduce: distributed data processing
Works with structured and unstructured data
Open source, Apache license
Master (NameNode) – slave architecture
Design Principles
System shall manage and heal itself
Performance shall scale linearly
Algorithm should move to data – lower latency, lower bandwidth
Simple core, modular and extensible
Components of Hadoop
HDFS
MapReduce
Pig
HBase
Hive
Getting started with Hadoop
What am I not going to cover?
Installation or setup of Hadoop
Will be running all the code on a single-node instance
Monitoring of the clusters
Performance tuning
User authentication or quota
Before we get into code, let's understand some concepts
MapReduce
Framework for distributed processing of large datasets
MapReduce
Consists of two functions
Map – filters and transforms the input into something the reducer can understand
Reduce – aggregates over the input provided by the Map function
Formal definition
Map: <k1, v1> -> list(<k2, v2>)
Reduce: <k2, list(v2)> -> list(<k3, v3>)
Let’s see some examples
Count number of words in files
Map: <file_name, file_contents> => list(<word, count>)
Reduce: <word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map: <"file1", "to be or not to be"> => {<"to",1>, <"be",1>, <"or",1>, <"not",1>, <"to",1>, <"be",1>}
Count number of words in files
Reduce: {<"to",<1,1>>, <"be",<1,1>>, <"or",<1>>, <"not",<1>>}
=>
{<"to",2>, <"be",2>, <"or",1>, <"not",1>}
Max temperature in a year
Map: <file_name, file_contents> => list(<year, temp>)
Reduce: <year, list(temp)> => <year, max_temp>
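The slides only give the signatures for this one; below is a minimal Java sketch of such a mapper and reducer, assuming each input line holds a year and a temperature separated by a tab (the record format and class names are illustrative, not fixed):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits <year, temperature> for every input line of the form "1950<TAB>22"
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            context.write(new Text(parts[0]),
                    new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}

// Keeps only the maximum temperature seen for each year
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}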
HDFS
HDFS
Distributed file system
Data is distributed over different nodes
Will be replicated for failover
All of this is abstracted away from the algorithms
HDFS Commands
HDFS Commands
hadoop fs -mkdir <dir_name>
hadoop fs -ls <dir_name>
hadoop fs -rmr <dir_name>
hadoop fs -put <local_file> <remote_dir>
hadoop fs -get <remote_file> <local_dir>
hadoop fs -cat <remote_file>
hadoop fs -help
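The same operations are also available from Java through Hadoop's FileSystem API; a minimal sketch (the paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location from the usual Hadoop configuration files
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("input"));                                    // hadoop fs -mkdir input
        fs.copyFromLocalFile(new Path("words.txt"), new Path("input"));  // hadoop fs -put words.txt input
        for (FileStatus status : fs.listStatus(new Path("input"))) {     // hadoop fs -ls input
            System.out.println(status.getPath());
        }
    }
}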
Let’s write some code
Count Words Demo
Create a mapper class
Override the map() method
Create a reducer class
Override the reduce() method
Create a main method
Create a JAR
Run it on Hadoop
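A minimal sketch of the mapper and reducer class declarations from the first steps above (the class names follow the main method shown later; the package imports and generic key/value types are the usual ones for this job):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The map() method from the next slide lives inside this class
public class CountWordsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
}

// The reduce() method lives inside this class
class CountWordsReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
}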
Map Method
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line into words and emit <word, 1> for each one
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
}
Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Add up all the 1s emitted for this word
    int sum = 0;
    for (IntWritable value : values) {
        sum += value.get();
    }
    context.write(key, new IntWritable(sum));
}
Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Submit the job and block until it finishes
System.exit(job.waitForCompletion(true) ? 0 : 1);
Run it on Hadoop
hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at 1
be 3
can 7
can't 1
code 2
command 1
connect 1
consider 1
continued 1
control 4
could 1
couple 1
courtesy 1
desktop, 1
detailed 1
details 1
…
Pig
What is Pig?
Pig provides an abstraction for processing large datasets
Consists of:
Pig Latin – a language to express data flows
An execution environment
Why do we need Pig?
MapReduce can get complex if your data needs a lot of processing/transformations
MapReduce provides only primitive data structures
Pig provides rich data structures
Supports complex operations like joins
Running Pig programs
In an interactive shell called Grunt
As a Pig script
Embedded into Java programs (like JDBC)
Grunt – Interactive Shell
Grunt shell
fs commands – like hadoop fs:
fs -ls
fs -mkdir
fs -copyToLocal <file>
fs -copyFromLocal <local_file> <dest>
exec – execute Pig scripts
sh – execute shell commands
Let’s see them in action
Pig Latin
LOAD – read data from files
DUMP – dump data to the console
JOIN – join data sets
FILTER – filter data sets
ORDER – sort data
STORE – store data back into files
Let’s see some code
Sort words based on count
Filter words present in a list
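The demo scripts themselves aren't in the transcript; here is a rough sketch of what the sort-by-count flow could look like, run via the embedded-Java option mentioned earlier (the input path, field names, local mode and the count > 1 filter are all assumptions for illustration):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SortWordsByCount {
    public static void main(String[] args) throws Exception {
        // Local mode is enough for a demo; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load the <word, count> pairs produced by the CountWords job
        pig.registerQuery("counts = LOAD 'output/part-r-00000' AS (word:chararray, count:int);");
        // Sort words by their count, highest first
        pig.registerQuery("sorted = ORDER counts BY count DESC;");
        // Keep only the words that appear more than once (a simple FILTER)
        pig.registerQuery("frequent = FILTER sorted BY count > 1;");
        // Write the result back to the file system
        pig.store("frequent", "sorted_words");
    }
}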
HBase
What is HBase?
Distributed, column-oriented database built on top of HDFS
Useful when real-time read/write random-access to very large datasets is needed.
Can handle billions of rows with millions of columns
Hive
What is Hive?
Useful for managing and querying structured data
Provides SQL-like syntax
Metadata is stored in an RDBMS
Extensible with types, functions, scripts, etc.
Hadoop
Affordable storage/compute
Structured or unstructured data
Resilient, auto scalability
Relational Databases
Interactive response times
ACID
Structured data
Cost/scale prohibitive
Thank You