Hands on Hadoop and Pig

BigData using Hadoop and Pig
Sudar Muthu, Research Engineer, Yahoo Labs
http://sudarmuthu.com
http://twitter.com/sudarmuthu

Uploaded by sudar-muthu on 15-Jan-2015
Category: Technology

DESCRIPTION

More details at http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig

TRANSCRIPT

Page 1: Hands on Hadoop and pig

BigData using Hadoop and Pig

Sudar Muthu
Research Engineer, Yahoo Labs
http://sudarmuthu.com
http://twitter.com/sudarmuthu

Page 2: Hands on Hadoop and pig

Who am I?

- Research Engineer at Yahoo Labs
- Mines useful information from huge datasets
- Worked on both structured and unstructured data
- Builds robots as a hobby ;)

Page 3: Hands on Hadoop and pig

What will we see today?

- What is BigData?
- Get our hands dirty with Hadoop
- See some code
- Try out Pig
- Glimpse of HBase and Hive

Page 4: Hands on Hadoop and pig

What is BigData?

Page 5: Hands on Hadoop and pig

"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools."

http://en.wikipedia.org/wiki/Big_data

Page 6: Hands on Hadoop and pig

How big is BigData?

Page 7: Hands on Hadoop and pig

1GB today is not the same as 1GB just 10 years ago.

Page 8: Hands on Hadoop and pig

Anything that doesn't fit into the RAM of a single machine.

Page 9: Hands on Hadoop and pig

Types of Big Data

Page 10: Hands on Hadoop and pig

Data in Movement (streams)

- Twitter/Facebook comments
- Stock market data
- Access logs of a busy web server
- Sensors: vital signs of a newborn

Page 11: Hands on Hadoop and pig

Data at rest (Oceans)

- Collection of what has streamed
- Emails or IM messages
- Social media
- Unstructured documents: forms, claims

Page 12: Hands on Hadoop and pig

We have all this data and need to find a way to process it.

Page 13: Hands on Hadoop and pig

Traditional way of scaling (Scaling up)

- Make the machine more powerful: add more RAM, add more cores to the CPU
- It is going to be very expensive
- Limited by disk seek and read time
- Single point of failure

Page 14: Hands on Hadoop and pig

New way of scaling (Scaling out)

- Add more instances of the same machine
- Costs less compared to scaling up
- Immune to the failure of a single node or a set of nodes
- Disk seek and write time is not going to be a bottleneck
- Future safe (to some extent)

Page 15: Hands on Hadoop and pig

Is it fit for ALL types of problems?

Page 16: Hands on Hadoop and pig

Divide and conquer

Page 17: Hands on Hadoop and pig

Hadoop

Page 18: Hands on Hadoop and pig

A scalable, fault-tolerant grid operating system for data storage and processing.

Page 19: Hands on Hadoop and pig

What is Hadoop?

- Runs on commodity hardware
- HDFS: fault-tolerant, high-bandwidth, clustered storage
- MapReduce: distributed data processing
- Works with structured and unstructured data
- Open source, Apache license
- Master (NameNode) / slave architecture

Page 20: Hands on Hadoop and pig

Design Principles

- System shall manage and heal itself
- Performance shall scale linearly
- Algorithm should move to data: lower latency, lower bandwidth
- Simple core, modular and extensible

Page 21: Hands on Hadoop and pig

Components of Hadoop

- HDFS
- MapReduce
- Pig
- HBase
- Hive

Page 22: Hands on Hadoop and pig

Getting started with Hadoop

Page 23: Hands on Hadoop and pig

What am I not going to cover?

- Installation or setup of Hadoop (we will be running all the code on a single-node instance)
- Monitoring of the clusters
- Performance tuning
- User authentication or quota

Page 24: Hands on Hadoop and pig

Before we get into code, let's understand some concepts.

Page 25: Hands on Hadoop and pig

Map Reduce

Page 26: Hands on Hadoop and pig

Framework for distributed processing of large datasets.

Page 27: Hands on Hadoop and pig

MapReduce

Consists of two functions:

- Map: filters and transforms the input into something the reducer can understand
- Reduce: aggregates over the input provided by the Map function

Page 28: Hands on Hadoop and pig

Formal definition

Map: <k1, v1> -> list(<k2, v2>)

Reduce: <k2, list(v2)> -> list(<k3, v3>)

Page 29: Hands on Hadoop and pig

Let’s see some examples

Page 30: Hands on Hadoop and pig

Count number of words in files

Map: <file_name, file_contents> => list(<word, count>)

Reduce: <word, list(count)> => <word, sum_of_counts>

Page 31: Hands on Hadoop and pig

Count number of words in files

Map: <"file1", "to be or not to be"> =>
{<"to",1>, <"be",1>, <"or",1>, <"not",1>, <"to",1>, <"be",1>}

Page 32: Hands on Hadoop and pig

Count number of words in files

Reduce: {<"to",<1,1>>, <"be",<1,1>>, <"or",<1>>, <"not",<1>>} =>
{<"to",2>, <"be",2>, <"or",1>, <"not",1>}

Page 33: Hands on Hadoop and pig

Max temperature in a year

Map: <file_name, file_contents> => <year, temp>

Reduce: <year, list(temp)> => <year, max_temp>
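To see how these two steps fit together, the max-temperature flow can be simulated in plain Java outside Hadoop. This is a sketch only: the class name, the "year temperature" one-record-per-line input format, and the sample data are assumptions for illustration, not part of the original talk.

```java
import java.util.*;

class MaxTempSim {
    // Map step: turn file contents into a list of (year, temp) pairs
    static List<Map.Entry<String, Integer>> map(String fileContents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : fileContents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            pairs.add(Map.entry(parts[0], Integer.parseInt(parts[1])));
        }
        return pairs;
    }

    // Reduce step: for each year, keep only the maximum temperature seen
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> maxByYear = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            maxByYear.merge(p.getKey(), p.getValue(), Math::max);
        }
        return maxByYear;
    }

    public static void main(String[] args) {
        String data = "1950 0\n1950 22\n1949 111\n1949 78";
        // prints each year with its maximum temperature
        System.out.println(reduce(map(data)));
    }
}
```

In real Hadoop the framework, not your code, groups the mapper's pairs by key and feeds each group to the reducer; the merge call here stands in for that shuffle-and-group step.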

Page 34: Hands on Hadoop and pig

HDFS

Page 35: Hands on Hadoop and pig

HDFS

- Distributed file system
- Data is distributed over different nodes
- Replicated for failover
- Abstracted away from the algorithms

Pages 36-38: Hands on Hadoop and pig (image-only slides, not transcribed)

HDFS Commands

Page 39: Hands on Hadoop and pig

HDFS Commands

hadoop fs -mkdir <dir_name>
hadoop fs -ls <dir_name>
hadoop fs -rmr <dir_name>
hadoop fs -put <local_file> <remote_dir>
hadoop fs -get <remote_file> <local_dir>
hadoop fs -cat <remote_file>
hadoop fs -help

Page 40: Hands on Hadoop and pig

Let’s write some code

Page 41: Hands on Hadoop and pig

Count Words Demo

- Create a mapper class; override the map() method
- Create a reducer class; override the reduce() method
- Create a main method
- Create the JAR
- Run it on Hadoop

Page 42: Hands on Hadoop and pig

Map Method

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        // emit (word, 1) for every token in the line
        context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
}

Page 43: Hands on Hadoop and pig

Reduce Method

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    // add up all the counts emitted for this word
    for (IntWritable value : values) {
        sum += value.get();
    }
    context.write(key, new IntWritable(sum));
}

Page 44: Hands on Hadoop and pig

Main Method

Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// submit the job and wait for it to finish
System.exit(job.waitForCompletion(true) ? 0 : 1);

Page 45: Hands on Hadoop and pig

Run it on Hadoop

hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/

Page 46: Hands on Hadoop and pig

Output:

at 1
be 3
can 7
can't 1
code 2
command 1
connect 1
consider 1
continued 1
control 4
could 1
couple 1
courtesy 1
desktop, 1
detailed 1
details 1
…

Page 47: Hands on Hadoop and pig

Pig

Page 48: Hands on Hadoop and pig

What is Pig?

Pig provides an abstraction for processing large datasets.

Consists of:
- Pig Latin: a language to express data flows
- An execution environment

Page 49: Hands on Hadoop and pig

Why do we need Pig?

- MapReduce can get complex if your data needs a lot of processing/transformations
- MapReduce provides primitive data structures
- Pig provides rich data structures
- Supports complex operations like joins

Page 50: Hands on Hadoop and pig

Running Pig programs

- In an interactive shell called Grunt
- As a Pig script
- Embedded into Java programs (like JDBC)

Page 51: Hands on Hadoop and pig

Grunt – Interactive Shell

Page 52: Hands on Hadoop and pig

Grunt shell

- fs commands, like hadoop fs:
  fs -ls
  fs -mkdir <dir>
  fs -copyToLocal <file> <local_dest>
  fs -copyFromLocal <local_file> <dest>
- exec: execute Pig scripts
- sh: execute shell commands

Page 53: Hands on Hadoop and pig

Let’s see them in action

Page 54: Hands on Hadoop and pig

Pig Latin

- LOAD: read files
- DUMP: dump data to the console
- JOIN: join data sets
- FILTER: filter data sets
- ORDER: sort data
- STORE: store data back in files
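These operators chain into a data flow. A minimal sketch of how a script combines them; the file path, schema, and threshold here are hypothetical, not from the talk:

```pig
-- load tab-separated (word, count) records; path and schema are assumed
counts = LOAD 'output/part-r-00000' AS (word:chararray, count:int);
-- keep only the frequent words
frequent = FILTER counts BY count > 5;
-- write the result back to HDFS
STORE frequent INTO 'frequent_words';
```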

Page 55: Hands on Hadoop and pig

Let’s see some code

Page 56: Hands on Hadoop and pig

Sort words based on count
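The live demo was not captured in the transcript. A sketch of what the script might look like, assuming the tab-separated (word, count) output of the earlier Hadoop word-count job:

```pig
-- load the word-count output; path and schema are assumed
counts = LOAD 'output/part-r-00000' AS (word:chararray, count:int);
-- ORDER sorts the whole relation; DESC puts the most frequent words first
sorted = ORDER counts BY count DESC;
DUMP sorted;
```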

Page 57: Hands on Hadoop and pig

Filter words present in a list
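This demo was also not captured. One way such a script might look, assuming the same word-count output plus a one-word-per-line list file (both names are hypothetical): an inner JOIN keeps only the words that appear in the list.

```pig
counts = LOAD 'output/part-r-00000' AS (word:chararray, count:int);
words_list = LOAD 'words_list.txt' AS (word:chararray);
-- inner join drops any word not present in both relations
joined = JOIN counts BY word, words_list BY word;
-- project back to (word, count), disambiguating the joined fields
result = FOREACH joined GENERATE counts::word, counts::count;
DUMP result;
```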

Page 58: Hands on Hadoop and pig

HBase

Page 59: Hands on Hadoop and pig

What is HBase?

- A distributed, column-oriented database built on top of HDFS
- Useful when real-time read/write random access to very large datasets is needed
- Can handle billions of rows with millions of columns

Page 60: Hands on Hadoop and pig

Hive

Page 61: Hands on Hadoop and pig

What is Hive?

- Useful for managing and querying structured data
- Provides SQL-like syntax
- Metadata is stored in an RDBMS
- Extensible with types, functions, scripts, etc.

Page 62: Hands on Hadoop and pig

Hadoop:
- Affordable storage/compute
- Structured or unstructured data
- Resilient
- Auto scalability

Relational databases:
- Interactive response times
- ACID
- Structured data
- Cost/scale prohibitive

Page 63: Hands on Hadoop and pig

Thank You