Hands on Hadoop and Pig
More details at http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig
BigData using Hadoop and Pig
Sudar Muthu
Research Engineer, Yahoo Labs
http://sudarmuthu.com
http://twitter.com/sudarmuthu
Who am I?
Research Engineer at Yahoo Labs
Mines useful information from huge datasets
Has worked on both structured and unstructured data
Builds robots as a hobby ;)
What will we see today?
What is BigData?
Get our hands dirty with Hadoop
See some code
Try out Pig
A glimpse of HBase and Hive
What is BigData?
"Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools."
http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1 GB today is not the same as 1 GB just 10 years ago
Anything that doesn't fit into the RAM of a single machine
Types of Big Data
Data in movement (streams)
Twitter/Facebook comments
Stock market data
Access logs of a busy web server
Sensors: vital signs of a newborn
Data at rest (oceans)
Collection of what has streamed
Emails or IM messages
Social media
Unstructured documents: forms, claims
We have all this data and need to find a way to process it
Traditional way of scaling (Scaling up)
Make the machine more powerful
Add more RAM
Add more cores to CPU
It is going to be very expensive
Will be limited by disk seek and read time
Single point of failure
New way of scaling (Scaling out)
Add more instances of the same machine
Cost is less compared to scaling up
Immune to failure of a single node or a set of nodes
Disk seek and write time is not going to be the bottleneck
Future-safe (to some extent)
Is it fit for ALL types of problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant grid operating system for data storage and processing
What is Hadoop?
Runs on commodity hardware
HDFS: fault-tolerant, high-bandwidth clustered storage
MapReduce: distributed data processing
Works with structured and unstructured data
Open source, Apache license
Master (NameNode) – slave architecture
Design Principles
System shall manage and heal itself
Performance shall scale linearly
Algorithm should move to data – lower latency, lower bandwidth
Simple core, modular and extensible
Components of Hadoop
HDFS
MapReduce
Pig
HBase
Hive
Getting started with Hadoop
What am I not going to cover?
Installation or setup of Hadoop
Will be running all the code on a single-node instance
Monitoring of the clusters
Performance tuning
User authentication or quota
Before we get into code, let's understand some concepts
MapReduce
Framework for distributed processing of large datasets
MapReduce
Consists of two functions
Map – filters and transforms the input into something the reducer can understand
Reduce – aggregates over the input provided by the Map function
Formal definition
Map: <k1, v1> -> list(<k2, v2>)
Reduce: <k2, list(v2)> -> list(<k3, v3>)
Let’s see some examples
Count number of words in files
Map: <file_name, file_contents> => list(<word, count>)
Reduce: <word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map: <"file1", "to be or not to be"> => {<"to",1>, <"be",1>, <"or",1>, <"not",1>, <"to",1>, <"be",1>}
Count number of words in files
Reduce: {<"to",<1,1>>, <"be",<1,1>>, <"or",<1>>, <"not",<1>>}
=>
{<"to",2>, <"be",2>, <"or",1>, <"not",1>}
Max temperature in a year
Map: <file_name, file_contents> => list(<year, temp>)
Reduce: <year, list(temp)> => <year, max_temp>
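The slides only give the signatures for this one; below is a minimal Java sketch of such a mapper and reducer, assuming each input line holds a year and a temperature separated by a tab (the record format and class names are illustrative, not fixed):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits <year, temperature> for every input line of the form "1950<TAB>22"
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t");
        if (parts.length == 2) {
            context.write(new Text(parts[0]),
                    new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}

// Keeps only the maximum temperature seen for each year
class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}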
HDFS
HDFS
Distributed file system
Data is distributed over different nodes
Will be replicated for failover
All of this is abstracted away from the algorithms
HDFS Commands
HDFS Commands
hadoop fs -mkdir <dir_name>
hadoop fs -ls <dir_name>
hadoop fs -rmr <dir_name>
hadoop fs -put <local_file> <remote_dir>
hadoop fs -get <remote_file> <local_dir>
hadoop fs -cat <remote_file>
hadoop fs -help
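The same operations are also available from Java through Hadoop's FileSystem API; a minimal sketch (the paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location from the usual Hadoop configuration files
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("input"));                                    // hadoop fs -mkdir input
        fs.copyFromLocalFile(new Path("words.txt"), new Path("input"));  // hadoop fs -put words.txt input
        for (FileStatus status : fs.listStatus(new Path("input"))) {     // hadoop fs -ls input
            System.out.println(status.getPath());
        }
    }
}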
Let’s write some code
Count Words Demo
Create a mapper class
Override the map() method
Create a reducer class
Override the reduce() method
Create a main method
Create a JAR
Run it on Hadoop
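A minimal sketch of the mapper and reducer class declarations from the first steps above (the class names follow the main method shown later; the package imports and generic key/value types are the usual ones for this job):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The map() method from the next slide lives inside this class
public class CountWordsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
}

// The reduce() method lives inside this class
class CountWordsReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
}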
Map Method
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line into words and emit <word, 1> for each one
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
}
Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    // Add up all the 1s emitted for this word
    int sum = 0;
    for (IntWritable value : values) {
        sum += value.get();
    }
    context.write(key, new IntWritable(sum));
}
Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Submit the job and block until it finishes
System.exit(job.waitForCompletion(true) ? 0 : 1);
Run it on Hadoop
hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at 1
be 3
can 7
can't 1
code 2
command 1
connect 1
consider 1
continued 1
control 4
could 1
couple 1
courtesy 1
desktop, 1
detailed 1
details 1
…
Pig
What is Pig?
Pig provides an abstraction for processing large datasets
Consists of:
Pig Latin – a language to express data flows
An execution environment
Why do we need Pig?
MapReduce can get complex if your data needs a lot of processing/transformations
MapReduce provides only primitive data structures
Pig provides rich data structures
Supports complex operations like joins
Running Pig programs
In an interactive shell called Grunt
As a Pig script
Embedded into Java programs (like JDBC)
Grunt – Interactive Shell
Grunt shell
fs commands – like hadoop fs:
fs -ls
fs -mkdir
fs -copyToLocal <file>
fs -copyFromLocal <local_file> <dest>
exec – execute Pig scripts
sh – execute shell commands
Let’s see them in action
Pig Latin
LOAD – read data from files
DUMP – dump data to the console
JOIN – join data sets
FILTER – filter data sets
ORDER – sort data
STORE – store data back into files
Let’s see some code
Sort words based on count
Filter words present in a list
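The demo scripts themselves aren't in the transcript; here is a rough sketch of what the sort-by-count flow could look like, run via the embedded-Java option mentioned earlier (the input path, field names, local mode and the count > 1 filter are all assumptions for illustration):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class SortWordsByCount {
    public static void main(String[] args) throws Exception {
        // Local mode is enough for a demo; use ExecType.MAPREDUCE on a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load the <word, count> pairs produced by the CountWords job
        pig.registerQuery("counts = LOAD 'output/part-r-00000' AS (word:chararray, count:int);");
        // Sort words by their count, highest first
        pig.registerQuery("sorted = ORDER counts BY count DESC;");
        // Keep only the words that appear more than once (a simple FILTER)
        pig.registerQuery("frequent = FILTER sorted BY count > 1;");
        // Write the result back to the file system
        pig.store("frequent", "sorted_words");
    }
}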
HBase
What is HBase?
Distributed, column-oriented database built on top of HDFS
Useful when real-time read/write random-access to very large datasets is needed.
Can handle billions of rows with millions of columns
Hive
What is Hive?
Useful for managing and querying structured data
Provides SQL-like syntax
Metadata is stored in an RDBMS
Extensible with types, functions, scripts, etc.
Hadoop
Affordable storage/compute
Structured or unstructured data
Resilient, auto scalability
Relational Databases
Interactive response times
ACID
Structured data
Cost/scale prohibitive
Thank You