big data on public cloud using cloudera on gogrid & amazon emr
DESCRIPTION
Introduction to Hadoop Using Cloudera Live & Amazon EMR: Hand on Lab by IMC InstituteTRANSCRIPT
![Page 1: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/1.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 1
Big Data on Public CloudUsing Cloudera on GoGrid
& Amazon EMR
Hands On WorkshopOctober 2014
Dr.Thanachart NumnondaIMC Institute
Modifiy from Original Version by Danairat T.Certified Java Programmer, TOGAF – Silver
![Page 2: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/2.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running Hadoop
![Page 3: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/3.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Running this lab using Cloudera Live
![Page 4: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/4.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Cloudera Live on GoGrid
![Page 5: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/5.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Alternatively Using Cloudera VM
![Page 6: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/6.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Hue on Cloudera
![Page 7: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/7.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Viewing HDFS
![Page 8: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/8.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
![Page 9: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/9.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Importing/Exporting Data to HDFS
![Page 10: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/10.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Importing Data to Hadoop
Download War and Peace Full Text
www.gutenberg.org/ebooks/2600
$hadoop fs -mkdir input
$hadoop fs -mkdir output
$hadoop fs -copyFromLocal Downloads/pg2600.txt input
![Page 11: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/11.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Review file in Hadoop HDFS
[hdadmin@localhost bin]$ hadoop fs -cat input/pg2600.txt
List HDFS File
Read HDFS File
Retrieve HDFS File to Local File System
Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
[hdadmin@localhost bin]$ hadoop fs -copyToLocal input/pg2600.txt tmp/file.txt
![Page 12: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/12.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Review file in Hadoop HDFS using File Browse
![Page 13: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/13.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Review file in Hadoop HDFS using Hue
![Page 14: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/14.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hadoop Port Numbers
Daemon Default Port
Configuration Parameter in conf/*-site.xml
HDFS Namenode 50070 dfs.http.address
Datanodes 50075 dfs.datanode.http.address
Secondarynamenode 50090 dfs.secondary.http.address
MR JobTracker 50030 mapred.job.tracker.http.address
Tasktrackers 50060 mapred.task.tracker.http.address
![Page 15: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/15.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Removing data from HDFS using Shell Command
hdadmin@localhost detach]$ hadoop fs -rm input/pg2600.txt
Deleted hdfs://localhost:54310/input/pg2600.txt
hdadmin@localhost detach]$
![Page 16: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/16.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture: Understanding Map Reduce Processing
Client
Name Node Job Tracker
Data Node
Task Tracker
Data Node
Task Tracker
Data Node
Task Tracker
Map Reduce
![Page 17: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/17.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
High Level Architecture of MapReduce
![Page 18: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/18.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 18
Before MapReduce…
● Large scale data processing was difficult!– Managing hundreds or thousands of processors– Managing parallelization and distribution– I/O Scheduling– Status and monitoring– Fault/crash tolerance
● MapReduce provides all of these, easily!
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 19: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/19.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 19
MapReduce Overview
● What is it?– Programming model used by Google– A combination of the Map and Reduce models with an
associated implementation– Used for processing and generating large data sets
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 20: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/20.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 20
MapReduce Overview
● How does it solve our previously mentioned problems?– MapReduce is highly scalable and can be used across many
computers.– Many small machines can be used to process jobs that
normally could not be processed by a large machine.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 21: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/21.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
MapReduce Framework
Source: www.bigdatauniversity.com
![Page 22: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/22.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
How does the MapReduce work?
Output in a list of (Key, List of Values)
in the intermediate file
Sorting
Partitioning
Output in a list of (Key, Value)
in the intermediate file
InputSplit
RecordReader
RecordWriter
![Page 23: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/23.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
How does the MapReduce work?
Output in a list of (Key, List of Values)
in the intermediate file
Sorting
Partitioning
Output in a list of (Key, Value)
in the intermediate file
InputSplit
RecordReader
RecordWriter
![Page 24: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/24.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 24
Map Abstraction
● Inputs a key/value pair– Key is a reference to the input value– Value is the data set on which to operate
● Evaluation– Function defined by user– Applies to every value in value input
● Might need to parse input● Produces a new list of key/value pairs
– Can be different type from input pair
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 25: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/25.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 25
Reduce Abstraction
● Starts with intermediate Key / Value pairs● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key● Iterator supplies the values for a given key to the
Reduce function.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 26: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/26.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 26
Reduce Abstraction
● Typically a function that:– Starts with a large number of key/value pairs
● One key/value for each word in all files being greped (including multiple entries for the same word)
– Ends with very few key/value pairs● One key/value for each unique word across all the files with
the number of instances summed into this entry● Broken up so a given worker works with input of the
same key.
Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html
![Page 27: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/27.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 27
How Map and Reduce Work Together
● Map returns information● Reduces accepts information● Reduce applies a user defined function to reduce the
amount of data
![Page 28: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/28.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 28
Other Applications
● Yahoo!– Webmap application uses Hadoop to create a database of
information on all known webpages● Facebook
– Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace– Analyzes sever log files and usage data using Hadoop
![Page 29: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/29.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
MapReduce Framework
map: (K1, V1) -> list(K2, V2))
reduce: (K2, list(V2)) -> list(K3, V3)
![Page 30: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/30.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Writing you own Map Reduce Program
![Page 31: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/31.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)1. package org.myorg;
2.
3. import java.io.IOException; 4. import java.util.*;
5.
6. import org.apache.hadoop.fs.Path; 7. import org.apache.hadoop.conf.*; 8. import org.apache.hadoop.io.*; 9. import org.apache.hadoop.mapred.*; 10. import org.apache.hadoop.util.*;
11.
12. public class WordCount {
13.
14. public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
15. private final static IntWritable one = new IntWritable(1); 16. private Text word = new Text();
17.
18. public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
19. String line = value.toString(); 20. StringTokenizer tokenizer = new StringTokenizer(line); 21. while (tokenizer.hasMoreTokens()) { 22. word.set(tokenizer.nextToken()); 23. output.collect(word, one); 24. } 25. } 26. }
![Page 32: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/32.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)
27.
28. public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
29. public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
30. int sum = 0; 31. while (values.hasNext()) { 32. sum += values.next().get(); 33. } 34. output.collect(key, new IntWritable(sum)); 35. } 36. }
37.
![Page 33: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/33.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Wordcount (HelloWord in Hadoop)
38. public static void main(String[] args) throws Exception { 39. JobConf conf = new JobConf(WordCount.class); 40. conf.setJobName("wordcount");
41.
42. conf.setOutputKeyClass(Text.class); 43. conf.setOutputValueClass(IntWritable.class);
44.
45. conf.setMapperClass(Map.class); 46. 47. conf.setReducerClass(Reduce.class);
48.
49. conf.setInputFormat(TextInputFormat.class); 50. conf.setOutputFormat(TextOutputFormat.class);
51.
52. FileInputFormat.setInputPaths(conf, new Path(args[0])); 53. FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55. JobClient.runJob(conf); 57. } 58. }
59.
![Page 34: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/34.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime
Environment
![Page 35: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/35.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Packaging Map Reduce Program
Usage
Assuming we have already downloaded wordcount.jar into the folder /Downloads:
![Page 36: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/36.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Job in Hue
![Page 37: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/37.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Job in Hue
![Page 38: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/38.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Output Result
![Page 39: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/39.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Output Result
![Page 40: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/40.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing MapReduce Output Result
![Page 41: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/41.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Running MapReduce Job Using Oozie: Select Java
![Page 42: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/42.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running WordCount.jar on Amazon EMR
![Page 43: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/43.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Architecture Overview of Amazon EMR
![Page 44: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/44.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating an AWS account
![Page 45: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/45.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Signing up for the necessary services
● Simple Storage Service (S3)● Elastic Compute Cloud (EC2)● Elastic MapReduce (EMR)
Caution! This costs real money!
![Page 46: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/46.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating Amazon S3 bucket
![Page 47: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/47.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Upload .jar file and input file to Amazon S3
1. Select <yourbucket> in Amazon S3 service
2. Create folder : applications
3. Upload wordcount.jar to the applications folder
4. Create another folder: input
5. Upload input_test.txt to the input folder
![Page 48: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/48.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Create a new cluster in EMR
![Page 49: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/49.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Assign a cluster name and log location
![Page 50: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/50.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Assign hardware configuration
![Page 51: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/51.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Select Custom JAR
![Page 52: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/52.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Input JAR Location and Arguments
![Page 53: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/53.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Select Create Cluster
![Page 54: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/54.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Monitoring the cluster
![Page 55: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/55.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
View Result from the S3 bucket
![Page 56: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/56.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
LectureUnderstanding Hive
![Page 57: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/57.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
IntroductionA Petabyte Scale Data Warehouse Using Hadoop
Hive is developed by Facebook, designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL
![Page 58: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/58.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
What Hive is NOT
Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).
![Page 59: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/59.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 59
Hive Metastore
● Store Hive metadata
● Configurations
– Embedded: in-process metastore, in-process database
– Local: in-process metastore, out-of-process database
– Remote: out-of-process metastore,out-of-process database
![Page 60: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/60.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 60
Hive Schema-On-Read
● Faster loads into the database (simply copy or move)
● Slower queries
● Flexibility – multiple schemas for the same data
![Page 61: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/61.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 61
HiveQL
● Hive Query Language● SQL dialect● No support for:
– UPDATE, DELETE
– Transactions
– Indexes
– HAVING clause in SELECT
– Updateable or materialized views
– Srored procedure
![Page 62: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/62.jpg)
Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 62
Hive Tables
● Managed- CREATE TABLE
– LOAD- File moved into Hive's data warehouse directory
– DROP- Both data and metadata are deleted.
● External- CREATE EXTERNAL TABLE
– LOAD- No file moved
– DROP- Only metadata deleted
– Use when sharing data between Hive and Hadoop applications
or you want to use multiple schema on the same data
![Page 63: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/63.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Running Hive
Hive Shell
● Interactive
hive● Script
hive -f myscript● Inline
hive -e 'SELECT * FROM mytable'
Hive.apache.org
![Page 64: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/64.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
System Architecture and Components
• Metastore: To store the meta data.• Query compiler and execution engine: To convert SQL queries to a
sequence of map/reduce jobs that are then executed on Hadoop.• SerDe and ObjectInspectors: Programmable interfaces and
implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to Mysql command line.
hive.apache.org
![Page 65: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/65.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Architecture Overview
HDFS
Hive CLIQueriesBrowsing
Map Reduce
MetaStore
Thrift API
SerDeThrift Jute JSON..
Execution
Hive QL
Parser
Planner
Mgm
t.
Web
UI
HDFS
DDL
Hive
Hive.apache.org
![Page 66: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/66.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Sample HiveQL
The Query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following query
SELECT * FROM t where t.c = 'xyz'
SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)
SELECT t1.c1, count(1) from t1 group by t1.c1
Hive.apache.org
![Page 67: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/67.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Creating Table and Retrieving Data using Hive
![Page 68: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/68.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Running Hive from terminal
Starting Hive
hive> quit;
Quit from Hive
![Page 69: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/69.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Hive Editor from Hue
Scroll Downthe web page
![Page 70: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/70.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Creating Hive Table
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 4.069 seconds
hive (default)> show tables;
OK
test_tbl
Time taken: 0.138 seconds
hive (default)> describe test_tbl;
OK
id int
country string
Time taken: 0.147 seconds
hive (default)>
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
![Page 71: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/71.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Hue Query Editor
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
![Page 72: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/72.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Hue Query Editor
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
![Page 73: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/73.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Hue Query Editor
See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html
![Page 74: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/74.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing Hive Table in HDFS
![Page 75: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/75.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Alter and Drop Hive Table
hive (default)> alter table test_tbl add columns (remarks STRING);
hive (default)> describe test_tbl;
OK
id int
country string
remarks string
Time taken: 0.077 seconds
hive (default)> drop table test_tbl;
OK
Time taken: 0.9 seconds
See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html
![Page 76: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/76.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Loading Data to Hive Table
$ hive
hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Creating Hive table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/country.csv' INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data.csv
Copying file: file:/tmp/test_tbl_data.csv
Loading data to table default.test_tbl
OK
Time taken: 0.241 seconds
hive (default)>
Loading data to Hive table
![Page 77: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/77.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Querying Data from Hive Table
hive (default)> select * from test_tbl;
OK
1 USA
62 Indonesia
63 Philippines
65 Singapore
66 Thailand
Time taken: 0.287 seconds
hive (default)>
![Page 78: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/78.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Insert Overwriting the Hive Table
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data_updated.csv' overwrite INTO TABLE test_tbl;
Copying data from file:/tmp/test_tbl_data_updated.csv
Copying file: file:/tmp/test_tbl_data_updated.csv
Loading data to table default.test_tbl
Deleted hdfs://localhost:54310/user/hive/warehouse/test_tbl
OK
Time taken: 0.204 seconds
hive (default)>
![Page 79: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/79.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
LectureUnderstanding Pig
![Page 80: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/80.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
IntroductionA high-level platform for creating MapReduce programs Using Hadoop
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
![Page 81: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/81.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Pig Components
● Two Compnents● Language (Pig Latin)● Compiler
● Two Execution Environments● Local
pig -x local● Distributed
pig -x mapreduce
Hive.apache.org
![Page 82: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/82.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Running Pig
● Script
pig myscript● Command line (Grunt)
pig● Embedded
Writing a java program
Hive.apache.org
![Page 83: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/83.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Pig Latin
Hive.apache.org
![Page 84: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/84.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Pig Execution Stages
Hive.apache.orgSource Introduction to Apache Hadoop-Pig: PrashantKommireddi
![Page 85: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/85.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Why Pig?
● Makes writing Hadoop jobs easier● 5% of the code, 5% of the time● You don't need to be a programmer to write Pig scripts
● Provide major functionality required for DatawareHouse and Analytics● Load, Filter, Join, Group By, Order, Transform
● User can write custom UDFs (User Defined Function)
Hive.apache.orgSource Introduction to Apache Hadoop-Pig: PrashantKommireddi
![Page 86: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/86.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hive.apache.org
![Page 87: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/87.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running a Pig script
![Page 88: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/88.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Pig Command Line
[hdadmin@localhost ~]$ pig -x local
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.1 (r1459641) compiled Mar 22 2013, 02:13:53
2013-08-01 10:29:00,027 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hdadmin/pig_1375327740024.log
2013-08-01 10:29:00,066 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hdadmin/.pigbootup not found
2013-08-01 10:29:00,212 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
![Page 89: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/89.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting Pig from Hue
![Page 90: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/90.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
countryFilter.pig
A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);B = FILTER A BY gni > 2000;C = ORDER B BY gni;dump C;
#Preparing Data
Download hdi-data.csv
#Edit Your Script
[hdadmin@localhost ~]$ cd Downloads/
[hdadmin@localhost ~]$ vi countryFilter.pig
Writing a Pig Script
![Page 91: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/91.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
[hdadmin@localhost ~]$ cd Downloads
[hdadmin@localhost ~]$ pig -x local
grunt > run countryFilter.pig
....
(150,Cameroon,0.482,51,5,10,2031)
(126,Kyrgyzstan,0.615,67,9,12,2036)
(156,Nigeria,0.459,51,5,8,2069)
(154,Yemen,0.462,65,2,8,2213)
(138,Lao People's Democratic Republic,0.524,67,4,9,2242)
(153,Papua New Guinea,0.466,62,4,5,2271)
(165,Djibouti,0.43,57,3,5,2335)
(129,Nicaragua,0.589,74,5,10,2430)
(145,Pakistan,0.504,65,4,6,2550)
Running a Pig Script
![Page 92: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/92.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Lecture: Understanding Sqoop
![Page 93: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/93.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Introduction
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
• Imports individual tables or entire databases to files in HDFS
• Generates Java classes to allow you to interact with your imported data
• Provides the ability to import from SQL databases straight into your Hive data warehouse
See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
![Page 94: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/94.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Architecture Overview
Hive.apache.org
![Page 95: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/95.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Loading Data from DBMS to Hadoop HDFS
![Page 96: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/96.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Loading Data into MySQL DB
[root@localhost ~]# service mysqld start
Starting mysqld: [ OK ]
[root@localhost ~]# mysql
mysql> create database countrydb;
Query OK, 1 row affected (0.00 sec)
mysql> use countrydb;
Database changed
mysql> create table country_tbl(id INT, country VARCHAR(100));
Query OK, 0 rows affected (0.02 sec)
mysql> LOAD DATA LOCAL INFILE '/tmp/country_data.csv' INTO TABLE country_tbl FIELDS terminated by ',' LINES TERMINATED BY '\r\n';
Query OK, 237 rows affected, 4 warnings (0.00 sec)
Records: 237 Deleted: 0 Skipped: 0 Warnings: 0
![Page 97: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/97.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Loading Data into MySQL DB
mysql> select * from country_tbl;
+------+------------------------------+
| id | country |
+------+------------------------------+
| 93 | Afghanistan |
| 355 | Albania |
| 213 | Algeria |
| 1684 | AmericanSamoa |
| 376 | Andorra |
...
+------+------------------------------+
237 rows in set (0.00 sec)
mysql>
Testing data query from MySQL DB
![Page 98: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/98.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Installing DB driver for Sqoop
[hdadmin@localhost conf]$ su -
Password:
[root@localhost ~]# cp /01_hadoop_workshop/03_sqoop1.4.2/mysql-connector-java-5.0.8-bin.jar /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/
[root@localhost ~]# chown hdadmin:hadoop /usr/local/sqoop-1.4.2.bin__hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar
[root@localhost ~]# exit
![Page 99: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/99.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Importing data from MySQL to Hive Table
[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://localhost/countrydb --username root -P --table country_tbl --hive-import --hive-table country_tbl -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Enter password: <enter here>13/03/21 18:07:43 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
13/03/21 18:07:43 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
13/03/21 18:07:43 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
13/03/21 18:07:43 INFO tool.CodeGenTool: Beginning code generation
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1
13/03/21 18:07:44 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1
13/03/21 18:07:44 INFO orm.CompilationManager: HADOOP_HOME is /usr/local/hadoop/libexec/..
Note: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
13/03/21 18:07:44 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdadmin/compile/0b65b003bf2936e1303f5edf93338215/country_tbl.jar
13/03/21 18:07:44 WARN manager.MySQLManager: It looks like you are importing from mysql.
13/03/21 18:07:44 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
13/03/21 18:07:44 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
13/03/21 18:07:44 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
13/03/21 18:07:44 INFO mapreduce.ImportJobBase: Beginning import of country_tbl
13/03/21 18:07:45 INFO mapred.JobClient: Running job: job_201303211744_0001
13/03/21 18:07:46 INFO mapred.JobClient: map 0% reduce 0%
13/03/21 18:08:02 INFO mapred.JobClient: map 100% reduce 0%
13/03/21 18:08:07 INFO mapred.JobClient: Job complete: job_201303211744_0001
13/03/21 18:08:07 INFO mapred.JobClient: Counters: 18
13/03/21 18:08:07 INFO mapred.JobClient: Job Counters
13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12154
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/21 18:08:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/21 18:08:07 INFO mapred.JobClient: Launched map tasks=1
13/03/21 18:08:07 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/03/21 18:08:07 INFO mapred.JobClient: File Output Format Counters
13/03/21 18:08:07 INFO mapred.JobClient: Bytes Written=3252
13/03/21 18:08:07 INFO mapred.JobClient: FileSystemCounters
13/03/21 18:08:07 INFO mapred.JobClient: HDFS_BYTES_READ=87
13/03/21 18:08:07 INFO mapred.JobClient: FILE_BYTES_WRITTEN=30018
13/03/21 18:08:07 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3252
13/03/21 18:08:07 INFO mapred.JobClient: File Input Format Counters
13/03/21 18:08:07 INFO mapred.JobClient: Bytes Read=0
13/03/21 18:08:07 INFO mapred.JobClient: Map-Reduce Framework
13/03/21 18:08:07 INFO mapred.JobClient: Map input records=235
13/03/21 18:08:07 INFO mapred.JobClient: Physical memory (bytes) snapshot=37916672
13/03/21 18:08:07 INFO mapred.JobClient: Spilled Records=0
13/03/21 18:08:07 INFO mapred.JobClient: CPU time spent (ms)=210
13/03/21 18:08:07 INFO mapred.JobClient: Total committed heap usage (bytes)=16252928
13/03/21 18:08:07 INFO mapred.JobClient: Virtual memory (bytes) snapshot=375439360
13/03/21 18:08:07 INFO mapred.JobClient: Map output records=235
13/03/21 18:08:07 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
13/03/21 18:08:07 INFO mapreduce.ImportJobBase: Transferred 3.1758 KB in 22.965 seconds (141.6067 bytes/sec)
13/03/21 18:08:07 INFO mapreduce.ImportJobBase: Retrieved 235 records.
13/03/21 18:08:07 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `country_tbl` AS t LIMIT 1
13/03/21 18:08:07 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hdadmin/country_tbl/_logs
13/03/21 18:08:07 INFO hive.HiveImport: Loading uploaded data into Hive
13/03/21 18:08:08 INFO hive.HiveImport: Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties
13/03/21 18:08:08 INFO hive.HiveImport: Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211808_1458043622.txt
13/03/21 18:08:12 INFO hive.HiveImport: OK
13/03/21 18:08:12 INFO hive.HiveImport: Time taken: 4.079 seconds
13/03/21 18:08:13 INFO hive.HiveImport: Loading data to table default.country_tbl
13/03/21 18:08:13 INFO hive.HiveImport: OK
13/03/21 18:08:13 INFO hive.HiveImport: Time taken: 0.253 seconds
13/03/21 18:08:13 INFO hive.HiveImport: Hive import complete.
13/03/21 18:08:13 INFO hive.HiveImport: Export directory is empty, removing it.
[hdadmin@localhost ~]$
![Page 100: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/100.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing data from Hive Table
[hdadmin@localhost ~]$ hive
Logging initialized using configuration in file:/usr/local/hive-0.9.0-bin/conf/hive-log4j.properties
Hive history file=/tmp/hdadmin/hive_job_log_hdadmin_201303211810_964909984.txt
hive (default)> show tables;
OK
country_tbl
test_tbl
Time taken: 2.566 seconds
hive (default)> select * from country_tbl;
OK
93 Afghanistan
355 Albania
.....
Time taken: 0.587 seconds
hive (default)> quit;
[hdadmin@localhost ~]$
![Page 101: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/101.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Reviewing HDFS Database Table files
Start Web Browser to http://localhost:50070/ then navigate to /user/hive/warehouse
![Page 102: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/102.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
LectureUnderstanding HBase
![Page 103: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/103.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
IntroductionAn open source, non-relational, distributed database
HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
![Page 104: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/104.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
HBase Features
● Hadoop database modelled after Google's Bigtab;e● Column oriented data store, known as Hadoop Database● Support random realtime CRUD operations (unlike
HDFS)● No SQL Database● Opensource, written in Java● Run on a cluster of commodity hardware
Hive.apache.org
![Page 105: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/105.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
When to use Hbase?
● When you need high volume data to be stored ● Un-structured data● Sparse data● Column-oriented data● Versioned data (same data template, captured at various
time, time-elapse data)● When you need high scalability
Hive.apache.org
![Page 106: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/106.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Which one to use?
● HDFS● Only append dataset (no random write)● Read the whole dataset (no random read)
● HBase● Need random write and/or read● Has thousands of operation per second on TB+ of data
● RDBMS● Data fits on one big node● Need full transaction support● Need real-time query capabilities
Hive.apache.org
![Page 107: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/107.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
![Page 108: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/108.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
![Page 109: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/109.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
HBase Components
Hive.apache.org
● Region● Row of table are stores
● Region Server● Hosts the tables
● Master● Coordinating the Region
Servers● ZooKeeper● HDFS● API
● The Java Client API
![Page 110: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/110.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
HBase Shell Commands
Hive.apache.org
![Page 111: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/111.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Hands-On: Running HBase
![Page 112: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/112.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Starting HBase shell
[hdadmin@localhost ~]$ start-hbase.sh
starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out
[hdadmin@localhost ~]$ jps
3064 TaskTracker
2836 SecondaryNameNode
2588 NameNode
3513 Jps
3327 HMaster
2938 JobTracker
2707 DataNode
[hdadmin@localhost ~]$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013
hbase(main):001:0>
![Page 113: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/113.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Create a table and insert data in HBase
hbase(main):009:0> create 'test', 'cf'
0 row(s) in 1.0830 seconds
hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'
0 row(s) in 0.0750 seconds
hbase(main):011:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1375363287644, value=val1
1 row(s) in 0.0640 seconds
hbase(main):002:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1375363287644, value=val1
1 row(s) in 0.0370 seconds
![Page 114: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/114.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Data Browsers in Hue for HBase
![Page 115: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/115.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Using Data Browsers in Hue for HBase
![Page 116: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/116.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Recommendation to Further Study
![Page 117: Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR](https://reader033.vdocuments.site/reader033/viewer/2022061204/547e859ab37959822b8b5468/html5/thumbnails/117.jpg)
Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop
Thank you
www.imcinstitute.comwww.facebook.com/imcinstitute