Pig (Latin) Demo
Presented By: Imranul Hoque
Topics
• Last Seminar:
  – Hadoop Installation
  – Running MapReduce Jobs
  – MapReduce Code
  – Status Monitoring
• Today:
  – Complexity of writing MapReduce programs
  – Pig Latin and Pig
  – Pig Installation
  – Running Pig
Example Problem
• Goal: for each sufficiently large category, find the average pagerank of high-pagerank urls in that category

URL                 Category          Pagerank
www.google.com      Search Engine     0.9
www.cnn.com         News              0.8
www.facebook.com    Social Network    0.85
www.foxnews.com     News              0.78
www.foo.com         Blah              0.1
www.bar.com         Blah              0.5
Example Problem (cont’d)
• SQL: SELECT category, AVG(pagerank) FROM url-table WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
• MapReduce: ?
• Procedural (MapReduce) vs. Declarative (SQL)
• Pig Latin: Sweet spot between declarative and procedural
[Diagram labels: Pig Latin, Pig System, MapReduce, Hadoop]
Pig Latin Solution
urls = LOAD 'url-table' AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;  -- 10^6
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
For each sufficiently large category find the average pagerank of high-pagerank urls in that category
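To make the dataflow concrete, here is a minimal Python sketch of the same steps, run on the six sample rows from the Example Problem slide. It is only an in-memory illustration, not how Pig executes the script; the function name `average_pagerank` and its parameters are assumptions introduced here. Because the sample is tiny, the demo call lowers the group-size threshold from 10^6 to 1.

```python
from collections import defaultdict

# Sample rows from the example table: (url, category, pagerank).
URLS = [
    ("www.google.com", "Search Engine", 0.9),
    ("www.cnn.com", "News", 0.8),
    ("www.facebook.com", "Social Network", 0.85),
    ("www.foxnews.com", "News", 0.78),
    ("www.foo.com", "Blah", 0.1),
    ("www.bar.com", "Blah", 0.5),
]


def average_pagerank(urls, min_pagerank=0.2, min_group_size=10**6):
    """Hypothetical helper mirroring the five Pig Latin statements."""
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [row for row in urls if row[2] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # big_groups = FILTER groups BY COUNT(good_urls) > threshold;
    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_group_size
    }


# Assumption: threshold lowered to 1 so the tiny sample produces a result.
result = average_pagerank(URLS, min_group_size=1)
print(result)
```

With the lowered threshold only the News category survives (two high-pagerank urls); with the slide's 10^6 threshold the sample yields an empty result.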
Features
• Dataflow language
  – Find the set of urls that are classified as spam but have a high pagerank score
  – spam_urls = FILTER urls BY isSpam(url);
  – culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• User defined function (UDF)
• Debugging environment
• Nested data model
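The two chained filters can be sketched in Python as follows. The `is_spam` function stands in for the slide's `isSpam` UDF, and the host names and pagerank values are invented for illustration; a real UDF would presumably call a classifier rather than look up a fixed set.

```python
# Hypothetical spam list standing in for a real classifier.
KNOWN_SPAM_HOSTS = {"www.pills.example", "www.lottery.example"}


def is_spam(url):
    """Stand-in for the isSpam UDF: a user-supplied predicate on one field."""
    return url in KNOWN_SPAM_HOSTS


# Made-up (url, pagerank) records, not taken from the slides.
urls = [
    ("www.news.example", 0.8),
    ("www.pills.example", 0.95),
    ("www.lottery.example", 0.1),
]

# spam_urls = FILTER urls BY isSpam(url);
spam_urls = [(url, pagerank) for url, pagerank in urls if is_spam(url)]

# culprit_urls = FILTER spam_urls BY pagerank > 0.8;
culprit_urls = [(url, pagerank) for url, pagerank in spam_urls if pagerank > 0.8]

print(culprit_urls)
```

Each statement names an intermediate result, which is what makes the program read as a step-by-step dataflow rather than a single declarative query.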
Pig Latin Commands

load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
dump             Write output to stdout.
limit            Limit the number of records.
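The relationship between group, cogroup, and join is easier to see with a small in-memory sketch. The Python helpers below are assumptions written for this illustration, not part of Pig; they model bags as lists and treat join as a cogroup followed by a cross product of the two bags for each key.

```python
from collections import defaultdict


def group(records, key):
    """GROUP: collect records sharing a key into one bag per key."""
    bags = defaultdict(list)
    for rec in records:
        bags[key(rec)].append(rec)
    return dict(bags)


def cogroup(left, right, left_key, right_key):
    """COGROUP: one (left bag, right bag) pair per key seen in either input."""
    lg, rg = group(left, left_key), group(right, right_key)
    return {k: (lg.get(k, []), rg.get(k, [])) for k in sorted(set(lg) | set(rg))}


def join(left, right, left_key, right_key):
    """JOIN: cross product of the two bags of each cogroup, flattened."""
    cogrouped = cogroup(left, right, left_key, right_key)
    return [l + r for lbag, rbag in cogrouped.values() for l in lbag for r in rbag]


# Illustrative data: (user, url) visits and (url, pagerank) pages.
visits = [("Amy", "www.cnn.com"), ("Fred", "www.snails.com")]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4), ("www.foo.com", 0.1)]

cogrouped = cogroup(visits, pages, lambda v: v[1], lambda p: p[0])
joined = join(visits, pages, lambda v: v[1], lambda p: p[0])
print(joined)
```

A key that appears in only one input, such as www.foo.com here, still gets a cogroup entry with one empty bag, but contributes no rows to the join.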
Pig System
[Figure: Pig system pipeline. user → Pig Latin program → Parser → parsed program → Pig Compiler (with cross-job optimizer) → execution plan → MR Compiler → map-red. jobs → Map-Reduce → Cluster. Example plan labels: X, Y, filter, f( ), join, output.]
MapReduce Compiler
Pig Pen
• Find users who tend to visit “good” pages
  – Load Visits(user, url, time)
  – Transform to (user, Canonicalize(url), time)
  – Load Pages(url, pagerank)
  – Join url = url
  – Group by user
  – Transform to (user, Average(pagerank) as avgPR)
  – Filter avgPR > 0.5
Example data at each step:

Load Visits(user, url, time)
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)
Transform to (user, Canonicalize(url), time)
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)
Load Pages(url, pagerank)
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)
Join url = url
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)
Group by user
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgPR)
  (Amy, 0.65)
  (Fred, 0.4)
Filter avgPR > 0.5
  (Amy, 0.65)
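The "good pages" dataflow can be traced end to end with a short Python sketch using the slide's sample tuples. The `canonicalize` function is a guess at what Canonicalize does (strip the scheme and path, ensure a www. prefix), chosen only because it reproduces the sample data; the real UDF is not shown in the slides.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical Canonicalize: drop scheme and path, ensure a www. prefix."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


# Load Visits(user, url, time) and Load Pages(url, pagerank).
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "http://www.snails.com", "9am"),
    ("Fred", "www.snails.com/index.html", "11am"),
]
pages = {"www.cnn.com": 0.9, "www.snails.com": 0.4}

# Transform to (user, Canonicalize(url), time).
canonical = [(user, canonicalize(url), time) for user, url, time in visits]

# Join url = url.
joined = [(user, url, time, pages[url]) for user, url, time in canonical if url in pages]

# Group by user.
by_user = defaultdict(list)
for record in joined:
    by_user[record[0]].append(record)

# Transform to (user, Average(pagerank) as avgPR).
avg_pr = [(user, sum(r[3] for r in recs) / len(recs)) for user, recs in by_user.items()]

# Filter avgPR > 0.5.
good_users = [(user, avg) for user, avg in avg_pr if avg > 0.5]
print(good_users)
```

Only Amy passes the final filter, matching the last step of the slide's example.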
Challenges?
Installation
• Extract
• Build (ant)
  – In pig-0.1.1 and in tutorial dir
• Environment variables
  – PIGDIR=~/pig-0.1.1
  – HADOOPSITEPATH=~/hadoop-0.18.3/conf
Running Pig
• Two modes:
  – Local mode
  – Hadoop mode
• Three ways to execute:
  – Shell (grunt)
  – Script
  – API (currently Java)
  – GUI (future work)
Running Pig (2)
• Save data into HDFS
  – bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
• Launch shell/Run script
  – java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
• Our script:
  – script1-hadoop.pig
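As a rough sketch of how the launch command is assembled for the two modes, the Python helper below builds the argument list without running anything. The function `pig_command` and its default paths are assumptions for illustration; the `-x local` form, which leaves the Hadoop configuration directory off the classpath, follows the Pig tutorial's local-mode invocation as best recalled and should be checked against the installed version.

```python
import os


def pig_command(script, mode="mapreduce",
                pig_dir="~/pig-0.1.1", hadoop_conf="~/hadoop-0.18.3/conf"):
    """Hypothetical helper: build the java argv that launches Pig on a script."""
    if mode not in ("local", "mapreduce"):
        raise ValueError("mode must be 'local' or 'mapreduce'")
    classpath = os.path.join(os.path.expanduser(pig_dir), "pig.jar")
    if mode == "mapreduce":
        # Hadoop mode needs the cluster configuration on the classpath.
        classpath += ":" + os.path.expanduser(hadoop_conf)
    return ["java", "-cp", classpath, "org.apache.pig.Main", "-x", mode, script]


# Build (but do not execute) the command for the demo script.
command = pig_command("script1-hadoop.pig", pig_dir="/opt/pig",
                      hadoop_conf="/opt/hadoop/conf")
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` on a machine where Pig and Hadoop are installed at those paths.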
Conclusion
• For more details:
  – http://hadoop.apache.org/core/
  – http://wiki.apache.org/hadoop/
  – http://hadoop.apache.org/pig/
  – http://wiki.apache.org/pig/