Pig (Latin) Demo

Presented By: Imranul Hoque

Topics

• Last Seminar:
  – Hadoop Installation
  – Running MapReduce Jobs
  – MapReduce Code
  – Status Monitoring

• Today:
  – Complexity of writing MapReduce programs
  – Pig Latin and Pig
  – Pig Installation
  – Running Pig

Example Problem

• Goal: for each sufficiently large category find the average pagerank of high-pagerank urls in that category

URL                 Category          Pagerank
www.google.com      Search Engine     0.9
www.cnn.com         News              0.8
www.facebook.com    Social Network    0.85
www.foxnews.com     News              0.78
www.foo.com         Blah              0.1
www.bar.com         Blah              0.5

Example Problem (cont’d)

• SQL:
  SELECT category, AVG(pagerank)
  FROM url_table
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 1000000

• MapReduce: ?
• Procedural (MapReduce) vs. Declarative (SQL)
• Pig Latin: Sweet spot between declarative and procedural

[Figure: Pig Latin and the Pig system shown alongside MapReduce and Hadoop]

Pig Latin Solution

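urls = LOAD 'url-table' AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
-- After GROUP, the key field is named "group"; rename it back to category.
-- "output" is a reserved keyword in Pig, so the final alias is named result.
result = FOREACH big_groups GENERATE group AS category, AVG(good_urls.pagerank);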

For each sufficiently large category find the average pagerank of high-pagerank urls in that category
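
The five Pig Latin steps can be mirrored in plain Python to see what each one produces. This is a minimal sketch, not how Pig executes the script: it runs in memory on the six sample rows from the Example Problem slide, and the group-size threshold is lowered from 10^6 to 1 (the `min_count` parameter is an assumption made so the tiny sample yields any output).

```python
from collections import defaultdict

# Sample rows from the Example Problem slide: (url, category, pagerank)
urls = [
    ("www.google.com", "Search Engine", 0.9),
    ("www.cnn.com", "News", 0.8),
    ("www.facebook.com", "Social Network", 0.85),
    ("www.foxnews.com", "News", 0.78),
    ("www.foo.com", "Blah", 0.1),
    ("www.bar.com", "Blah", 0.5),
]


def average_pagerank(rows, min_pagerank=0.2, min_count=1):
    """Average pagerank of high-pagerank urls per sufficiently large category."""
    # FILTER urls BY pagerank > min_pagerank
    good_urls = [row for row in rows if row[2] > min_pagerank]

    # GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # FILTER groups BY COUNT(good_urls) > min_count, then
    # FOREACH ... GENERATE category, AVG(good_urls.pagerank)
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_count
    }


result = average_pagerank(urls)
print(result)
```

With the threshold lowered to 1, only News survives the group-size filter, since it is the only category with more than one url above pagerank 0.2.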

Features

• Dataflow language
  – Find the set of urls that are classified as spam but have a high pagerank score
  – spam_urls = FILTER urls BY isSpam(url);
  – culprit_urls = FILTER spam_urls BY pagerank > 0.8;
• User defined function (UDF)
• Debugging environment
• Nested data model
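
The dataflow style and the role of a UDF can be sketched in Python: each statement names an intermediate result, and the UDF is an ordinary function called inside the filter. Everything below is illustrative. `is_spam` is a hypothetical stand-in for the `isSpam` UDF on the slide, and the sample rows are invented for this sketch.

```python
# Hypothetical stand-in for the isSpam UDF: a real one might run a
# classifier; this one only checks a hard-coded set of urls.
KNOWN_SPAM = {"www.cheap-pills.example", "www.win-prizes.example"}


def is_spam(url):
    return url in KNOWN_SPAM


# Illustrative (url, pagerank) rows, invented for this sketch
urls = [
    ("www.cnn.com", 0.8),
    ("www.cheap-pills.example", 0.9),
    ("www.win-prizes.example", 0.3),
]

# spam_urls = FILTER urls BY isSpam(url);
spam_urls = [row for row in urls if is_spam(row[0])]

# culprit_urls = FILTER spam_urls BY pagerank > 0.8;
culprit_urls = [row for row in spam_urls if row[1] > 0.8]

print(culprit_urls)
```

Because every step has a name, an intermediate result such as `spam_urls` can be inspected on its own, which is the property the debugging environment builds on.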

Pig Latin Commands

load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
dump             Write output to stdout.
limit            Limit the number of records.
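
The difference between cogroup and join is easiest to see side by side. The sketch below is an in-memory Python model of the two commands, not Pig's implementation: `cogroup` and `join` are hypothetical helper names, rows are keyed on their first field by assumption, and the sample data is invented. It models join as a cogroup whose per-key bags are flattened into pairs.

```python
from collections import defaultdict


def cogroup(left, right, key=lambda row: row[0]):
    """Collect rows from both inputs under their shared key."""
    out = defaultdict(lambda: ([], []))
    for row in left:
        out[key(row)][0].append(row)
    for row in right:
        out[key(row)][1].append(row)
    return dict(out)


def join(left, right, key=lambda row: row[0]):
    """Inner equi-join: pair every left row with every right row per key."""
    return [
        l_row + r_row
        for l_rows, r_rows in cogroup(left, right, key).values()
        for l_row in l_rows
        for r_row in r_rows
    ]


# Illustrative inputs: (url, user) visits and (url, pagerank) pages
visits = [
    ("www.cnn.com", "Amy"),
    ("www.snails.com", "Amy"),
    ("www.snails.com", "Fred"),
    ("www.unknown.example", "Bob"),
]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]

grouped = cogroup(visits, pages)
joined = join(visits, pages)
print(grouped["www.unknown.example"])
print(joined)
```

The cogroup result keeps the unmatched url with an empty bag on the pages side, while the join drops it.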

Pig System

[Figure: Pig system architecture. Labels: user, Pig Latin program, Parser, parsed program, Pig Compiler, cross-job optimizer, execution plan, MR Compiler, map-red. jobs, Map-Reduce, Cluster. Example plan shown: inputs X and Y, join, filter, f( ), output.]

MapReduce Compiler

Pig Pen

• Find users who tend to visit “good” pages

Load Visits(user, url, time)
Transform to (user, Canonicalize(url), time)
Load Pages(url, pagerank)
Join url = url
Group by user
Transform to (user, Average(pagerank) as avgPR)
Filter avgPR > 0.5

Load Visits(user, url, time)
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)

Transform to (user, Canonicalize(url), time)
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)

Load Pages(url, pagerank)
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)

Join url = url
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)

Group by user
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })

Transform to (user, Average(pagerank) as avgPR)
  (Amy, 0.65)
  (Fred, 0.4)

Filter avgPR > 0.5
  (Amy, 0.65)

Challenges?
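
The example dataflow on this slide can be replayed in Python using the same sample tuples. This is a sketch of the data transformations only, not of Pig Pen itself. The slide does not define `Canonicalize`, so the `canonicalize` function below is a hypothetical version that strips the scheme and path and adds a `www.` prefix, which is enough to reproduce the slide's intermediate tuples.

```python
from collections import defaultdict


def canonicalize(url):
    # Hypothetical Canonicalize: drop scheme and path, ensure "www." prefix
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


# Load Visits(user, url, time)
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "http://www.snails.com", "9am"),
    ("Fred", "www.snails.com/index.html", "11am"),
]
# Load Pages(url, pagerank)
pages = {"www.cnn.com": 0.9, "www.snails.com": 0.4}

# Transform to (user, Canonicalize(url), time)
canonical = [(user, canonicalize(url), time) for user, url, time in visits]

# Join url = url
joined = [
    (user, url, time, pages[url])
    for user, url, time in canonical
    if url in pages
]

# Group by user
by_user = defaultdict(list)
for user, _url, _time, pagerank in joined:
    by_user[user].append(pagerank)

# Transform to (user, Average(pagerank) as avgPR)
avg_pr = {user: sum(ranks) / len(ranks) for user, ranks in by_user.items()}

# Filter avgPR > 0.5
good_users = {user: avg for user, avg in avg_pr.items() if avg > 0.5}
print(good_users)
```

Only Amy passes the final filter, matching the last tuple set on the slide.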

Installation

• Extract
• Build (ant)
  – In pig-0.1.1 and in tutorial dir
• Environment variables
  – PIGDIR=~/pig-0.1.1
  – HADOOPSITEPATH=~/hadoop-0.18.3/conf

Running Pig

• Two modes:
  – Local mode
  – Hadoop mode
• Three ways to execute:
  – Shell (grunt)
  – Script
  – API (currently Java)
  – GUI (future work)

Running Pig (2)

• Save data into HDFS
  – bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
• Launch shell/Run script
  – java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
• Our script:
  – script1-hadoop.pig
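
When the launch command has to be issued from a program rather than typed, it can be assembled from the two environment variables set during installation. The sketch below only builds the argument list. `pig_command` is a hypothetical helper, the directory paths are placeholders, and it assumes the same `pig.jar` layout as the command on this slide.

```python
import os


def pig_command(script_name, pigdir, hadoop_site_path, exec_mode="mapreduce"):
    """Build the java command line that launches Pig on a script."""
    # Classpath: pig.jar plus the Hadoop conf directory, as on the slide
    classpath = os.pathsep.join(
        [os.path.join(pigdir, "pig.jar"), hadoop_site_path]
    )
    return [
        "java", "-cp", classpath,
        "org.apache.pig.Main", "-x", exec_mode, script_name,
    ]


# Placeholder paths standing in for $PIGDIR and $HADOOPSITEPATH
cmd = pig_command(
    "script1-hadoop.pig",
    pigdir="/home/user/pig-0.1.1",
    hadoop_site_path="/home/user/hadoop-0.18.3/conf",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java, Pig, and Hadoop are installed. Passing `exec_mode="local"` selects local mode instead of Hadoop mode.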

Conclusion

• For more details:
  – http://hadoop.apache.org/core/
  – http://wiki.apache.org/hadoop/
  – http://hadoop.apache.org/pig/
  – http://wiki.apache.org/pig/