
Page 1: Pig Latin: A Not-So-Foreign Language for Data Processing

Pig Latin: A Not-So-Foreign Language for Data Processing

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
SIGMOD’08

Presented by Sandeep Patidar

Modified from original Pig Latin talk

Page 2: Pig Latin: A Not-So-Foreign Language for Data Processing

2

Outline
• Map-Reduce and the Need for Pig Latin
• Pig Latin example
• Feature and Motivation
• Pig Latin Implementation
• Debugging Environment
• Usage Scenarios
• Related Work
• Future Work

Page 3: Pig Latin: A Not-So-Foreign Language for Data Processing

3

Data Processing Renaissance
• Internet companies are swimming in data, e.g. TBs/day at Yahoo!
• Data analysis is the “inner loop” of product innovation
• Data analysts are skilled programmers

Page 4: Pig Latin: A Not-So-Foreign Language for Data Processing

4

Data Warehousing …?

Scale: Often not scalable enough

$$$$: Prohibitively expensive at web scale
• Up to $200K/TB

SQL:
• Little control over execution method
• Query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs

Page 5: Pig Latin: A Not-So-Foreign Language for Data Processing

5

New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad

Page 6: Pig Latin: A Not-So-Foreign Language for Data Processing

6

Map-Reduce
• Map: performs the group-by
• Reduce: performs the aggregation

These are two high-level declarative primitives that enable parallel processing

Page 7: Pig Latin: A Not-So-Foreign Language for Data Processing

7

Execution overview of Map-Reduce [2]

1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a task.

3) A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding these locations to the reduce workers.

Page 8: Pig Latin: A Not-So-Foreign Language for Data Processing

8

Execution overview of Map-Reduce [2]

5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.

6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.

7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns to the user code.
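The seven steps can be imitated in a single process to make the data movement concrete. The following is only a sketch in Python, not how Hadoop or Google's library is implemented: there is no cluster, disk, or RPC, and the names `run_map_reduce`, `count_map`, and `count_reduce` are made up for illustration.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn, num_splits=2, num_reducers=2):
    """Single-process imitation of steps 1-7 (no real cluster, disk, or RPC)."""
    # Step 1: split the input into M pieces (here: round-robin slices of a list).
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Steps 3-4: each map task emits intermediate pairs, partitioned into R regions.
    regions = [[] for _ in range(num_reducers)]
    for split in splits:
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                regions[hash(out_key) % num_reducers].append((out_key, out_value))

    # Steps 5-6: each reduce task sorts its region by key, then reduces each key once.
    output = {}
    for region in regions:
        grouped = defaultdict(list)
        for out_key, out_value in sorted(region, key=lambda pair: pair[0]):
            grouped[out_key].append(out_value)
        for out_key, values in grouped.items():
            output[out_key] = reduce_fn(out_key, values)
    return output  # Step 7: control returns to the caller.

def count_map(key, value):
    # Emit one intermediate pair per input record.
    yield key, 1

def count_reduce(key, values):
    # Aggregate all values that share a key.
    return sum(values)

records = [("k1", "v1"), ("k2", "v2"), ("k1", "v3"), ("k2", "v4"), ("k1", "v5")]
counts = run_map_reduce(records, count_map, count_reduce)
```

Because every pair with the same key lands in the same region, each key is reduced exactly once regardless of how the input was split.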

Page 9: Pig Latin: A Not-So-Foreign Language for Data Processing

9

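[Figure: Map-Reduce data flow. Input records (k1 v1), (k2 v2), (k1 v3), (k2 v4), (k1 v5) pass through two map tasks; the intermediate pairs are grouped by key into (k1: v1, v3, v5) and (k2: v2, v4); two reduce tasks produce the output records.]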

Page 10: Pig Latin: A Not-So-Foreign Language for Data Processing

10

Map-Reduce Appeal

Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions

$: Runs on cheap commodity hardware

SQL: Procedural control, a processing “pipe”

Page 11: Pig Latin: A Not-So-Foreign Language for Data Processing

11

Limitations of Map-Reduce

1. Extremely rigid data flow
• Other flows constantly hacked in: Join, Union, Split, Chains
[Figure: the basic M → R flow, with the hacked-in variants drawn as combinations of M and R stages]

2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize

Page 12: Pig Latin: A Not-So-Foreign Language for Data Processing

12

Pros And Cons

Need a high-level, general data flow language
• High-level declarative language
• Low-level procedural language

Page 13: Pig Latin: A Not-So-Foreign Language for Data Processing

13

Enter Pig Latin

Need a high-level, general data flow language: Pig Latin

Page 15: Pig Latin: A Not-So-Foreign Language for Data Processing

15

Pig Latin Example 1

Suppose we have a table

urls: (url, category, pagerank)

A simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category:

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Page 16: Pig Latin: A Not-So-Foreign Language for Data Processing

16

Equivalent Pig Latin program

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
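To see what each step of this program produces, the same pipeline can be traced in plain Python. This is only a sketch of the semantics, not how Pig executes the program; the function name, the `example.org` row, and the lowered `min_count` used in the sample call are assumptions made so the example fits on a slide.

```python
from collections import defaultdict

def average_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [u for u in urls if u["pagerank"] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for u in good_urls:
        groups[u["category"]].append(u)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    big_groups = {c: bag for c, bag in groups.items() if len(bag) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {c: sum(u["pagerank"] for u in bag) / len(bag)
            for c, bag in big_groups.items()}

urls = [
    {"url": "cnn.com", "category": "News", "pagerank": 0.9},
    {"url": "bbc.com", "category": "News", "pagerank": 0.8},
    {"url": "flickr.com", "category": "Photos", "pagerank": 0.7},
    {"url": "example.org", "category": "News", "pagerank": 0.1},  # made-up low-rank row
]
# min_count is lowered to 1 so that the tiny sample has a "big" category.
result = average_pagerank_of_big_categories(urls, min_count=1)
```

Each intermediate name corresponds to one Pig Latin statement, which is the step-by-step style the talk contrasts with a single SQL block.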

Page 17: Pig Latin: A Not-So-Foreign Language for Data Processing

17

Data Flow
1. Filter good_urls by pagerank > 0.2
2. Group by category
3. Filter category by count > 10^6
4. Foreach category generate avg. pagerank

Page 18: Pig Latin: A Not-So-Foreign Language for Data Processing

18

Example Data Analysis Task

Find the top 10 most visited pages in each category

Visits
User  Url         Time
Amy   cnn.com     8:00
Amy   bbc.com     10:00
Amy   flickr.com  10:05
Fred  cnn.com     12:00

Url Info
Url         Category  PageRank
cnn.com     News      0.9
bbc.com     News      0.8
flickr.com  Photos    0.7
espn.com    Sports    0.9

Page 19: Pig Latin: A Not-So-Foreign Language for Data Processing

19

Data Flow
1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url
6. Group by category
7. Foreach category generate top10 urls

Page 20: Pig Latin: A Not-So-Foreign Language for Data Processing

20

In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
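For readers who want to check the expected result on the sample tables, the same flow can be mimicked in Python. This is a sketch of the semantics only; `top_urls_per_category` is a made-up name, and the join is assumed to be an inner join, so urls with no visits drop out.

```python
from collections import Counter, defaultdict

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url, then count each group
    visit_counts = Counter(url for _user, url, _time in visits)

    # join the counts with urlInfo on url (inner join: unvisited urls are dropped)
    joined = [(url, category, visit_counts[url])
              for url, category, _p_rank in url_info if url in visit_counts]

    # group by category, then keep the n most visited urls of each group
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))
    return {category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
            for category, rows in by_category.items()}

visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]
top = top_urls_per_category(visits, url_info)
```

On the sample data, espn.com has no visits, so the Sports category does not appear in the output at all.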

Page 22: Pig Latin: A Not-So-Foreign Language for Data Processing

22

Dataflow Language

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!

User specifies a sequence of steps where each step specifies only a single high-level data transformation

Page 23: Pig Latin: A Not-So-Foreign Language for Data Processing

23

Step by step execution

Pig Latin programs supply an explicit sequence of operations, but the operations need not be executed in that order.

e.g., the set of urls of pages that are classified as spam but have a high pagerank score

isSpam might be an expensive UDF. In that case, it is much better to filter the urls by pagerank first.

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
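The benefit of reordering can be seen by counting how often the expensive function is called. The sketch below is in Python and uses a made-up `fake_is_spam` that records its calls; it illustrates the rewrite an optimizer could apply and is not Pig's optimizer.

```python
def culprit_urls(urls, is_spam, min_pagerank=0.8):
    # Cheap comparison first; the expensive UDF only sees the survivors.
    high_rank = [u for u in urls if u["pagerank"] > min_pagerank]
    return [u for u in high_rank if is_spam(u["url"])]

calls = []

def fake_is_spam(url):
    # Hypothetical stand-in for an expensive classifier; it logs every call.
    calls.append(url)
    return url.endswith(".biz")

urls = [
    {"url": "a.biz", "pagerank": 0.9},
    {"url": "b.com", "pagerank": 0.95},
    {"url": "c.biz", "pagerank": 0.1},
]
result = culprit_urls(urls, fake_is_spam)
```

Both orders return the same rows because the two filters are independent, but here the classifier runs on two urls instead of three.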

Page 24: Pig Latin: A Not-So-Foreign Language for Data Processing

24

Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

• Operates directly over files
• Schemas optional; can be assigned dynamically

gVisits = group visits by $1;
where $1 uses positional notation to refer to the second field

Page 25: Pig Latin: A Not-So-Foreign Language for Data Processing

25

Nested Data Model
• Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples
• A nested model is closer to how programmers think than normalized (1NF) data
• Avoids expensive joins for web-scale data
• Allows programmers to easily write UDFs

Page 26: Pig Latin: A Not-So-Foreign Language for Data Processing

26

UDFs as First-Class Citizens

User-Defined Functions (UDFs) can be used in every construct
• Load, Store, Group, Filter, Foreach

Example 2
Suppose we want to find, for each category, the top 10 urls according to pagerank

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

Page 28: Pig Latin: A Not-So-Foreign Language for Data Processing

28

Data Model
• Atom: contains a simple atomic value
• Tuple: a sequence of fields
• Bag: a collection of tuples with possible duplicates

[Figure: example values ‘alice’, ‘lakers’, ‘ipod’, with one marked as an atom and a group of them marked as a tuple]

Page 29: Pig Latin: A Not-So-Foreign Language for Data Processing

29

• Map: a collection of data items, where each item has an associated key through which it can be looked up
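The four types nest freely inside each other. A rough analogy in Python, with made-up sample values, is shown below; Python lists and dicts only approximate Pig's bags and maps (a bag has no defined order, for instance).

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields.
fan_tuple = ("alice", "lakers")

# Bag: a collection of tuples; duplicates are allowed.
bag = [("alice", "lakers"), ("alice", "ipod"), ("alice", "lakers")]

# Map: items looked up through an associated key; values may themselves be bags.
profile = {"fan of": [("lakers",), ("ipod",)], "age": 20}

# Full nesting: a tuple whose fields are an atom, a bag, and a map.
nested = ("alice", bag, profile)
fan_of = nested[2]["fan of"]
```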

Page 30: Pig Latin: A Not-So-Foreign Language for Data Processing

30

Pig Latin Commands

Specifying Input Data: LOAD

queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

Per-tuple Processing: FOREACH

expand_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

Page 31: Pig Latin: A Not-So-Foreign Language for Data Processing

31

Pig Latin Commands (Cont.)

Discarding Unwanted Data: FILTER

real_queries = FILTER queries BY userId neq 'bot';

or

real_queries = FILTER queries BY NOT isBot(userId);

Filtering conditions involve a combination of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT

Page 32: Pig Latin: A Not-So-Foreign Language for Data Processing

32

Expressions in Pig Latin

Page 33: Pig Latin: A Not-So-Foreign Language for Data Processing

33

Example of flattening in FOREACH
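Since the slide's figure did not survive in this transcript, here is a small Python sketch of what flattening does: when a UDF returns a bag for each input tuple, FLATTEN turns each element of that bag into its own output tuple. The helper `foreach_flatten` and the stand-in `expand_query` are hypothetical, as is the sample data.

```python
def foreach_flatten(rows, udf):
    """Mimics: FOREACH rows GENERATE userId, FLATTEN(udf(queryString))."""
    out = []
    for user_id, query in rows:
        for expanded in udf(query):  # udf returns a bag (here: a list) of strings
            out.append((user_id, expanded))
    return out

def expand_query(query):
    # Hypothetical stand-in for the expandQuery UDF.
    return [query, query + " review"]

queries = [("alice", "lakers"), ("bob", "ipod")]
flat = foreach_flatten(queries, expand_query)
```

Without the flattening step each output tuple would carry the whole bag as a single nested field.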

Page 34: Pig Latin: A Not-So-Foreign Language for Data Processing

34

Pig Latin Commands (Cont.)

Getting Related Data Together: COGROUP

Suppose we have two data sets
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)

grouped_data = COGROUP result BY queryString, revenue BY queryString;
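A COGROUP keeps, for every key, one bag per input; an equi-join is what results when the bags of each group are crossed with one another. The Python sketch below illustrates that relationship with made-up sample rows; `cogroup` and `join_from_cogroup` are hypothetical helper names, not Pig functions.

```python
from collections import defaultdict

def cogroup(result, revenue):
    """For each queryString, collect one bag of result rows and one of revenue rows."""
    groups = defaultdict(lambda: ([], []))
    for row in result:
        groups[row[0]][0].append(row)
    for row in revenue:
        groups[row[0]][1].append(row)
    return dict(groups)

def join_from_cogroup(grouped):
    # An equi-join is the cross product of the two bags within each group.
    return [r + v
            for res_bag, rev_bag in grouped.values()
            for r in res_bag
            for v in rev_bag]

result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]
grouped = cogroup(result, revenue)
joined = join_from_cogroup(grouped)
```

Keeping the bags separate lets a UDF process each group as a whole, which a flat join result would not allow directly.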

Page 35: Pig Latin: A Not-So-Foreign Language for Data Processing

35

COGROUP versus JOIN

Page 36: Pig Latin: A Not-So-Foreign Language for Data Processing

36

Pig Latin Example 3

Suppose we were trying to attribute search revenue to search-result urls to figure out the monetary worth of each url.

url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));

where distributeRevenue is a UDF that accepts the search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

Page 37: Pig Latin: A Not-So-Foreign Language for Data Processing

37

Pig Latin Commands (Cont.)

Special case of COGROUP: GROUP

grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;

JOIN in Pig Latin

join_result = JOIN result BY queryString, revenue BY queryString;

Page 38: Pig Latin: A Not-So-Foreign Language for Data Processing

38

Pig Latin Commands (Cont.)

Map-Reduce in Pig Latin

map_result = FOREACH input GENERATE FLATTEN(map(*));

key_group = GROUP map_result BY $0;

output = FOREACH key_group GENERATE reduce(*);

Page 39: Pig Latin: A Not-So-Foreign Language for Data Processing

39

Pig Latin Commands (Cont.)

Other Commands
• UNION: Returns the union of two or more bags
• CROSS: Returns the cross product
• ORDER: Orders a bag by the specified field(s)
• DISTINCT: Eliminates duplicate tuples in a bag

Nested Operations
Pig Latin allows some commands to be nested within a FOREACH command

Page 40: Pig Latin: A Not-So-Foreign Language for Data Processing

40

Pig Latin Commands (Cont.)

Asking for Output: STORE

The user can ask for the result of a Pig Latin expression sequence to be materialized to a file

STORE query_revenue INTO 'myoutput' USING myStore();

myStore is a custom serializer. For plain text files, it can be omitted.

Page 42: Pig Latin: A Not-So-Foreign Language for Data Processing

42

Implementation

[Figure: layered stack with the labels USER, SQL, Pig, Hadoop Map-Reduce, and cluster; an “automatic rewrite + optimize” step sits between layers, and “or” arrows show the alternative levels at which the user can enter the stack.]

Pig is open-source. http://incubator.apache.org/pig

Page 43: Pig Latin: A Not-So-Foreign Language for Data Processing

43

Building a Logical Plan
• The Pig interpreter first parses each Pig Latin command and verifies that the input files and bags being referred to are valid
• It builds a logical plan for every bag that the user defines
• Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
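The idea of recording a plan and deferring all work until STORE can be sketched in a few lines of Python. This is an illustration of lazy evaluation in general, not Pig's interpreter; the `Bag` class, `load`, and `noisy_double` are all made up for the example.

```python
class Bag:
    """Hypothetical handle for a bag: it records a plan instead of holding data."""

    def __init__(self, plan):
        self.plan = plan  # list of (operator, argument) steps

    def filter(self, predicate):
        return Bag(self.plan + [("FILTER", predicate)])

    def foreach(self, fn):
        return Bag(self.plan + [("FOREACH", fn)])

    def store(self):
        """Only here is the recorded plan actually executed."""
        op, rows = self.plan[0]
        assert op == "LOAD"
        rows = list(rows)
        for op, arg in self.plan[1:]:
            if op == "FILTER":
                rows = [r for r in rows if arg(r)]
            elif op == "FOREACH":
                rows = [arg(r) for r in rows]
        return rows

def load(rows):
    return Bag([("LOAD", rows)])

evaluated = []

def noisy_double(x):
    # Records each call so we can observe when work really happens.
    evaluated.append(x)
    return 2 * x

doubled = load([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).foreach(noisy_double)
nothing_ran_yet = (evaluated == [])
stored = doubled.store()
```

Because the whole plan is known before anything runs, a system built this way has the chance to reorder or combine steps before execution.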

Page 44: Pig Latin: A Not-So-Foreign Language for Data Processing

44

Map-Reduce Plan Compilation
• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into map and reduce phases
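The boundary rule can be sketched as a small function that cuts a list of operations into jobs. This Python sketch assumes a purely linear plan given as strings (so a second LOAD feeding a join is not modelled) and places every operation that follows a boundary into that job's reduce phase; `compile_to_stages` is a made-up name.

```python
def compile_to_stages(operations):
    """Cut a linear list of operation names into map-reduce jobs."""
    boundaries = ("GROUP", "COGROUP", "JOIN")
    jobs = [{"map": [], "shuffle": None, "reduce": []}]
    for op in operations:
        job = jobs[-1]
        if op.split()[0].upper() in boundaries:
            if job["shuffle"] is not None:
                # The current job already has its boundary: start a new job.
                job = {"map": [], "shuffle": None, "reduce": []}
                jobs.append(job)
            job["shuffle"] = op
        elif job["shuffle"] is None:
            job["map"].append(op)      # before the boundary: map phase
        else:
            job["reduce"].append(op)   # after the boundary: reduce phase
    return jobs

plan = ["FILTER pagerank > 0.2", "GROUP BY category",
        "FILTER count > 10^6", "FOREACH generate avg"]
jobs = compile_to_stages(plan)

visit_plan = ["LOAD visits", "GROUP BY url", "FOREACH count",
              "JOIN ON url", "GROUP BY category", "FOREACH top10"]
visit_jobs = compile_to_stages(visit_plan)
```

The first plan has one boundary and becomes a single job; the second has three boundaries and becomes three jobs.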

Page 45: Pig Latin: A Not-So-Foreign Language for Data Processing

45

Compilation into Map-Reduce

1. Filter good_urls by pagerank > 0.2
2. Group by category
3. Filter category by count > 10^6
4. Foreach category generate avg. pagerank

This flow compiles into a single map-reduce job: Map1, Reduce1

• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into map and reduce phases

Page 46: Pig Latin: A Not-So-Foreign Language for Data Processing

46

Compilation into Map-Reduce

1. Load Visits
2. Group by url
3. Foreach url generate count
4. Load Url Info
5. Join on url
6. Group by category
7. Foreach category generate top10(urls)

This flow compiles into three map-reduce jobs: Map1/Reduce1, Map2/Reduce2, Map3/Reduce3

• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into map and reduce phases

Page 47: Pig Latin: A Not-So-Foreign Language for Data Processing

47

Efficiency With Nested Bags
• The (CO)GROUP command places tuples belonging to the same group into one or more nested bags
• The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine’s main memory
• One common case is where the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
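An aggregate such as AVG can be computed from small partial states, so the full bag for a group never has to exist in memory at once. The Python sketch below illustrates this with a (sum, count) state; the function names and the chunked input format are assumptions made for the example, not Pig's UDF interface.

```python
def avg_initial(value):
    # Partial state for a single value: (sum, count).
    return (value, 1)

def avg_combine(a, b):
    # Merge two partial states; the order of merging does not matter.
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    return state[0] / state[1]

def streaming_average(groups_of_chunks):
    """Average per key from chunks, never holding a whole group's bag."""
    partial = {}
    for key, chunk in groups_of_chunks:
        for value in chunk:
            state = avg_initial(value)
            partial[key] = avg_combine(partial[key], state) if key in partial else state
    return {key: avg_final(state) for key, state in partial.items()}

# The "News" values arrive in two separate chunks, as they might from two map tasks.
chunks = [("News", [0.9]), ("Photos", [0.7]), ("News", [0.8, 0.4])]
averages = streaming_average(chunks)
```

Only one small state per key is kept, which is why this class of functions also suits a map-side combiner.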

Page 48: Pig Latin: A Not-So-Foreign Language for Data Processing

48

Debugging Environment
• The process of constructing a Pig Latin program is iterative:
  • User makes an initial stab at writing a program
  • Submits it to the system for execution
  • Inspects the output
• To avoid the inefficiency of re-running on the full data at every iteration, users often create a side data set
• Unfortunately, this method does not always work well
• Pig comes with a debugging environment called Pig Pen, which creates the side data set automatically

Page 49: Pig Latin: A Not-So-Foreign Language for Data Processing

49

Pig Pen screen shot

Page 50: Pig Latin: A Not-So-Foreign Language for Data Processing

50

Generating a Sandbox Data Set

There are three primary objectives in selecting a sandbox data set:
• Realism: the sandbox data set should be a subset of the actual data set
• Conciseness: the example bags should be as small as possible
• Completeness: the example bags should collectively illustrate the key semantics of each command

Page 51: Pig Latin: A Not-So-Foreign Language for Data Processing

51

Usage Scenarios

Session Analysis:
• Web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate
  • How long is the average user session?
  • How many links does a user click on before leaving a website?
  • How do click patterns vary in the course of a day/week/month?
• Analysis tasks mainly consist of grouping the activity log by user and/or website

• First production release about a year ago
• At Yahoo!, 30% of all Hadoop jobs are run with Pig

Page 52: Pig Latin: A Not-So-Foreign Language for Data Processing

52

Related Work

Sawzall
• Scripting language used at Google on top of map-reduce
• Rigid structure consisting of a filtering phase followed by an aggregation phase

DryadLINQ
• SQL-like language on top of Dryad, used at Microsoft

Nested Data Models
• Explored before in the context of object-oriented databases
• Also explored in data-parallel languages over nested data, e.g., NESL

Page 53: Pig Latin: A Not-So-Foreign Language for Data Processing

53

Future Work

Safe Optimizer
• Performs only high-confidence rewrites

User Interface
• “Boxes and arrows” GUI
• Promote collaboration, sharing code fragments and UDFs

External Functions
• Tight integration with a scripting language such as Perl or Python

Unified Environment

Page 54: Pig Latin: A Not-So-Foreign Language for Data Processing

54

Summary

• Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But Map-Reduce is too low-level and rigid

Pig Latin: sweet spot between map-reduce and SQL

Page 55: Pig Latin: A Not-So-Foreign Language for Data Processing

55

References

[1] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.

[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.

[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt

Page 56: Pig Latin: A Not-So-Foreign Language for Data Processing

56

Thank you