Pig Latin: A Not-So-Foreign Language for Data Processing

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
SIGMOD’08

Presented by Sandeep Patidar

Modified from original Pig Latin talk

2

Outline
• Map-Reduce and the Need for Pig Latin
• Pig Latin Example
• Features and Motivation
• Pig Latin
• Implementation
• Debugging Environment
• Usage Scenarios
• Related Work
• Future Work

3

Data Processing Renaissance

• Internet companies are swimming in data (e.g., TBs/day at Yahoo!)
• Data analysis is the “inner loop” of product innovation
• Data analysts are skilled programmers

4

Data Warehousing …?

Scale: Often not scalable enough

$ $ $ $: Prohibitively expensive at web scale
• Up to $200K/TB

SQL:
• Little control over execution method
• Query optimization is hard
  • Parallel environment
  • Little or no statistics
  • Lots of UDFs

5

New Systems For Data Analysis

Map-Reduce

Apache Hadoop

Dryad

6

Map-Reduce

• Map: Performs the group by
• Reduce: Performs the aggregation

These are two high-level declarative primitives that enable parallel processing

7

Execution overview of Map-Reduce [2]

1) The Map-Reduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece. It then starts up many copies of the program on a cluster of machines.

2) One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a task.

3) A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

4) Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

8

Execution overview of Map-Reduce [2]

5) When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys. The sorting is needed because typically many different keys map to the same reduce task.

6) The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to the final output file for this reduce partition.

7) When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the Map-Reduce call in the user program returns back to the user code.
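The steps above can be sketched in a few lines of Python. This is only a single-process illustration, not how Hadoop or the library in [2] is implemented: the names `run_map_reduce`, `word_count_map`, and `word_count_reduce` are made up for this sketch, and the master, workers, local disks, and remote reads are all collapsed into ordinary loops.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn, num_splits=2, num_reducers=2):
    """Single-process sketch of the map-reduce flow (no real parallelism)."""
    # Step 1: split the input into M pieces.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Steps 3-4: each map task emits intermediate key/value pairs, which are
    # partitioned into R regions by a partitioning function (here: hash mod R).
    regions = [[] for _ in range(num_reducers)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[hash(key) % num_reducers].append((key, value))

    # Steps 5-6: each reduce task sorts its region by key, then passes every
    # unique key and its list of values to the user's reduce function.
    output = {}
    for region in regions:
        grouped = defaultdict(list)
        for key, value in sorted(region, key=lambda kv: kv[0]):
            grouped[key].append(value)
        for key, values in grouped.items():
            output[key] = reduce_fn(key, values)
    return output


def word_count_map(line):
    # Emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    # Aggregate all counts seen for one word.
    return sum(counts)


word_counts = run_map_reduce(["a b a", "b c"], word_count_map, word_count_reduce)
```

Word counting is used here only because it is the smallest job with both a grouping step and an aggregation step.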

9

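[Figure: Map-Reduce data flow. The input records (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5) pass through the map tasks and are grouped by key into k1: v1, v3, v5 and k2: v2, v4; the reduce tasks then produce the output records.]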

10

Map-Reduce Appeal

Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions

$: Runs on cheap commodity hardware

SQL: Procedural control, i.e. a processing “pipe”

11

Limitations of Map-Reduce

1. Extremely rigid data flow
• Other flows constantly hacked in: join, union, split, chains

[Figure: the basic M → R flow next to the hacked-in variants, including chains of M and R stages]

2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct

3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize

12

Pros and Cons

Need a high-level, general data flow language

• High-level declarative language
• Low-level procedural language

13

Enter Pig Latin

Need a high-level, general data flow language: Pig Latin

14


15

Pig Latin Example 1

Suppose we have a table

urls: (url, category, pagerank)

Simple SQL query that finds,

For each sufficiently large category, the average pagerank of high-pagerank urls in that category

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

16

Equivalent Pig Latin program

good_urls = FILTER urls BY pagerank > 0.2;

groups = GROUP good_urls BY category;

big_groups = FILTER groups BY COUNT(good_urls) > 10^6;

output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
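To make the four steps concrete, here is a rough Python equivalent of the program above, run on in-memory tuples. It is a sketch of the semantics only: the function name and its parameters are invented for illustration, and the 10^6 threshold is exposed as `min_count` so the example can be tried on a handful of rows.

```python
from collections import defaultdict


def avg_pagerank_of_big_categories(urls, min_pagerank=0.2, min_count=10**6):
    """urls is a list of (url, category, pagerank) tuples."""
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [row for row in urls if row[2] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for _url, category, pagerank in good_urls:
        groups[category].append(pagerank)

    # big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {
        category: sum(ranks) / len(ranks)
        for category, ranks in groups.items()
        if len(ranks) > min_count
    }


sample_urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.25),
    ("c.com", "news", 0.1),    # dropped by the pagerank filter
    ("d.com", "sports", 0.8),  # its category is too small for min_count=1
]
example_output = avg_pagerank_of_big_categories(sample_urls, min_count=1)
```

Each Python statement corresponds to one Pig Latin step, which is the point of the dataflow style: every intermediate result has a name and can be inspected.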

17

Data Flow

Filter good_urls by pagerank > 0.2
→ Group by category
→ Filter category by count > 10^6
→ Foreach category generate avg. pagerank


18

Example Data Analysis Task

Find the top 10 most visited pages in each category

Visits

User   Url          Time
Amy    cnn.com      8:00
Amy    bbc.com      10:00
Amy    flickr.com   10:05
Fred   cnn.com      12:00

Url Info

Url          Category   PageRank
cnn.com      News       0.9
bbc.com      News       0.8
flickr.com   Photos     0.7
espn.com     Sports     0.9

19

Data Flow

Load Visits → Group by url → Foreach url generate count
Load Url Info
→ Join on url
→ Group by category
→ Foreach category generate top10 urls

20

In Pig Latin

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
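As a cross-check of what this program computes, the following Python sketch applies the same steps to the Visits and Url Info rows shown on the earlier slide. The function name `top_urls_by_category` is invented for this illustration, and it assumes each url appears at most once in Url Info.

```python
from collections import Counter, defaultdict


def top_urls_by_category(visits, url_info, n=10):
    # gVisits = group visits by url;
    # visitCounts = foreach gVisits generate url, count(visits);
    visit_counts = Counter(url for _user, url, _time in visits)

    # visitCounts = join visitCounts by url, urlInfo by url;
    # (an inner join: urls without any visit drop out)
    joined = [
        (url, category, visit_counts[url])
        for url, category, _prank in url_info
        if url in visit_counts
    ]

    # gCategories = group visitCounts by category;
    by_category = defaultdict(list)
    for url, category, count in joined:
        by_category[category].append((url, count))

    # topUrls = foreach gCategories generate top(visitCounts, 10);
    return {
        category: sorted(rows, key=lambda row: row[1], reverse=True)[:n]
        for category, rows in by_category.items()
    }


visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]
top_urls = top_urls_by_category(visits, url_info)
```

On this sample, espn.com has no visits, so the Sports category does not appear in the result.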

21


22

Dataflow Language

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”

Jasmine Novak, Engineer, Yahoo!

User specifies a sequence of steps where each step specifies only a single high-level data transformation

23

Step-by-Step Execution

A Pig Latin program supplies an explicit sequence of operations, but it is not necessary that the operations be executed in that order.

E.g., find the set of urls of pages that are classified as spam but have a high pagerank score:

spam_urls = FILTER urls BY isSpam(url);

culprit_urls = FILTER spam_urls BY pagerank > 0.8;

isSpam might be an expensive UDF, in which case it is much better to filter the urls by pagerank first.
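The benefit of reordering the two filters can be seen in a small Python sketch. `CountingSpamCheck` is a hypothetical stand-in for an expensive isSpam UDF (it only looks for the substring "spam" and counts how often it is called), and `culprit_urls` is an illustrative function, not part of Pig.

```python
class CountingSpamCheck:
    """Hypothetical stand-in for an expensive isSpam UDF; counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, url):
        self.calls += 1
        return "spam" in url


def culprit_urls(urls, is_spam, min_pagerank=0.8, cheap_filter_first=True):
    """urls is a list of (url, pagerank) tuples; both orders give the same rows."""
    if cheap_filter_first:
        # Reordered plan: the cheap pagerank test shrinks the input first.
        candidates = [row for row in urls if row[1] > min_pagerank]
        return [row for row in candidates if is_spam(row[0])]
    # Order as written in the program: isSpam runs on every url.
    spam_urls = [row for row in urls if is_spam(row[0])]
    return [row for row in spam_urls if row[1] > min_pagerank]


sample = [("a.com", 0.9), ("b.com", 0.1), ("spam.com", 0.95), ("junk-spam.com", 0.2)]

written_order_check = CountingSpamCheck()
written_order = culprit_urls(sample, written_order_check, cheap_filter_first=False)

reordered_check = CountingSpamCheck()
reordered = culprit_urls(sample, reordered_check, cheap_filter_first=True)
```

Both plans return the same rows, but the reordered plan calls the expensive function only on the urls that survive the pagerank filter.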

24

Quick Start and Interoperability

visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);

• Operates directly over files
• Schemas are optional and can be assigned dynamically
• Without a schema, fields can be referenced by position, e.g. gVisits = group visits by $1; where $1 uses positional notation to refer to the second field

25

Nested Data Model

• Pig Latin has a flexible, fully nested data model (described later) that allows complex, non-atomic data types such as sets, maps, and tuples
• A nested model is closer to how programmers think than a normalized (1NF) model
• Avoids expensive joins for web-scale data
• Allows programmers to easily write UDFs

26

UDFs as First-Class Citizens

User-Defined Functions (UDFs) can be used in every construct: Load, Store, Group, Filter, Foreach

Example 2

Suppose we want to find, for each category, the top 10 urls according to pagerank

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

27


28

Data Model

• Atom: contains a simple atomic value
• Tuple: a sequence of fields
• Bag: a collection of tuples with possible duplicates

[Figure: example values 'alice', 'lanker', 'ipod', labeled as atoms and tuples]

29

• Map: a collection of data items, where each item has an associated key through which it can be looked up
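One way to picture the four types is to mirror them with Python built-ins. The mapping below is an analogy chosen for this sketch, not how Pig represents data internally; the values reuse the slide's examples, and the key names in the map are made up.

```python
# Atom: a simple atomic value.
atom = "alice"

# Tuple: a sequence of fields; a field may itself be a nested tuple.
simple_tuple = ("alice", "lanker")
nested_tuple = ("alice", ("ipod", "apple"))

# Bag: a collection of tuples with possible duplicates.
# A list is used instead of a set so that duplicates are kept.
bag = [simple_tuple, simple_tuple, nested_tuple]

# Map: a collection of data items, each looked up through its key.
# Values can be any of the types above, including a whole bag.
data_map = {"fan of": bag, "age": 20}
```

Because bags and maps can hold tuples that in turn hold bags or maps, the model is fully nested rather than flat as in 1NF tables.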

30

Pig Latin Commands

Specifying Input Data: LOAD

queries = LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

Per-tuple Processing: FOREACH

expand_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

31

Pig Latin Commands (Cont.)

Discarding Unwanted Data: FILTER

real_queries = FILTER queries BY userId neq 'bot';

or

real_queries = FILTER queries BY NOT isBot(userId);

Filtering conditions involve combinations of expressions, comparison operators such as ==, eq, !=, neq, and the logical connectors AND, OR, NOT

32

Expressions in Pig Latin

33

Example of flattening in FOREACH
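Since the figure for this slide did not survive, here is a small Python sketch of what flattening does in a FOREACH. It assumes, purely for illustration, that expandQuery returns a bag of expanded query strings; `foreach_nested`, `foreach_flatten`, and the sample `expand_query` are hypothetical names and data.

```python
def foreach_nested(queries, expand_query):
    # FOREACH queries GENERATE userId, expandQuery(queryString);
    # Each output tuple carries the whole bag of expansions as one field.
    return [(user_id, expand_query(query)) for user_id, query in queries]


def foreach_flatten(queries, expand_query):
    # FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
    # The bag is unnested: one output tuple per element of the bag.
    flattened = []
    for user_id, query in queries:
        for expansion in expand_query(query):
            flattened.append((user_id, expansion))
    return flattened


def expand_query(query):
    # Hypothetical UDF: returns a bag (list) of expanded query strings.
    return [query + " rumors", query + " news"]


queries = [("alice", "lakers"), ("bob", "iPod")]
nested = foreach_nested(queries, expand_query)
flat = foreach_flatten(queries, expand_query)
```

Without FLATTEN the result has one tuple per input tuple, each holding a nested bag; with FLATTEN it has one tuple per expansion.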

34

Pig Latin Commands (Cont.)

Getting Related Data Together: COGROUP

Suppose we have two data sets
result: (queryString, url, position)
revenue: (queryString, adSlot, amount)

grouped_data = COGROUP result BY queryString, revenue BY queryString;

35

COGROUP versus JOIN
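The difference can be sketched in Python: COGROUP keeps, per key, one nested bag for each input, while JOIN corresponds to a COGROUP followed by taking the cross product of those bags. The functions and the sample rows below are illustrative assumptions, not Pig's implementation.

```python
from collections import defaultdict


def cogroup(result, revenue):
    """Group both inputs by their first field (queryString).

    Returns {queryString: (bag of result tuples, bag of revenue tuples)}.
    """
    groups = defaultdict(lambda: ([], []))
    for row in result:
        groups[row[0]][0].append(row)
    for row in revenue:
        groups[row[0]][1].append(row)
    return dict(groups)


def join(result, revenue):
    """Equi-join on queryString, built as COGROUP plus a per-key cross product."""
    return [
        result_row + revenue_row
        for result_bag, revenue_bag in cogroup(result, revenue).values()
        for result_row in result_bag
        for revenue_row in revenue_bag
    ]


# Hypothetical sample data with the schemas from the previous slide.
result = [("lakers", "nba.com", 1), ("lakers", "espn.com", 2), ("kings", "nhl.com", 1)]
revenue = [("lakers", "top", 50), ("kings", "top", 30), ("kings", "side", 10)]

grouped_data = cogroup(result, revenue)
join_result = join(result, revenue)
```

Stopping at the COGROUP stage leaves the two bags intact, so a UDF can process them together before anything is flattened.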

36

Pig Latin Example 3

Suppose we were trying to attribute search revenue to search-result urls to figure out the monetary worth of each url.

url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(result, revenue));

Here distributeRevenue is a UDF that accepts the search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

37

Pig Latin Commands (Cont.)

Special case of COGROUP: GROUP

grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE queryString, SUM(revenue.amount) AS totalRevenue;

JOIN in Pig Latin

join_result = JOIN result BY queryString, revenue BY queryString;

38

Pig Latin Commands (Cont.)

Map-Reduce in Pig Latin

map_result = FOREACH input GENERATE FLATTEN(map(*));

key_group = GROUP map_result BY $0;

output = FOREACH key_group GENERATE reduce(*);

39

Pig Latin Commands (Cont.)

Other Commands
• UNION: Returns the union of two or more bags
• CROSS: Returns the cross product
• ORDER: Orders a bag by the specified field(s)
• DISTINCT: Eliminates duplicate tuples in a bag

Nested Operations
• Pig Latin allows some commands to be nested within a FOREACH command

40

Pig Latin Commands (Cont.)

Asking for Output: STORE

The user can ask for the result of a Pig Latin expression sequence to be materialized to a file

STORE query_revenue INTO 'myoutput' USING myStore();

myStore is a custom serializer. For plain text files, it can be omitted.

41


42

Implementation

[Figure: system stack with the labels USER, SQL, “automatic rewrite + optimize”, Pig, Hadoop Map-Reduce, and cluster; “or” marks the alternative paths]

Pig is open-source.
http://incubator.apache.org/pig

43

Building a Logical Plan

• The Pig interpreter first parses the Pig Latin command and verifies that the input files and bags being referred to are valid
• It builds a logical plan for every bag that the user defines
• Processing is triggered only when the user invokes a STORE command on a bag (at that point, the logical plan for that bag is compiled into a physical plan and executed)
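The lazy behaviour described here can be imitated with a tiny Python class. This is a toy model under stated assumptions: `Bag`, `load`, and the counting reader are invented for this sketch, and "compiling into a physical plan" is reduced to running the recorded steps in order.

```python
class Bag:
    """A named step in a logical plan; nothing runs until store() is called."""

    def __init__(self, op, parent=None):
        self.op = op          # function applied to the parent's rows
        self.parent = parent  # previous step in the plan, or None for LOAD

    def filter(self, predicate):
        return Bag(lambda rows: [r for r in rows if predicate(r)], self)

    def foreach(self, fn):
        return Bag(lambda rows: [fn(r) for r in rows], self)

    def store(self):
        # Walk back to the LOAD step, then execute the recorded steps in order.
        steps, node = [], self
        while node is not None:
            steps.append(node.op)
            node = node.parent
        rows = None
        for op in reversed(steps):
            rows = op(rows)
        return rows


def load(read_rows):
    # LOAD only records how to read the data; it does not read it yet.
    return Bag(lambda _rows: list(read_rows()))


read_log = []


def read_numbers():
    read_log.append("read")  # lets us observe when the input is touched
    return [1, 2, 3, 4]


plan = load(read_numbers).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
reads_before_store = len(read_log)
stored = plan.store()
reads_after_store = len(read_log)
```

Defining `plan` touches no data; the input is read only once `store()` is invoked, which is what gives the system room to optimize the whole plan first.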

44

Map-Reduce Plan Compilation

• Every group or join operation forms a map-reduce boundary
• Other operations are pipelined into the map and reduce phases
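The boundary rule can be illustrated with a short Python sketch that splits a linear list of operator names into map-reduce jobs. It is a simplification made for this example: real plans are graphs with several inputs, and `compile_to_map_reduce` and `BOUNDARY_OPS` are hypothetical names, not Pig's compiler.

```python
BOUNDARY_OPS = {"GROUP", "COGROUP", "JOIN"}


def compile_to_map_reduce(ops):
    """Split a linear operator list into jobs at every group/join boundary.

    Operators before a boundary are pipelined into that job's map phase,
    operators after it into the reduce phase, until the next boundary
    starts a new job.
    """
    jobs = []
    current = {"map": [], "boundary": None, "reduce": []}
    for op in ops:
        if op in BOUNDARY_OPS:
            if current["boundary"] is not None:
                # A second boundary closes the current job.
                jobs.append(current)
                current = {"map": [], "boundary": None, "reduce": []}
            current["boundary"] = op
        elif current["boundary"] is None:
            current["map"].append(op)
        else:
            current["reduce"].append(op)
    jobs.append(current)
    return jobs


# Example 1: filter, group, filter, foreach fits into a single job.
one_job = compile_to_map_reduce(["LOAD", "FILTER", "GROUP", "FILTER", "FOREACH"])

# Top-10 example: group, join, group gives three jobs.
three_jobs = compile_to_map_reduce(
    ["LOAD", "GROUP", "FOREACH", "JOIN", "GROUP", "FOREACH"]
)
```

In this simplified model, an operator that follows a boundary always stays in the reduce phase of the same job rather than moving to the next job's map phase.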

45

Compilation into Map-Reduce

[Figure: the Example 1 data flow, starting with Filter good_urls by pagerank > 0.2, divided into Map1 and Reduce1]

Every group or join operation forms a map-reduce boundary

46

Compilation into Map-Reduce

[Figure: the top-10 data flow (Load Visits, Group by url, Foreach url generate count, Load Url Info, Join on url, Foreach category generate top10(urls)) divided into Map1, Reduce1, Map2, Reduce2, Map3, Reduce3]

Every group or join operation forms a map-reduce boundary


47

Efficiency With Nested Bags

• The (CO)GROUP command places tuples belonging to the same group into one or more nested bags
• The system can avoid actually materializing these bags, which is especially important when the bags are larger than a machine’s main memory
• One common case is where the user applies an algebraic aggregation function over the result of a (CO)GROUP operation
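To see why algebraic aggregates make it possible to skip materializing the bags, consider AVG computed from running (sum, count) partials. The Python sketch below is an assumption-laden illustration: the three-function split (`avg_initial`, `avg_combine`, `avg_final`) and `streaming_avg_by_key` are names chosen here, not Pig's UDF interface.

```python
def avg_initial(value):
    # Turn one value into a partial result: (sum, count).
    return (value, 1)


def avg_combine(left, right):
    # Merge two partial results; this can be applied in any order.
    return (left[0] + right[0], left[1] + right[1])


def avg_final(partial):
    # Produce the final answer from the merged partial result.
    return partial[0] / partial[1]


def streaming_avg_by_key(pairs):
    """Average per key while keeping only one small partial per key.

    The full bag of values for a key is never built in memory.
    """
    partials = {}
    for key, value in pairs:
        partial = avg_initial(value)
        if key in partials:
            partial = avg_combine(partials[key], partial)
        partials[key] = partial
    return {key: avg_final(partial) for key, partial in partials.items()}


averages = streaming_avg_by_key([("news", 1.0), ("news", 3.0), ("sports", 5.0)])
```

Because partial results can be merged in any order, the same functions can run on the map side before data is shipped, with only the small partials reaching the reduce side.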

48

Debugging Environment

• The process of constructing a Pig Latin program is iterative:
  • The user makes an initial stab at writing a program
  • Submits it to the system for execution
  • Inspects the output
• On large data sets, each such round is slow
• To avoid this inefficiency, users often create a side data set
• Unfortunately, this method does not always work well
• Pig comes with a debugging environment called Pig Pen, which creates the side data set automatically

49

Pig Pen screen shot

50

Generating a Sandbox Data Set

There are three primary objectives in selecting a sandbox data set:
• Realism: the sandbox data set should be a subset of the actual data set
• Conciseness: the example bags should be as small as possible
• Completeness: the example bags should collectively illustrate the key semantics of each command

51

Usage Scenarios

Session Analysis:
• Web user sessions, i.e., sequences of page views and clicks made by users, are analyzed to calculate:
  • How long is the average user session?
  • How many links does a user click on before leaving a website?
  • How do click patterns vary in the course of a day/week/month?
• Analysis tasks mainly consist of grouping the activity log by user and/or website

• First production release about a year ago
• At Yahoo!, 30% of all Hadoop jobs are run with Pig

52

Related Work

Sawzall
• Scripting language used at Google on top of map-reduce
• Rigid structure consisting of a filtering phase followed by an aggregation phase

DryadLINQ
• SQL-like language on top of Dryad, used at Microsoft

Nested Data Models
• Explored before in the context of object-oriented databases
• Data-parallel languages over nested data have also been explored, e.g., NESL

53

Future Work

Safe Optimizer
• Performs only high-confidence rewrites

User Interface
• “Boxes and arrows” GUI
• Promote collaboration, sharing code fragments and UDFs

External Functions
• Tight integration with a scripting language such as Perl or Python

Unified Environment

54

Summary

• Big demand for parallel data processing
  • Emerging tools that do not look like SQL DBMS
  • Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But Map-Reduce is too low-level and rigid

Pig Latin: sweet spot between map-reduce and SQL

55

References

[1] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.

[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.

[3] Pig Latin talk at SIGMOD 2008. http://i.stanford.edu/~usriv/talks/sigmod08-pig-latin.ppt

56

Thank you
