pig latin - ku leuvenbettina.berendt/... · pig latin includes a small set of carefully chosen...


Post on 25-Sep-2020

Pig Latin

Dominique Fonteyn, Wim Leers

Universiteit Hasselt

Pig Latin ...

... is an English word game in which we place the first letter of a word at the end and add the suffix -ay.

Pig Latin becomes igpay atinlay

banana becomes anana-bay

What does this have to do with computer science?

Will the real Pig Latin please stand up?

Pig Latin is a language developed by Yahoo! designed for ad-hoc data analysis.

Combination of

high-level declarative querying (SQL style)

low-level procedural programming (map-reduce)

First example

Find the average pagerank of high-pagerank URLs for each sufficiently large category in a table urls (url, category, pagerank).

SQL:

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 1000000

First example (2)

Find the average pagerank of high-pagerank URLs for each sufficiently large category in a table urls (url, category, pagerank).

PIG LATIN:

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
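To make the four steps concrete, here is a minimal Python sketch that mirrors them one by one on in-memory data. It is only an illustration of the semantics, not how Pig executes the program; the sample rows, the function name and the `min_count` parameter are assumptions made for this example, and the size threshold is taken to be one million.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.1),
    ("d.com", "blogs", 0.7),
]

def avg_pagerank_of_big_categories(rows, min_pagerank=0.2, min_count=10**6):
    # good_urls = FILTER urls BY pagerank > 0.2;
    good_urls = [row for row in rows if row[2] > min_pagerank]

    # groups = GROUP good_urls BY category;
    groups = defaultdict(list)
    for row in good_urls:
        groups[row[1]].append(row)

    # big_groups = FILTER groups BY COUNT(good_urls) > threshold;
    big_groups = {cat: bag for cat, bag in groups.items() if len(bag) > min_count}

    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
    return {cat: mean(row[2] for row in bag) for cat, bag in big_groups.items()}

# The toy data is tiny, so the size threshold is lowered to 1 here.
result = avg_pagerank_of_big_categories(urls, min_count=1)
```

Each Python statement corresponds to exactly one Pig Latin step, which is the point of the step-by-step style.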

First example (3)

Pig Latin programs are sequences of steps

Each step carries out a single data transformation

Transformations are fairly high-level

e.g. filtering, grouping, aggregation

low-level manipulations are unnecessary

Writing Pig Latin programs is similar to specifying a query execution plan, which makes it easier for programmers to understand and control how their data is being processed.

Presentation Overview

1 Features and Motivation

2 Pig Latin, the Language

3 Implementation

4 Practical Notes

5 Copresentation

Dataflow Language

Pig Latin is a high-level dataflow language. The user specifies a sequence of steps. Each step performs only a single, high-level data transformation.

The operations need not be executed in the order of that sequence.

The use of high-level, relational-algebra-style primitives like group and filter allows traditional database optimizations.

Dataflow Language (2)

Find the URLs of all pages that are classified as spam, but have a high pagerank.

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;

isSpam() is a user-defined function and may be expensive

calling it on every URL is not the most efficient method

Dataflow Language (3)

Find the URLs of all pages that are classified as spam, but have a high pagerank.

It would be more efficient to write:

high_pagerank_urls = FILTER urls BY pagerank > 0.8;
culprit_urls = FILTER high_pagerank_urls BY isSpam(url);

1 get all high pagerank pages first

2 invoke isSpam() only on these high pagerank pages

This optimization can be done automatically by the system.
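The effect of reordering the two filters can be sketched in Python by counting how often the expensive function is called. The sample rows and the spam rule inside `is_spam` are made up for illustration; only the call counts matter.

```python
# Hypothetical sample rows: (url, pagerank).
urls = [("a.com", 0.9), ("b.com", 0.3), ("c.com", 0.85), ("d.com", 0.1)]

calls = {"isSpam": 0}

def is_spam(url):
    # Stand-in for an expensive UDF; the rule used here is invented.
    calls["isSpam"] += 1
    return url in {"a.com", "d.com"}

def spam_first(rows):
    # Apply the expensive filter to every row, then the cheap one.
    spam_urls = [row for row in rows if is_spam(row[0])]
    return [row for row in spam_urls if row[1] > 0.8]

def pagerank_first(rows):
    # Apply the cheap filter first, then the expensive one on the survivors.
    high_pagerank_urls = [row for row in rows if row[1] > 0.8]
    return [row for row in high_pagerank_urls if is_spam(row[0])]

calls["isSpam"] = 0
slow_result = spam_first(urls)
slow_calls = calls["isSpam"]

calls["isSpam"] = 0
fast_result = pagerank_first(urls)
fast_calls = calls["isSpam"]
```

Both orders return the same rows, but the second one calls the expensive function only on rows that pass the pagerank test.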

Quick Start and Interoperability

Pig Latin is designed to support ad-hoc data analysis.

queries can be run directly over data files

the user must provide a function to parse the content into tuples

The same holds for output.

Quick Start and Interoperability (2)

Stored schemas are strictly optional.

Schema information can be provided on the fly, or even not at all.

Because ... PIGS EAT ANYTHING!

Nested Data Model

Programmers often think in terms of nested data structures.

Example: Capture information of each pig in a collection of pig farms.

Map<pigFarmId, Set<pig>>

Nested Data Model (2)

Databases allow only flat tables, i.e., columns are atomic fields.

pig_farms: (pigFarmId, pigFarmName, ...)

pigs: (pigId, pigName, ...)

pig_info: (pigFarmId, pigId)

Nested Data Model (3)

Pig Latin offers a flexible, fully nested data model and allows complex, non-atomic data types as fields of a table.

Some reasons for having a nested data model:

closer to how programmers think and thus much more natural to them than normalization

allows programmers to easily write a rich set of user-defined functions
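The contrast between the nested and the flat representation of the pig farm example can be sketched in Python. The farm and pig names are invented for this illustration, and `nest` is a hypothetical helper that rebuilds the nested form from the flat pig_info table.

```python
# Nested form, in the spirit of Map<pigFarmId, Set<pig>>.
farms_nested = {
    "farm1": {"babe", "wilbur"},
    "farm2": {"napoleon"},
}

# Flat form: rows of the pig_info table (pigFarmId, pigId).
pig_info = [("farm1", "babe"), ("farm1", "wilbur"), ("farm2", "napoleon")]

def nest(rows):
    # Rebuild the nested structure from the flat rows.
    nested = {}
    for farm_id, pig_id in rows:
        nested.setdefault(farm_id, set()).add(pig_id)
    return nested

rebuilt = nest(pig_info)
```

With the flat tables, the nesting has to be reconstructed before the data looks the way the programmer thinks about it; with a nested model it is stored that way directly.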

UDFs as First-Class Citizens

Custom processing is a significant part of analysing data.

Pig Latin has extensive support for user-defined functions (UDFs). All aspects of Pig Latin processing can be customized through the use of UDFs.

Input and output of UDFs in Pig Latin follow the nested data model. A UDF can take non-atomic parameters as input, and also output non-atomic values.

UDFs as First-Class Citizens (2)

Example: Find the top 10 URLs according to pagerank for each category.

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

Here, top10() is a UDF that accepts a set of URLs, and outputs a set containing the top 10 URLs by pagerank for that group.

The final output contains non-atomic fields: there is a tuple for each category, and one of the fields is the set of top 10 URLs.
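A rough Python analogue of a UDF that takes a bag and returns a bag is sketched below. The names `top_n` and `top_urls_per_category`, the sample rows, and the choice of n=2 for the toy data are all assumptions for illustration; real Pig UDFs are written against Pig's own interfaces.

```python
from collections import defaultdict

# Hypothetical sample rows: (url, category, pagerank).
urls = [
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "news", 0.7),
    ("d.com", "blogs", 0.4),
]

def top_n(bag, n=10):
    # Non-atomic input (a bag of tuples), non-atomic output (a smaller bag).
    return sorted(bag, key=lambda row: row[2], reverse=True)[:n]

def top_urls_per_category(rows, n=10):
    # groups = GROUP urls BY category;
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    # output = FOREACH groups GENERATE category, top10(urls);
    return [(category, top_n(bag, n)) for category, bag in groups.items()]

output = top_urls_per_category(urls, n=2)
```

Each output tuple holds a category and a nested bag, so the result itself has a non-atomic field.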

UDFs as First-Class Citizens (3)

Practical notes

UDFs are written in Java.

Yahoo! is building support for other languages, including C/C++, Perl (Erlpay) and Python (Ythonpay).

Parallelism Required

Processing web-scale data requires parallelism.

Pig Latin includes a small set of carefully chosen primitives that can be easily parallelized.

Other primitives that do not lend themselves to efficient parallel evaluation have been deliberately excluded.

They can still be carried out by UDFs. The user is then responsible for how efficient their programs are and whether they will be parallelized.

Debugging Environment

Getting a data processing program right usually takes many iterations. With web-scale data, a single iteration can take many minutes or hours. The usual run-debug-run cycle can be very slow and inefficient.

Pig comes with a novel interactive debugging environment that generates concise example data tables illustrating the output of each step of the user's program.

Data Model

Pig uses a rich, yet simple data model consisting of 4 types:

Atom

Tuple

Bag

Map

Specifying Input Data

The first step is to specify what the input data files are, and how the file contents are to be deserialized. We use the LOAD command.

We assume the input file is a bag, i.e., it contains a sequence of tuples.

Specifying Input Data (2)

queries = LOAD 'query_log.txt'

USING myLoad()

AS (userId, queryString, timestamp);

input file is query_log.txt

input is converted into tuples by using a custom myLoad deserializer

loaded tuples have 3 fields named userId, queryString and timestamp

Specifying Input Data (3)

queries = LOAD 'query_log.txt'

USING myLoad()

AS (userId, queryString, timestamp);

Both the USING and AS clauses are optional.

If no deserializer is specified, Pig uses a default one that expects a plain text, tab-delimited file.

If no schema is used, fields must be referred to by position instead of by name. For readability it is desirable to include schemas.
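The difference between positional and named field access can be sketched in Python with a toy tab-delimited loader. This is not Pig's loader; the `load` function, its `schema` parameter and the sample log lines are assumptions made for illustration.

```python
import io

def load(lines, schema=None):
    """Parse tab-delimited lines into tuples, or into dicts if a schema is given."""
    for line in lines:
        fields = tuple(line.rstrip("\n").split("\t"))
        yield dict(zip(schema, fields)) if schema else fields

# Hypothetical contents of a query log.
raw = "alice\tpig latin\t1200\nbob\thadoop\t1201\n"

# Without a schema: fields are addressed by position.
by_position = list(load(io.StringIO(raw)))
first_query = by_position[0][1]

# With a schema: fields are addressed by name.
by_name = list(load(io.StringIO(raw), schema=("userId", "queryString", "timestamp")))
first_query_named = by_name[0]["queryString"]
```

Both accesses return the same value, but the named form documents what the field means.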

Per-tuple Processing

The FOREACH command applies some processing to each tuple of a data set.

expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

Each tuple of the bag queries is processed independently to produce an output tuple.

The first field is the userId field of the input tuple.

The second field is the result of applying the UDF expandQuery() to the queryString field of the input tuple.

Per-tuple Processing (2)

The GENERATE clause can be followed by a list of expressions. A common expression type is flattening. The FLATTEN keyword eliminates nesting by extracting the fields of the tuples in the bag, and making them fields of the tuple being output by GENERATE. This removes one level of nesting.

expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));
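What flattening does to the shape of the output can be sketched in Python. The `expand_query` function below is a made-up stand-in for the expandQuery() UDF, and for simplicity its bag holds plain strings rather than tuples.

```python
def expand_query(query_string):
    # Hypothetical UDF: returns a bag (here a list) of expanded queries.
    return [query_string, query_string + " tutorial"]

# Hypothetical sample rows: (userId, queryString, timestamp).
queries = [("alice", "pig latin", 1200), ("bob", "hadoop", 1201)]

# Without FLATTEN: one output tuple per input tuple; the second field is a bag.
nested = [(user_id, expand_query(q)) for user_id, q, _ts in queries]

# With FLATTEN: one output tuple per element of the bag, no nesting left.
flattened = [
    (user_id, expansion)
    for user_id, q, _ts in queries
    for expansion in expand_query(q)
]
```

The nested version keeps two tuples with a bag each, while the flattened version produces four flat tuples.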

Discarding Unwanted Data

The FILTER command discards all data that is not of interest.

Example: Get rid of bot traffic.

real_queries = FILTER queries BY userId neq 'bot';

comparison operators:

==, !=, <, >, ... (numbers)

eq, neq (strings)

logical operators: AND, OR, NOT

Discarding Unwanted Data (2)

We can use UDFs as well.

Example: Get rid of bot traffic.

real_queries = FILTER queries BY NOT isBot(userId);

Getting Related Data Together

It is often necessary to group together related tuples from one or more data sets. This is done with the COGROUP command.

Example: we have 2 data sets specified

results: (queryString, url, position)

revenue: (queryString, adSlot, amount)

Getting Related Data Together (2)

Example: group together all search result data and revenue data for the same query string

grouped_data = COGROUP results BY queryString,
               revenue BY queryString;

Output: grouped_data: (group, results, revenue)

first field is the group identifier, the value of queryString

each subsequent field is a bag, one for each input being cogrouped, and is named the same as the alias of that input

Getting Related Data Together (3)

Example: join all search result data and revenue data for the same query string

join_result = JOIN results BY queryString,
              revenue BY queryString;

What is the difference with COGROUP?

Getting Related Data Together (5)

When there is only one data set, we use GROUP.

grouped_revenue = GROUP revenue BY queryString;

Getting Related Data Together - Summarized

When there is one data set

GROUP

When there are two or more data sets

JOIN

COGROUP

JOIN equals a COGROUP followed by FLATTEN
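The relation between the two commands can be sketched in Python: a toy `cogroup` that produces (group, bag, bag) tuples, and a `join` built on top of it by flattening both bags. The sample rows and both function names are assumptions for this illustration, not Pig's implementation.

```python
from itertools import product

# Hypothetical sample data.
# results: (queryString, url, position); revenue: (queryString, adSlot, amount)
results = [
    ("lakers", "nba.com", 1),
    ("lakers", "espn.com", 2),
    ("kings", "nhl.com", 1),
    ("kings", "nba.com", 2),
]
revenue = [
    ("lakers", "top", 50),
    ("lakers", "side", 20),
    ("kings", "top", 30),
    ("kings", "side", 10),
]

def cogroup(left, right):
    # One output tuple per key: (group, bag of left tuples, bag of right tuples).
    keys = sorted({t[0] for t in left} | {t[0] for t in right})
    return [
        (key, [t for t in left if t[0] == key], [t for t in right if t[0] == key])
        for key in keys
    ]

def join(left, right):
    # Flattening both bags yields their cross product within each group.
    return [
        l + r
        for _key, left_bag, right_bag in cogroup(left, right)
        for l, r in product(left_bag, right_bag)
    ]

grouped_data = cogroup(results, revenue)
join_result = join(results, revenue)
```

COGROUP keeps the nested bags, one tuple per query string, whereas JOIN returns flat tuples, one per matching pair.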

Map-Reduce in Pig Latin

The GROUP and FOREACH statements allow us to express a map-reduce program.

map_result = FOREACH input GENERATE FLATTEN(map(*));

key_groups = GROUP map_result BY $0;

output = FOREACH key_groups GENERATE reduce(*);
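The same three steps can be sketched in Python using word count as the example. The `map_fn` and `reduce_fn` functions are hypothetical stand-ins for the map() and reduce() UDFs, and the input records are invented.

```python
from collections import defaultdict

def map_fn(record):
    # Emits a bag of (key, value) pairs for one input record.
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    # Collapses all values of one key into a single output tuple.
    return (key, sum(values))

def run_map_reduce(records):
    # map_result = FOREACH input GENERATE FLATTEN(map(*));
    map_result = [pair for record in records for pair in map_fn(record)]

    # key_groups = GROUP map_result BY $0;
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)

    # output = FOREACH key_groups GENERATE reduce(*);
    return [reduce_fn(key, values) for key, values in key_groups.items()]

word_counts = dict(run_map_reduce(["pig latin", "pig farm"]))
```

The grouping step plays the role of the shuffle between the map and reduce phases.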

Other Commands

Other commands include:

UNION

CROSS

ORDER

DISTINCT

Nested Operations

Each command operates over one or more bags or tuples as input.

When we have nested bags within tuples, we can nest some commands within a FOREACH command.

grouped_revenue = GROUP revenue BY queryString;
query_revenues = FOREACH grouped_revenue {
    top_slot = FILTER revenue BY adSlot eq 'top';
    GENERATE queryString,
             SUM(top_slot.amount),
             SUM(revenue.amount);
};
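The nested FILTER can be pictured in Python as a filter applied to each group's bag before aggregating. The sample revenue rows and the function name are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sample rows: (queryString, adSlot, amount).
revenue = [
    ("lakers", "top", 50),
    ("lakers", "side", 20),
    ("kings", "top", 30),
    ("kings", "side", 10),
]

def revenue_per_query(rows):
    # Group revenue BY queryString.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)

    output = []
    for query_string, bag in grouped.items():
        # Nested command: FILTER the group's bag BY adSlot eq 'top'.
        top_slot = [row for row in bag if row[1] == "top"]
        # GENERATE queryString, SUM(top_slot.amount), SUM(revenue.amount).
        output.append(
            (query_string, sum(r[2] for r in top_slot), sum(r[2] for r in bag))
        )
    return output

totals = sorted(revenue_per_query(revenue))
```

For each query string the output holds the revenue from the top slot and the total revenue, both computed from the same nested bag.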

Asking for Output

Write results to file with STORE

STORE query_revenues INTO 'myoutput' USING myStore();

Implementation

Pig Latin is implemented by the Pig system.

Programs are compiled into map-reduce jobs and executed by Hadoop.

It is an open source project in the Apache incubator.

Building a Logical Plan

The Pig interpreter first parses the Pig Latin commands and verifies that the referred input files and bags are valid.

e.g. when entering c = COGROUP a BY ..., b BY ..., Pig verifies that a and b are already defined

It builds a logical plan for each defined bag.

Building a Logical Plan (2)

When defining a new bag, the logical plan is constructed by combining the logical plans for the input bags, and the current command.

e.g. when entering c = COGROUP a BY ..., b BY ...,

The logical plan for c consists of a cogroup command with the plans for a and b as input.

Building a Logical Plan (3)

When the logical plans are constructed, no processing is carried out.

Processing is only triggered when invoking a STORE command. Then the logical plan is compiled into a physical plan and executed.

This lazy style of execution permits in-memory pipelining and other optimizations.
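The idea of building a plan first and running it only on STORE can be sketched with a toy Python class. This is an illustration of lazy evaluation in general, not Pig's plan representation; the `Plan` class, the `load` helper and the `executed` log are all invented for this example.

```python
executed = []  # records which operators actually ran, in order

class Plan:
    """A node in a toy logical plan; building one does no processing."""

    def __init__(self, op, source=None, arg=None):
        self.op, self.source, self.arg = op, source, arg

    def filter(self, predicate):
        # Only records the command and its input plan.
        return Plan("FILTER", self, predicate)

    def _rows(self):
        # Generator: rows are pulled through the operators one at a time.
        executed.append(self.op)
        if self.op == "LOAD":
            yield from self.arg
        else:  # FILTER
            for row in self.source._rows():
                if self.arg(row):
                    yield row

    def store(self):
        # Processing is triggered only here.
        return list(self._rows())

def load(rows):
    return Plan("LOAD", arg=rows)

plan = load([1, 5, 10, 20]).filter(lambda x: x > 3).filter(lambda x: x < 15)
ran_before_store = list(executed)
stored = plan.store()
```

Nothing runs while the plan is being built, and once store() is called each row flows through both filters without an intermediate result being materialized.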

Map-Reduce Plan Compilation

Map-reduce provides the ability to do a large-scale group by

the map tasks assign keys for grouping

the reduce tasks process a group at a time

Practical Notes

More information can be found at pig.apache.org.

Pig is a project under active development. New features are to be added:

safe optimizer

user interfaces

external functions

unified environment
