Post on 21-Dec-2015
![Page 1: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/1.jpg)
Utkarsh Srivastava
Pig: Building High-Level Dataflows over Map-Reduce
Research & Cloud Computing
![Page 2: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/2.jpg)
Data Processing Renaissance
Internet companies swimming in data
• E.g. TBs/day at Yahoo!
Data analysis is “inner loop” of product innovation
Data analysts are skilled programmers
![Page 3: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/3.jpg)
Data Warehousing …?
Scale: Often not scalable enough
$ $ $ $: Prohibitively expensive at web scale
• Up to $200K/TB
SQL:
• Little control over execution method
• Query optimization is hard
• Parallel environment
• Little or no statistics
• Lots of UDFs
![Page 4: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/4.jpg)
New Systems For Data Analysis
Map-Reduce
Apache Hadoop
Dryad
. . .
![Page 5: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/5.jpg)
Map-Reduce
Input records: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
→ map
→ grouped by key: k1: (v1, v3, v5); k2: (v2, v4)
→ reduce
→ Output records
Just a group-by-aggregate?
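To make the "group-by-aggregate" reading concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `map_reduce`, `map_fn`, `reduce_fn`, and the numeric values are illustrative assumptions standing in for the k1/k2 records on the slide.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input: (key, numeric value) pairs mirroring k1/k2 on the slide.
records = [("k1", 1), ("k2", 2), ("k1", 3), ("k2", 4), ("k1", 5)]

totals = map_reduce(
    records,
    map_fn=lambda rec: [rec],                   # identity map: emit (key, value)
    reduce_fn=lambda key, values: sum(values),  # aggregate each group
)
print(totals)
```

With an identity map and a summing reduce, the whole job behaves like a SQL-style GROUP BY key with SUM(value), which is the question the slide raises.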
![Page 6: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/6.jpg)
The Map-Reduce Appeal
Scale: Scalable due to simpler design
• Only parallelizable operations
• No transactions
$: Runs on cheap commodity hardware
SQL: Procedural control (a processing “pipe”)
![Page 7: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/7.jpg)
Disadvantages
1. Extremely rigid data flow (M → R)
• Other flows constantly hacked in: join, union, split, chains (e.g., M → M → R → M)
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
![Page 8: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/8.jpg)
Pros And Cons
Need a high-level, general data flow language
![Page 9: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/9.jpg)
Enter Pig Latin
Pig Latin
Need a high-level, general data flow language
![Page 10: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/10.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
![Page 11: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/11.jpg)
Example Data Analysis Task
Visits

| User | Url | Time |
| --- | --- | --- |
| Amy | cnn.com | 8:00 |
| Amy | bbc.com | 10:00 |
| Amy | flickr.com | 10:05 |
| Fred | cnn.com | 12:00 |

Url Info

| Url | Category | PageRank |
| --- | --- | --- |
| cnn.com | News | 0.9 |
| bbc.com | News | 0.8 |
| flickr.com | Photos | 0.7 |
| espn.com | Sports | 0.9 |

Find the top 10 most visited pages in each category
![Page 12: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/12.jpg)
Data Flow
Load Visits
→ Group by url
→ Foreach url generate count
→ Join on url (second input: Load Url Info)
→ Group by category
→ Foreach category generate top10 urls
![Page 13: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/13.jpg)
In Pig Latin
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
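For readers who want to check what the script computes, the following Python sketch mirrors the same steps on the sample Visits and Url Info rows from the earlier slide. It is an in-memory approximation, not how Pig executes; `top_urls_per_category` and the alphabetical tie-breaking are assumptions made for the example.

```python
from collections import Counter, defaultdict

visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "flickr.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]
url_info = [
    ("cnn.com", "News", 0.9),
    ("bbc.com", "News", 0.8),
    ("flickr.com", "Photos", 0.7),
    ("espn.com", "Sports", 0.9),
]

def top_urls_per_category(visits, url_info, n=10):
    # group visits by url, then count each group
    visit_counts = Counter(url for _user, url, _time in visits)
    # join on url: look up the category of every visited url
    category_of = {url: category for url, category, _rank in url_info}
    by_category = defaultdict(list)
    for url, count in visit_counts.items():
        if url in category_of:
            by_category[category_of[url]].append((url, count))  # group by category
    # top n per category; ties broken alphabetically (an assumption)
    return {
        category: sorted(rows, key=lambda row: (-row[1], row[0]))[:n]
        for category, rows in by_category.items()
    }

top_urls = top_urls_per_category(visits, url_info)
print(top_urls)
```

Note that espn.com never appears in Visits, so the inner join drops the Sports category from the result.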
![Page 14: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/14.jpg)
Step-by-step Procedural Control
Target users are entrenched procedural programmers

“The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.”
Jasmine Novak, Engineer, Yahoo!

• Automatic query optimization is hard
• Pig Latin does not preclude optimization

“With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.”
David Ciemiewicz, Search Excellence, Yahoo!
![Page 15: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/15.jpg)
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
Operates directly over files
![Page 16: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/16.jpg)
Quick Start and Interoperability
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
Schemas optional; can be assigned dynamically
![Page 17: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/17.jpg)
User-Code as a First-Class Citizen
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into '/data/topUrls';
User-defined functions (UDFs) can be used in every construct
• Load, Store
• Group, Filter, Foreach
![Page 18: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/18.jpg)
Nested Data Model
• Pig Latin has a fully-nestable data model with:
– Atomic values, tuples, bags (lists), and maps
• More natural to programmers than flat tuples
• Avoids expensive joins
Example: (yahoo, {finance, email, news})
![Page 19: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/19.jpg)
Nested Data Model
Decouples grouping as an independent operation

Visits

| User | Url | Time |
| --- | --- | --- |
| Amy | cnn.com | 8:00 |
| Amy | bbc.com | 10:00 |
| Amy | bbc.com | 10:05 |
| Fred | cnn.com | 12:00 |

group by url

| group | Visits |
| --- | --- |
| cnn.com | (Amy, cnn.com, 8:00), (Fred, cnn.com, 12:00) |
| bbc.com | (Amy, bbc.com, 10:00), (Amy, bbc.com, 10:05) |

• Common case: aggregation on these nested sets
• Power users: sophisticated UDFs, e.g., sequence analysis
• Efficient implementation (see paper)

“I frankly like pig much better than SQL in some respects (group + optional flatten works better for me, I love nested data structures).”
Ted Dunning, Chief Scientist, Veoh
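The decoupling of grouping from aggregation can be sketched in a few lines of Python. This is only an illustration of the idea, with `group_by` as a hypothetical helper: grouping returns nested bags of whole tuples, and any aggregate (or more elaborate UDF) is applied to those bags in a separate step.

```python
from collections import defaultdict

def group_by(rows, key_fn):
    """Return {key: bag of full rows}; no aggregation happens here."""
    bags = defaultdict(list)
    for row in rows:
        bags[key_fn(row)].append(row)
    return dict(bags)

visits = [
    ("Amy", "cnn.com", "8:00"),
    ("Amy", "bbc.com", "10:00"),
    ("Amy", "bbc.com", "10:05"),
    ("Fred", "cnn.com", "12:00"),
]

# Step 1: grouping on its own yields one nested bag per url.
grouped = group_by(visits, key_fn=lambda row: row[1])

# Step 2 (independent): the common case, an aggregate over each bag.
counts = {url: len(bag) for url, bag in grouped.items()}
print(grouped)
print(counts)
```

Because the bags keep the original tuples, a later step could just as well run a sequence analysis over each bag instead of a simple count.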
![Page 20: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/20.jpg)
CoGroup
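results

| query | url | rank |
| --- | --- | --- |
| Lakers | nba.com | 1 |
| Lakers | espn.com | 2 |
| Kings | nhl.com | 1 |
| Kings | nba.com | 2 |

revenue

| query | adSlot | amount |
| --- | --- | --- |
| Lakers | top | 50 |
| Lakers | side | 20 |
| Kings | top | 30 |
| Kings | side | 10 |

CoGroup output

| group | results | revenue |
| --- | --- | --- |
| Lakers | (Lakers, nba.com, 1), (Lakers, espn.com, 2) | (Lakers, top, 50), (Lakers, side, 20) |
| Kings | (Kings, nhl.com, 1), (Kings, nba.com, 2) | (Kings, top, 30), (Kings, side, 10) |

Cross-product of the 2 bags would give natural join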
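The relationship between cogroup and join can be checked with a small Python sketch. The function names are hypothetical and the join key is assumed to be the first field of each tuple, as in the tables on this slide.

```python
from collections import defaultdict
from itertools import product

results = [
    ("Lakers", "nba.com", 1), ("Lakers", "espn.com", 2),
    ("Kings", "nhl.com", 1), ("Kings", "nba.com", 2),
]
revenue = [
    ("Lakers", "top", 50), ("Lakers", "side", 20),
    ("Kings", "top", 30), ("Kings", "side", 10),
]

def cogroup(left, right):
    """Return {key: (bag of left rows, bag of right rows)}, keyed on field 0."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[0]][0].append(row)
    for row in right:
        groups[row[0]][1].append(row)
    return dict(groups)

def join_via_cogroup(left, right):
    """Natural join: cross-product of the two bags within each group."""
    joined = []
    for _key, (left_bag, right_bag) in cogroup(left, right).items():
        for l_row, r_row in product(left_bag, right_bag):
            joined.append(l_row + r_row[1:])  # keep the shared key only once
    return joined

joined = join_via_cogroup(results, revenue)
print(len(joined))
```

Keeping the two bags separate until the last step is what lets a user apply something other than a cross-product, which a plain join would not allow.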
![Page 21: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/21.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
![Page 22: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/22.jpg)
Implementation
[Diagram: user → SQL or Pig → automatic rewrite + optimize → Hadoop Map-Reduce → cluster]
Pig is open-source: http://hadoop.apache.org/pig
• ~50% of Hadoop jobs at Yahoo! are Pig
• 1000s of jobs per day
![Page 23: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/23.jpg)
Compilation into Map-Reduce
Job 1 (Map1, Reduce1): Load Visits → Group by url → Foreach url generate count
Job 2 (Map2, Reduce2): Load Url Info → Join on url
Job 3 (Map3, Reduce3): Group by category → Foreach category generate top10(urls)
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
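The boundary rule can be illustrated with a toy planner in Python. This is a simplified sketch and not Pig's actual compiler: it assumes the plan is given as a flat list of operator strings, treats `group` and `join` as boundaries, and starts a new job whenever a new `load` appears after a boundary.

```python
BOUNDARY_OPS = {"group", "join"}

def new_job():
    return {"map": [], "boundary": None, "reduce": []}

def compile_to_jobs(ops):
    """Split a flat operator list into jobs of map ops, one boundary op, reduce ops."""
    jobs, current = [], new_job()
    for op in ops:
        kind = op.split()[0].lower()
        if kind in BOUNDARY_OPS:
            if current["boundary"] is not None:  # second boundary: start next job
                jobs.append(current)
                current = new_job()
            current["boundary"] = op
        elif current["boundary"] is None:
            current["map"].append(op)            # before the shuffle: map phase
        elif kind == "load":
            jobs.append(current)                 # new input feeds the next job's map
            current = new_job()
            current["map"].append(op)
        else:
            current["reduce"].append(op)         # after the shuffle: reduce phase
    jobs.append(current)
    return jobs

plan = [
    "load visits",
    "group by url",
    "foreach url generate count",
    "load urlInfo",
    "join on url",
    "group by category",
    "foreach category generate top10",
]
jobs = compile_to_jobs(plan)
for number, job in enumerate(jobs, start=1):
    print(number, job)
```

On the example plan this yields three jobs, matching the Map1/Reduce1 through Map3/Reduce3 labels on the slide.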
![Page 24: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/24.jpg)
Optimizations: Using the Combiner
Input records: (k1, v1), (k2, v2), (k1, v3), (k2, v4), (k1, v5)
→ map
→ grouped by key: k1: (v1, v3, v5); k2: (v2, v4)
→ reduce
→ Output records
Can pre-process data on the map-side to reduce data shipped
• Algebraic Aggregation Functions
• Distinct processing
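As an illustration of an algebraic aggregate, the sketch below computes a per-key average by having each mapper ship a partial (sum, count) instead of raw values. The function names and the numeric data are assumptions; Hadoop's real combiner API is not used here.

```python
from collections import defaultdict

def combine(map_output):
    """Map-side combiner: collapse one mapper's pairs into (key, (sum, count))."""
    partial = defaultdict(lambda: [0, 0])
    for key, value in map_output:
        partial[key][0] += value
        partial[key][1] += 1
    return [(key, (total, count)) for key, (total, count) in partial.items()]

def reduce_average(shipped):
    """Reduce side: merge partial sums and counts, then finish the average."""
    merged = defaultdict(lambda: [0, 0])
    for key, (total, count) in shipped:
        merged[key][0] += total
        merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Hypothetical output of two map tasks.
mapper_outputs = [
    [("k1", 1), ("k2", 2), ("k1", 3)],
    [("k2", 4), ("k1", 5)],
]

shipped = [pair for output in mapper_outputs for pair in combine(output)]
averages = reduce_average(shipped)
print(len(shipped), averages)
```

Here four partial records cross the network instead of five raw ones; the saving grows with the number of values per key per mapper. The trick works because an average can be rebuilt from partial sums and counts, which is what makes a function algebraic.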
![Page 25: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/25.jpg)
Optimizations: Skew Join
• Default join method is symmetric hash join.

| group | results | revenue |
| --- | --- | --- |
| Lakers | (Lakers, nba.com, 1), (Lakers, espn.com, 2) | (Lakers, top, 50), (Lakers, side, 20) |
| Kings | (Kings, nhl.com, 1), (Kings, nba.com, 2) | (Kings, top, 30), (Kings, side, 10) |

Cross product carried out on 1 reducer
• Problem if too many values with same key
• Skew join samples data to find frequent values
• Further splits them among reducers
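One way to picture the idea is the Python sketch below. It is an assumption-laden simplification rather than Pig's implementation: the left input is sampled deterministically (every second row), a key is called "hot" if it appears at least `hot_threshold` times in the sample, hot left rows are spread round-robin over the reducers, and matching right rows are replicated to every reducer.

```python
from collections import Counter, defaultdict

def skew_join(left, right, num_reducers=4, sample_every=2, hot_threshold=2):
    # 1. Sample the left input to estimate which keys are frequent.
    sample_counts = Counter(key for key, _ in left[::sample_every])
    hot = {key for key, count in sample_counts.items() if count >= hot_threshold}

    # 2. Route rows: each reducer holds (left rows, right rows).
    reducers = [([], []) for _ in range(num_reducers)]
    for position, (key, value) in enumerate(left):
        target = position % num_reducers if key in hot else hash(key) % num_reducers
        reducers[target][0].append((key, value))
    for key, value in right:
        targets = range(num_reducers) if key in hot else [hash(key) % num_reducers]
        for target in targets:
            reducers[target][1].append((key, value))

    # 3. Each reducer joins only the rows it received.
    joined = []
    for left_rows, right_rows in reducers:
        index = defaultdict(list)
        for key, value in right_rows:
            index[key].append(value)
        for key, left_value in left_rows:
            for right_value in index.get(key, []):
                joined.append((key, left_value, right_value))
    return joined

# Hypothetical skewed input: "Lakers" dominates the left side.
left = [("Lakers", i) for i in range(6)] + [("Kings", 0)]
right = [("Lakers", "top"), ("Lakers", "side"), ("Kings", "top")]
joined = skew_join(left, right)
print(len(joined))
```

Every left row still lands on exactly one reducer, so each matching pair is produced once; the cost is that right-side rows for hot keys are copied to all reducers.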
![Page 26: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/26.jpg)
Optimizations: Fragment-Replicate Join
• Symmetric-hash join repartitions both inputs
• If size(data set 1) >> size(data set 2)
– Just replicate data set 2 to all partitions of data set 1
• Translates to map-only job
– Open data set 2 as “side file”
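A minimal Python sketch of the idea follows. The names `map_task` and `fragment_replicate_join` and the sample rows are hypothetical; the point is only that each map task reads the whole small input and joins it against its own partition of the large input, so no shuffle or reduce phase is needed.

```python
from collections import defaultdict

def map_task(large_partition, small_rows):
    """One map task: load the small input as a side file, then stream the partition."""
    index = defaultdict(list)
    for key, value in small_rows:
        index[key].append(value)
    return [
        (key, value, small_value)
        for key, value in large_partition
        for small_value in index.get(key, [])
    ]

def fragment_replicate_join(large_partitions, small_rows):
    """Run one independent map task per partition of the large input."""
    joined = []
    for partition in large_partitions:
        joined.extend(map_task(partition, small_rows))
    return joined

# Hypothetical data: the large input is already split into two partitions.
visit_partitions = [
    [("cnn.com", "Amy"), ("bbc.com", "Amy")],
    [("cnn.com", "Fred"), ("espn.com", "Fred")],
]
url_categories = [("cnn.com", "News"), ("bbc.com", "News")]

joined = fragment_replicate_join(visit_partitions, url_categories)
print(joined)
```

The approach only pays off when the small input fits in each task's memory, which is why the slide conditions it on one data set being much larger than the other.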
![Page 27: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/27.jpg)
Optimizations: Merge Join
• Exploits data sets that are already sorted
• Again, a map-only job
– Open other data set as “side file”
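The core of a merge join is a single synchronized pass over two inputs sorted on the join key. The Python sketch below shows that pass under the assumption that both inputs are lists of (key, value) pairs already in key order; `merge_join` is an illustrative name, not a Pig function.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are both sorted by key."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            run_end = j
            while run_end < len(right) and right[run_end][0] == left_key:
                run_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:run_end]:
                    joined.append((left_key, left[i][1], right_value))
                i += 1
            j = run_end
    return joined

# Hypothetical sorted inputs with duplicate keys on both sides.
left = [("a", 1), ("b", 2), ("b", 3), ("d", 4)]
right = [("b", "x"), ("b", "y"), ("c", "z"), ("d", "w")]
joined = merge_join(left, right)
print(joined)
```

Neither input is re-sorted or repartitioned, which is why a map task can perform this join while reading the other data set as a side file.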
![Page 28: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/28.jpg)
Optimizations: Multiple Data Flows
Load Users → Filter bots, then two branches:
• Group by state → Apply udfs → Store into ‘bystate’
• Group by demographic → Apply udfs → Store into ‘bydemo’
Stages: Map1, Reduce1
![Page 29: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/29.jpg)
Optimizations: Multiple Data Flows
Load Users → Filter bots → Split into two branches (Map1)
Demultiplex (Reduce1), then per branch:
• Group by state → Apply udfs → Store into ‘bystate’
• Group by demographic → Apply udfs → Store into ‘bydemo’
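A rough Python sketch of sharing one job between two data flows is given below. It is an illustration under assumptions: the user fields (`state`, `demographic`, `is_bot`) are hypothetical, a count stands in for "apply udfs", and the split is modelled by tagging each emitted record with the branch it belongs to so that the reduce side can demultiplex.

```python
from collections import defaultdict

def map_phase(users):
    """Load and filter once, then split: emit one tagged record per branch."""
    for user in users:
        if user["is_bot"]:
            continue  # Filter bots
        yield ("bystate", user["state"]), user
        yield ("bydemo", user["demographic"]), user

def reduce_phase(tagged_records):
    """Group by (branch, key), then demultiplex each group to its branch's output."""
    groups = defaultdict(list)
    for branch_and_key, user in tagged_records:
        groups[branch_and_key].append(user)
    outputs = {"bystate": {}, "bydemo": {}}
    for (branch, key), bag in groups.items():
        outputs[branch][key] = len(bag)  # stand-in for "apply udfs"
    return outputs

# Hypothetical input rows.
users = [
    {"name": "a", "state": "CA", "demographic": "18-25", "is_bot": False},
    {"name": "b", "state": "CA", "demographic": "26-35", "is_bot": False},
    {"name": "c", "state": "NY", "demographic": "18-25", "is_bot": True},
]

outputs = reduce_phase(map_phase(users))
print(outputs)
```

The input is read and filtered a single time even though two grouped outputs are produced, which is the saving this optimization is after.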
![Page 30: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/30.jpg)
Other Optimizations
• Carry data as byte arrays as far as possible
• Using binary comparator for sorting
• “Streaming” data through external executables
![Page 31: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/31.jpg)
Performance
![Page 32: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/32.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin
• Compilation into Map-Reduce
• Example Generation
• Future Work
![Page 33: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/33.jpg)
Example Dataflow Program
Find users that tend to visit high-pagerank pages
LOAD (user, url)
LOAD (url, pagerank)
FOREACH user, canonicalize(url)
JOIN on url
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5
![Page 34: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/34.jpg)
Iterative Process
LOAD (user, url)
LOAD (url, pagerank)
FOREACH user, canonicalize(url) ← Bug in UDF canonicalize?
JOIN on url ← Joining on right attribute?
GROUP on user
FOREACH user, AVG(pagerank)
FILTER avgPR > 0.5 ← Everything being filtered out?
No Output
![Page 35: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/35.jpg)
How to do test runs?
• Run with real data
– Too inefficient (TBs of data)
• Create smaller data sets (e.g., by sampling)
– Empty results due to joins [Chaudhuri et al. 99], and selective filters
• Biased sampling for joins
– Indexes not always present
![Page 36: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/36.jpg)
Examples to Illustrate Program
LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
FOREACH user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
GROUP on user: (Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)}), (Fred, {(Fred, www.snails.com, 0.4)})
FOREACH user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
FILTER avgPR > 0.5: (Amy, 0.6)
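The example tuples can be reproduced with a short Python sketch of the same dataflow. The `canonicalize` function here is a guess at what such a UDF might do (drop the scheme and path, ensure a `www.` prefix); it is not taken from the talk, and the intermediate variable names are illustrative.

```python
from collections import defaultdict

def canonicalize(url):
    """Hypothetical UDF: strip scheme and path, make sure the host starts with www."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host

visits = [
    ("Amy", "cnn.com"),
    ("Amy", "http://www.frogs.com"),
    ("Fred", "www.snails.com/index.html"),
]
pages = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# FOREACH user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_of = dict(pages)
joined = [(user, url, rank_of[url]) for user, url in canonical if url in rank_of]

# GROUP on user
grouped = defaultdict(list)
for user, _url, rank in joined:
    grouped[user].append(rank)

# FOREACH user, AVG(pagerank)
averages = [(user, sum(ranks) / len(ranks)) for user, ranks in grouped.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]
print(result)
```

Inspecting `canonical`, `joined`, and `averages` after each step is the manual version of what the example tuples on the slide provide automatically.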
![Page 37: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/37.jpg)
Value Addition From Examples
• Examples can be used for
– Debugging
– Understanding a program written by someone else
– Learning a new operator, or language
![Page 38: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/38.jpg)
Good Examples: Consistency
0. Consistency
output example = operator applied on input example
Input to FOREACH user, canonicalize(url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
![Page 39: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/39.jpg)
Good Examples: Realism
1. Realism
Input to FOREACH user, canonicalize(url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
![Page 40: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/40.jpg)
Good Examples: Completeness
2. Completeness
Demonstrate the salient properties of each operator, e.g., FILTER
Input to FILTER avgPR > 0.5: (Amy, 0.6), (Fred, 0.4)
Output: (Amy, 0.6)
![Page 41: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/41.jpg)
Good Examples: Conciseness
3. Conciseness
Input to FOREACH user, canonicalize(url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
Output: (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
![Page 42: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/42.jpg)
Implementation Status
• Available as ILLUSTRATE command in open-source release of Pig
• Available as Eclipse Plugin (PigPen)
• See SIGMOD09 paper for algorithm and experiments
![Page 43: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/43.jpg)
Related Work
• Sawzall
– Data processing language on top of map-reduce
– Rigid structure of filtering followed by aggregation
• Hive
– SQL-like language on top of Map-Reduce
• DryadLINQ
– SQL-like language on top of Dryad
• Nested data models
– Object-oriented databases
![Page 44: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/44.jpg)
Future / In-Progress Tasks
• Columnar-storage layer
• Metadata repository
• Profiling and Performance Optimizations
• Tight integration with a scripting language
– Use loops, conditionals, functions of host language
• Memory Management
• Project Suggestions at: http://wiki.apache.org/pig/ProposedProjects
![Page 45: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/45.jpg)
Credits
![Page 46: Utkarsh Srivastava Pig : Building High-Level Dataflows over Map-Reduce Research & Cloud Computing](https://reader033.vdocuments.site/reader033/viewer/2022051619/56649d595503460f94a38ee5/html5/thumbnails/46.jpg)
Summary
• Big demand for parallel data processing
– Emerging tools that do not look like SQL DBMS
– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
Pig Latin: Sweet spot between map-reduce and SQL