pig: data analysis tool in cloud

Post on 20-Jan-2015

4.026 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Presentation in Java One conference in Beijing 2010

TRANSCRIPT

Pig : Data Analysis Tool in the Cloud

Jeff Zhangzjffdu@gmail.comCommitter of Pig in ASF

Agenda

• Background

• What is Pig

• Brief introduction of Pig internals

• Demo

• Q/A

Data Explosion

• Web 2.0

• More digit terminal

What we have for data analysis

• RDBMS (Scalability)

• Parallel RDBMS (Expensive)

• Programming Language (Too complex)

• Hadoop MapReduce (Still too complex for non-hadoop users)

Then, Pig’s Coming

What is Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language (PigLatin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop

A simple example of Pig-Latin

raw_data = load '/java_one/pv' Using PigStorage(‘,') as (time_stamp : long, url : chararray);

pages = foreach raw_data generate url;pages = group pages by url;pages = foreach pages generate group as url, COUNT(pages.url) as pv;

pages = order pages by pv desc;top10 = limit pages 10;

dump top10;

• Page view

• The most 10 popular pages

1291950309812, http://snda.com/page_1 1291950309822, http://snda.com/page_2 1291950309832, http://snda.com/page_3

….

Operators in Pig-Latin

Load - a = load ‘data’ using PigStorage(‘\t’) as (f1:int ,f2:double,f3:chararray)

Store - store a into ‘/test/output’ using PigStorage(‘,’)

Dump - dump a

Filter - b = filter a by f1 > 0 and f2 == ‘java_one’

Foreach - b = foreach a generate f1, f3

Group - b= group a by f3;

Join - b = Join a by f1, b by f1;

Describe - describe b;

….

Data Structure in Pig

• Cell field in database- Primitive types: int, long, float, double, bytearray, chararrar,nul

- Complex types: map, tuple, databag

• Tuple row– (1, 1.2, “java”)

• DataBag table or view – { (1, 1.2, “java”), (2,2.3, “c++”) , (3,4.5,”c”) }

How to use Pig

Grunt (Interactive Shell)

Java API

Other languages (in future)

Architecture of Pig

Parser (PigLatinLogicalPlan)

Optimizer (LogicalPlan LogicalPlan)

Compiler (LogicalPlan PhysiclaPlan MapReducePlan)

ExecutionEngine

PigContext

Hadoop

Grunt (Interactive shell) PigServer (Java API)

Three basic operations of Pig

• Group by

• Join

• Order

How Pig do Group by

(A,1)(B,2)(C,3)(B,4)(B,5)(C,6)(A,7)(E,8)(D,9)

(A,1)(B,2)(C,3)

(B,4)(B,5)(C,6)

(A,7)(E,8)(D,9)

(A,{(A,1),(A,7)})(C,{(C,3),(C,6)})

(E,{(E,8)})

(B,{(B,2),(B,4),(B,5)})(D,{(D,9)})

Data Source Split Mapper Partition Reducer

How Pig do Join

(3,A3)(5,A5)(3,B3)(2,B2)

(2,A2)(4,B4)

((1,A1),(1,B1))((3,A3),(3,B3))((5,A5),(5,B5))

((2,A2)(2,B2))((4,B4),(4,B4))

(1,A1)(4,A4)(3,A3)(5,A5)(2,A2)

(5,B5)(1,B1)(3,B3)(2,B2)(4,B4)

(1,A1)(4,A4)(5,B5)(1,B1)

Data Source Split Mapper Partition Reducer

How Pig do Sort

(100)(200)(900)(50)

(600)(800)(300)(400)

(100)(200)(900)

(50)(600)(800)

(300)(400)

(50)(100)(200)(300)(400)

(600)(800)

Data Source Split Mapper Range Partition Reducer

UDF (User-Defined-Function)

register myudf.jar; raw_data = load ‘/java_one/udf’ as (name:chararray);firstnames = foreach raw_data generate myudf.FirstName (name); store firstnames into ‘/java_one/udf_output’;

public class FirstName extends EvalFunc<String>{

@Override public String exec(Tuple input) throws IOException { String name=input.get(0).toString(); …. return firstname; }}

What Storage Pig Supports

• HDFS– Plain Text– Binary format– Customized format (XML, JSON, Protobuf, Thrift…)

• RDBMS (DBStorage)

• Cassandra (CassandraStorage)

• HBase (HBaseStorage)

What fields can Pig be applied

• Data Analysis

• Text Processing

• ETL

• Machine Learning

Who’s using Pig

More: http://wiki.apache.org/pig/PoweredBy

References

• http://pig.apache.org (Pig official site)

• http://hadoop.apache.org (Hadoop official site)

• https://github.com/zjffdu/RAF-PIG (Rich API for Pig)

Demo

Thank you !Q&A

top related