Pig: Data Analysis Tool in the Cloud

Jeff Zhang

zjffdu@gmail.com

Committer of Pig at the ASF

Agenda

• Background

• What is Pig

• Brief introduction of Pig internals

• Demo

• Q/A

Data Explosion

• Web 2.0

• More digital devices (terminals)

What we have for data analysis

• RDBMS (Scalability)

• Parallel RDBMS (Expensive)

• Programming Language (Too complex)

• Hadoop MapReduce (Still too complex for non-Hadoop users)

Then Pig Came Along

What is Pig

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

• Ease of programming

• Optimization opportunities

• Extensibility

• Built upon Hadoop

A simple example of Pig Latin

raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);
pages = foreach raw_data generate url;
pages = group pages by url;
pages = foreach pages generate group as url, COUNT(pages.url) as pv;
pages = order pages by pv desc;
top10 = limit pages 10;
dump top10;

• Page views

• The 10 most popular pages

Sample input (time_stamp, url):

1291950309812, http://snda.com/page_1
1291950309822, http://snda.com/page_2
1291950309832, http://snda.com/page_3
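For comparison, here is a minimal plain-Java sketch of the same computation (count views per URL, order by count descending, keep the top n) on an in-memory list. It is only an illustration of what the script expresses, not how Pig executes it. The class name TopPagesSketch, the tie-breaking by URL, and the trimming of the space after the comma are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopPagesSketch {
    /** Counts page views per URL and returns the top n entries as "url=count". */
    static List<String> topPages(List<String> lines, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            // Each line is "time_stamp,url"; split on the first comma only.
            String[] fields = line.split(",", 2);
            // Assumption: trim the space after the comma (PigStorage would keep it).
            String url = fields[1].trim();
            counts.merge(url, 1L, Long::sum);   // group by url + COUNT
        }
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // order by pv desc; ties broken by URL so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {   // limit n
            top.add(entries.get(i).getKey() + "=" + entries.get(i).getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        // Invented sample: the third line repeats page_1 so the counts differ.
        List<String> lines = Arrays.asList(
            "1291950309812, http://snda.com/page_1",
            "1291950309822, http://snda.com/page_2",
            "1291950309832, http://snda.com/page_1");
        System.out.println(topPages(lines, 10));
        // prints [http://snda.com/page_1=2, http://snda.com/page_2=1]
    }
}
```

The seven-statement script replaces all of this bookkeeping, and Pig additionally distributes the work across a Hadoop cluster.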

Operators in Pig Latin

Load - a = load 'data' using PigStorage('\t') as (f1:int, f2:double, f3:chararray);

Store - store a into '/test/output' using PigStorage(',');

Dump - dump a;

Filter - b = filter a by f1 > 0 and f3 == 'java_one';

Foreach - b = foreach a generate f1, f3;

Group - b = group a by f3;

Join - c = join a by f1, b by f1;

Describe - describe b;

Data Structure in Pig

• Cell (like a field in a database)

- Primitive types: int, long, float, double, bytearray, chararray, null

- Complex types: map, tuple, databag

• Tuple (like a row): (1, 1.2, "java")

• DataBag (like a table or view): { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }

How to use Pig

Grunt (Interactive Shell)

Java API

Other languages (in future)

Architecture of Pig

Parser (Pig Latin → LogicalPlan)

Optimizer (LogicalPlan → LogicalPlan)

Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)

ExecutionEngine

PigContext

Hadoop

Grunt (Interactive shell)

PigServer (Java API)

Three basic operations of Pig

• Group by

• Join

• Order

How Pig Does Group By

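Stages: Data Source → Split → Mapper → Partition → Reducer

Data source: (A,1) (B,2) (C,3) (B,4) (B,5) (C,6) (A,7) (E,8) (D,9)

Split 1 (Mapper 1): (A,1) (B,2) (C,3)

Split 2 (Mapper 2): (B,4) (B,5) (C,6)

Split 3 (Mapper 3): (A,7) (E,8) (D,9)

Reducer 1: (A,{(A,1),(A,7)}) (C,{(C,3),(C,6)})

Reducer 2: (E,{(E,8)})

Reducer 3: (B,{(B,2),(B,4),(B,5)}) (D,{(D,9)})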
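The flow in this diagram can be sketched in plain Java: each mapper emits its records keyed by the group field, a partitioner sends every key to exactly one reducer, and each reducer collects the records for a key into a bag. This is a simplified in-memory illustration, not Pig's actual implementation. The class name GroupBySketch and the hash-based partitioner are assumptions, so the assignment of keys to reducers will not match the diagram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupBySketch {
    /**
     * Groups "key,value" records by key. Returns one map per reducer,
     * from key to the bag (list) of records that share that key.
     */
    static List<Map<String, List<String>>> groupBy(List<List<String>> splits, int numReducers) {
        List<Map<String, List<String>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (List<String> split : splits) {              // one mapper per split
            for (String record : split) {
                String key = record.split(",")[0];       // map: emit (key, record)
                // partition: hypothetical hash partitioner, one reducer per key
                int r = Math.floorMod(key.hashCode(), numReducers);
                // reduce: append the record to the bag for its key
                reducers.get(r).computeIfAbsent(key, k -> new ArrayList<>()).add(record);
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("A,1", "B,2", "C,3"),
            Arrays.asList("B,4", "B,5", "C,6"),
            Arrays.asList("A,7", "E,8", "D,9"));
        for (Map<String, List<String>> reducer : groupBy(splits, 3)) {
            System.out.println(reducer);
        }
    }
}
```

The property that matters is that all records with the same key meet in the same reducer, whichever reducer that turns out to be.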

How Pig Does Join

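Stages: Data Source → Split → Mapper → Partition → Reducer

Data source A: (1,A1) (4,A4) (3,A3) (5,A5) (2,A2)

Data source B: (5,B5) (1,B1) (3,B3) (2,B2) (4,B4)

Split 1 (Mapper 1): (1,A1) (4,A4) (5,B5) (1,B1)

Split 2 (Mapper 2): (3,A3) (5,A5) (3,B3) (2,B2)

Split 3 (Mapper 3): (2,A2) (4,B4)

Reducer 1: ((1,A1),(1,B1)) ((3,A3),(3,B3)) ((5,A5),(5,B5))

Reducer 2: ((2,A2),(2,B2)) ((4,A4),(4,B4))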
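A reduce-side join in the spirit of this diagram can be sketched as follows: records from both inputs are partitioned by join key so that matching keys land in the same reducer, and the reducer pairs every record from the first input with every record from the second input that has the same key. This is a simplified in-memory sketch, not Pig's actual join code. The names JoinSketch and shuffle, and the hash-based partitioner, are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JoinSketch {
    /** Partitions "key,value" records by key hash; one key-to-records map per reducer. */
    static List<Map<String, List<String>>> shuffle(List<String> records, int numReducers) {
        List<Map<String, List<String>>> parts = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            parts.add(new TreeMap<>());
        }
        for (String record : records) {
            String key = record.split(",")[0];
            int r = Math.floorMod(key.hashCode(), numReducers);
            parts.get(r).computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
        return parts;
    }

    /** Inner join of two inputs on the first field of each record. */
    static List<String> join(List<String> left, List<String> right, int numReducers) {
        List<Map<String, List<String>>> leftParts = shuffle(left, numReducers);
        List<Map<String, List<String>>> rightParts = shuffle(right, numReducers);
        List<String> joined = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {          // each reducer works independently
            for (Map.Entry<String, List<String>> e : leftParts.get(r).entrySet()) {
                List<String> matches =
                    rightParts.get(r).getOrDefault(e.getKey(), Collections.emptyList());
                for (String l : e.getValue()) {          // cross product per key
                    for (String m : matches) {
                        joined.add("((" + l + "),(" + m + "))");
                    }
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("1,A1", "4,A4", "3,A3", "5,A5", "2,A2");
        List<String> b = Arrays.asList("5,B5", "1,B1", "3,B3", "2,B2", "4,B4");
        System.out.println(join(a, b, 2));
    }
}
```

Because both inputs use the same partitioner, a key from one input can only ever match records held by the same reducer, so no reducer needs to see the whole data set.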

How Pig Does Sort

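Stages: Data Source → Split → Mapper → Range Partition → Reducer

Data source: (100) (200) (900) (50) (600) (800) (300) (400)

Split 1 (Mapper 1): (100) (200) (900)

Split 2 (Mapper 2): (50) (600) (800)

Split 3 (Mapper 3): (300) (400)

Reducer 1: (50) (100) (200) (300) (400)

Reducer 2: (600) (800) (900)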
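Range partitioning can be sketched in plain Java: each value is routed to the reducer whose key range contains it, every reducer sorts its own share, and reading the reducers in order yields a globally sorted result. This is a minimal in-memory sketch. The class name RangeSortSketch and the boundary value 400 are assumptions chosen for illustration; a real job would have to derive boundaries from the data itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RangeSortSketch {
    /**
     * boundaries must be ascending. Reducer i receives values <= boundaries[i]
     * (and greater than the previous boundary); the last reducer receives the rest.
     */
    static List<List<Integer>> sort(List<Integer> values, int[] boundaries) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int v : values) {
            int r = 0;
            while (r < boundaries.length && v > boundaries[r]) {
                r++;                                     // range partition
            }
            reducers.get(r).add(v);
        }
        for (List<Integer> part : reducers) {
            Collections.sort(part);                      // local sort inside each reducer
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(100, 200, 900, 50, 600, 800, 300, 400);
        // Hypothetical boundary: values up to 400 go to the first reducer.
        System.out.println(sort(values, new int[] {400}));
        // prints [[50, 100, 200, 300, 400], [600, 800, 900]]
    }
}
```

A hash partitioner would spread the values evenly but destroy the ordering between reducers, which is why sorting uses ranges instead.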

UDF (User-Defined Function)

register myudf.jar;
raw_data = load '/java_one/udf' as (name:chararray);
firstnames = foreach raw_data generate myudf.FirstName(name);
store firstnames into '/java_one/udf_output';

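package myudf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class FirstName extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // The first (and only) argument is the name field passed in by the script.
        String name = input.get(0).toString();
        ….
        return firstname;
    }
}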

What Storage Pig Supports

• HDFS

- Plain text

- Binary format

- Customized format (XML, JSON, Protobuf, Thrift…)

• RDBMS (DBStorage)

• Cassandra (CassandraStorage)

• HBase (HBaseStorage)

Where Pig Can Be Applied

• Data Analysis

• Text Processing

• ETL

• Machine Learning

Who’s using Pig

More: http://wiki.apache.org/pig/PoweredBy

References

• http://pig.apache.org (Pig official site)

• http://hadoop.apache.org (Hadoop official site)

• https://github.com/zjffdu/RAF-PIG (Rich API for Pig)

Thank you!

Q&A
