oozie hug may12

31
Oozie: Towards a Scalable Workflow Management System for Hadoop Mohammad Islam And Virag Kothari

Upload: mislam77

Post on 26-Jan-2015

120 views

Category:

Technology


1 download

DESCRIPTION

Presented at HUG, 2012

TRANSCRIPT

Page 1: Oozie HUG May12

Oozie: Towards a Scalable Workflow Management System for Hadoop

Mohammad Islam And

Virag Kothari

Page 2: Oozie HUG May12

Accepted Paper

•  Workshop in ACM/SIGMOD, May 2012. •  It is a team effort!

Mohammad Islam Angelo Huang Mohamed Battisha Michelle Chiang Santhosh Srinivasan Craig Peters

Andreas Neumann Alejandro Abdelnur

Page 3: Oozie HUG May12

Presentation Workflow

Page 4: Oozie HUG May12

Installing Oozie

Step 2: Unpack the tarball tar –xzvf <PATH_TO_OOZIE_TAR>

Step 3: Run the setup script bin/oozie-setup.sh -hadoop 0.20.200 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip

Step 4: Start oozie bin/oozie-start.sh

Step 5: Check status of oozie bin/oozie admin -oozie http://localhost:11000/oozie -status

Step 1: Download the Oozie tarball curl -O http://mirrors.sonic.net/apache/incubator/oozie/oozie-3.1.3-incubating/oozie-3.1.3-incubating-distro.tar.gz

Page 5: Oozie HUG May12

Running an Example

<workflow –app name =..> <start..> <action> <map-reduce> …… …… </workflow>

Workflow.xml

$ hadoop jar /usr/joe/hadoop-examples.jar org.myorg.wordcount inputDir outputDir

•  Standalone Map-Reduce job

Start MapReduce wordcount End

OK

Kill

ERROR

Example DAG

•  Using Oozie

Page 6: Oozie HUG May12

<action name=’wordcount'> <map-reduce> <configuration> <property> <name>mapred.mapper.class</name>

<value>org.myorg.WordCount.Map</value> </property>

<property> <name>mapred.reducer.class</name>

<value>org.myorg.WordCount.Reduce</value> </property> <property>

<name>mapred.input.dir</name> <value>usr/joe/inputDir </value> </property> <property> <name>mapred.output.dir</name>

<value>/usr/joe/outputDir</value> </property> </configuration>

</map-reduce> </action>

Example Workflow

mapred.mapper.class = org.myorg.WordCount.Map

mapred.reducer.class = org.myorg.WordCount.Reduce

mapred.input.dir = inputDir

mapred.output.dir = outputDir

Page 7: Oozie HUG May12

A Workflow Application

1)   Workflow.xml: Contains job definition

2) Libraries: optional ‘lib/’ directory contains .jar/.so files

3) Properties file: •  Parameterization of Workflow xml •  Mandatory property is oozie.wf.application.path

Three components required for a Workflow:

Page 8: Oozie HUG May12

Workflow Submission Run Workflow Job

$ oozie job –run -config job.properties -oozie http://localhost:11000/oozie/

Workflow ID: 00123-123456-oozie-wrkf-W

Check Workflow Job Status

$ oozie job –info 00123-123456-oozie-wrkf-W -oozie http://localhost:11000/oozie/

----------------------------------------------------------------------- Workflow Name: test-wf App Path: hdfs://localhost:11000/user/your_id/oozie/ Workflow job status [RUNNING] ... ------------------------------------------------------------------------

Page 9: Oozie HUG May12

Key Features and Design Decisions •  Multi-tenant •  Security

– Authenticate every request – Pass appropriate token to Hadoop job

•  Scalability – Vertical: Add extra memory/disk – Horizontal: Add machines

Page 10: Oozie HUG May12

Oozie Job Processing

Oozie Server Sec

ure

Acc

ess

Had

oop

Kerberos

Oozie Security

End user

Job

Page 11: Oozie HUG May12

Oozie-Hadoop Security

Oozie Server Secu

re

Acc

ess

End user

Oozie Security

Had

oop

Job Kerberos

Page 12: Oozie HUG May12

Oozie-Hadoop Security

•  Oozie is a multi-tenant system •  Job can be scheduled to run later •  Oozie submits/maintains the hadoop jobs •  Hadoop needs security token for each

request

Question: Who should provide the security token to hadoop and how?

Page 13: Oozie HUG May12

Oozie-Hadoop Security Contd.

•  Answer: Oozie •  How?

– Hadoop considers Oozie as a super-user – Hadoop does not check end-user

credential – Hadoop only checks the credential of

Oozie process

•  BUT hadoop job is executed as end-user. •  Oozie utilizes doAs() functionality of Hadoop.

Page 14: Oozie HUG May12

User-Oozie Security

Oozie Server Secu

re

Acc

ess

End user

Oozie Security

Had

oop

Kerberos Job

Page 15: Oozie HUG May12

Why Oozie Security?

•  One user should not modify another user’s job

•  Hadoop doesn’t authenticate end–user •  Oozie has to verify its user before passing

the job to Hadoop

Page 16: Oozie HUG May12

How does Oozie Support Security?

•  Built-in authentication – Kerberos – Non-secured (default)

•  Design Decision – Pluggable authentication – Easy to include new type of authentication – Yahoo supports 3 types of authentication.

Page 17: Oozie HUG May12

Job Submission to Hadoop

•  Oozie is designed to handle thousands of jobs at the same time

•  Question : Should Oozie server – Submit the hadoop job directly? – Wait for it to finish?

•  Answer: No

Page 18: Oozie HUG May12

Job Submission Contd.

•  Reason – Resource constraints: A single Oozie process

can’t simultaneously create thousands of thread for each hadoop job. (Scaling limitation)

–  Isolation: Running user code on Oozie server might de-stabilize Oozie

•  Design Decision – Create a launcher hadoop job – Execute the actual user job from the launcher. – Wait asynchronously for the job to finish.

Page 19: Oozie HUG May12

Job Submission to Hadoop

Oozie Server

Hadoop Cluster Job

Tracker

Launcher Mapper

12

4

Actual M/R Job 3

5

Page 20: Oozie HUG May12

Job Submission Contd.

•  Advantages – Horizontal scalability: If load increases, add

machines into Hadoop cluster – Stability: Isolation of user code and system

process •  Disadvantages

– Extra map-slot is occupied by each job.

Page 21: Oozie HUG May12

Production Setup

•  Total number of nodes: 42K+ •  Total number of Clusters: 25+ •  Total number of processed jobs ≈ 750K/month •  Data presented from two clusters •  Each of them have nearly 4K nodes •  Total number of users /cluster = 50

Page 22: Oozie HUG May12

Oozie Usage Pattern @ Y!

0

5

10

15

20

25

30

35

40

45

50

fs java map-reduce pig

Perc

enta

ge

Job type

Distribution of Job Types On Production Clusters

#1 Cluster #2 Cluster

Page 23: Oozie HUG May12

Experimental Setup

•  Number of nodes: 7 •  Number of map-slots: 28 •  4 Core, RAM: 16 GB •  64 bit RHEL •  Oozie Server

– 3 GB RAM –  Internal Queue size = 10 K – # Worker Thread = 300

Page 24: Oozie HUG May12

Job Acceptance

0 200 400 600 800

1000 1200 1400

2 6 10 14 20 40 52 100 120 200 320 640 wor

kflo

ws

Acc

epte

d/M

in

Number of Submission Threads

Workflow Acceptance Rate

Observation: Oozie can accept a large number of jobs

Page 25: Oozie HUG May12

Time Line of a Oozie Job

Time

User submits

Job

Oozie submits to

Hadoop

Job completes at Hadoop

Job completes at Oozie

Preparation Overhead

Completion Overhead

Total Oozie Overhead = Preparation + Completion

Page 26: Oozie HUG May12

Oozie Overhead

0 200 400 600 800

1000 1200 1400 1600 1800

1 Action 5 Actions 10 Actions 50 Actions

Ove

rhea

d in

mill

isec

s

Number of Actions/Workflow

Per Action Overhead

Observation: Oozie overhead is less when multiple actions are in the same workflow.

Page 27: Oozie HUG May12

Oozie Futures

•  Scalability – Hot-Hot/Load balancing service – Replace SQL DB with Zookeeper

•  Improved Usability •  Extend the benchmarking scope •  Monitoring WS API

Page 28: Oozie HUG May12

Take Away ..

• Oozie is –  Easier to use –  Scalable –  Secure and multi-tenant

Page 29: Oozie HUG May12

Q & A

Virag Kothari [email protected]

Mohammad K Islam [email protected]

http://incubator.apache.org/oozie/

Page 30: Oozie HUG May12

Vertical Scalability

•  Oozie asynchronously processes the user request.

•  Memory resident – An internal queue to store any sub-task. – Thread pool to process the sub-tasks.

•  Both items – Fixed in size – Don’t change due to load variations – Size can be configured if needed. Might need

extra memory

Page 31: Oozie HUG May12

Challenges in Scalability

•  Centralized persistence storage – Currently supports any SQL DB (such as

Oracle, MySQL, derby etc) – Each DBMS has respective limitations – Oozie scalability is limited by the underlying

DBMS – Using of Zookeeper might be an option