Running Non-MapReduce Applications on Apache Hadoop

© Hortonworks Inc. 2011 Running Non-MapReduce Applications on Apache Hadoop Hitesh Shah & Siddharth Seth Hortonworks Inc. Page 1

Post on 13-Feb-2016

TRANSCRIPT

Page 1: Running Non-MapReduce Applications on Apache Hadoop

Hitesh Shah & Siddharth Seth

Hortonworks Inc.

Page 2: Who am I?

• Hitesh Shah
  – Member of Technical Staff at Hortonworks Inc.
  – Apache Hadoop PMC member and committer
  – Apache Tez and Apache Ambari PPMC member and committer
• Siddharth Seth
  – Member of Technical Staff at Hortonworks Inc.
  – Apache Hadoop PMC member and committer
  – Apache Tez PPMC member and committer

Architecting the Future of Big Data

Page 3: Agenda

• Apache Hadoop v1 to v2
• YARN
• Applications on YARN
• YARN Best Practices

Page 4: Apache Hadoop v1

[Diagram: Job Client, a "Submit Job" arrow to the JobTracker, and several TaskTrackers; legend: Map Slot, Reduce Slot]

Page 5: Apache Hadoop v1

• Pros:
  – A framework to run MapReduce jobs that lets you run the same piece of code on anything from a single-node cluster to one spanning 1000s of machines.
• Cons:
  – It is a framework to run MapReduce jobs.

Page 6: Apache Giraph

• Iterative graph processing on a Hadoop cluster
• An iterative approach on MapReduce would require running multiple jobs.
• To avoid MR overheads, runs everything as a Map-only job.

[Diagram: one "Map Task: Master" box and eight "Map Task: Worker" boxes]

Page 7: Apache Oozie

• Workflow scheduler system to manage Hadoop jobs.
• Running a Pig script through Oozie

[Diagram: Oozie, the JobTracker, and a "MapTask: Pig Script Launcher"; arrows labeled "Submit Job" and "Submit Subsequent MR jobs"; steps numbered 1, 2, 3]

Page 8: Apache Hadoop v2

Page 9: YARN

The Operating System of a Hadoop cluster

Page 10: The YARN Stack

Page 11: YARN Glossary

• Installer
  – Application Installer or Application Client
• Client
  – Application Client
• Supervisor
  – Application Master
• Workers
  – Application Containers

Page 12: YARN Architecture

[Diagram: Clients, a "Submit Application" arrow to the ResourceManager, and several NodeManagers hosting AppMasters and Containers]

Page 13: YARN Application Flow

[Diagram: components and the labels attached to them]
• Application Client: YarnClient, App Specific API
• Application Master: AMRMClient, NMClient
• ResourceManager
• NodeManager: App Container
• Protocol labels: Application Client Protocol, Application Master Protocol, Container Management Protocol

Page 14: YARN Protocols & Client Libraries

• Application Client Protocol: Client to RM interaction
  – Library: YarnClient
  – Application Lifecycle control
  – Access Cluster Information
• Application Master Protocol: AM to RM interaction
  – Library: AMRMClient / AMRMClientAsync
  – Resource negotiation
  – Heartbeat to the RM
• Container Management Protocol: AM to NM interaction
  – Library: NMClient / NMClientAsync
  – Launching allocated containers
  – Stopping running containers
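The AM side of this flow (negotiate resources over heartbeats, then launch whatever was granted) can be illustrated with a toy model. This is a sketch only: `ToyResourceManager`, `ToyApplicationMaster`, and the per-heartbeat grant limit are hypothetical names and assumptions made for illustration, and none of this is the YARN client API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the ResourceManager; not the YARN API."""
    free_containers: int
    pending: int = 0

    def request(self, count: int) -> None:
        # Resource negotiation: the AM registers how many containers it wants.
        self.pending += count

    def heartbeat(self, max_per_beat: int = 2) -> int:
        # Each heartbeat returns whatever can be granted right now,
        # which may be fewer containers than were requested.
        granted = min(self.pending, self.free_containers, max_per_beat)
        self.pending -= granted
        self.free_containers -= granted
        return granted


@dataclass
class ToyApplicationMaster:
    """Hypothetical AM: negotiates with the RM, then launches containers."""
    rm: ToyResourceManager
    launched: list = field(default_factory=list)

    def run(self, needed: int, max_beats: int = 10) -> int:
        self.rm.request(needed)
        beats = 0
        while len(self.launched) < needed and beats < max_beats:
            beats += 1
            for _ in range(self.rm.heartbeat()):
                # In YARN this step would go to a NodeManager over the
                # Container Management Protocol; here it is only recorded.
                self.launched.append(f"container_{len(self.launched) + 1}")
        return beats


rm = ToyResourceManager(free_containers=8)
am = ToyApplicationMaster(rm)
beats_used = am.run(needed=5)
print(beats_used, am.launched)
```

The point of the model is that allocation arrives incrementally: the AM keeps heartbeating and launches containers as they are granted rather than expecting the full request to be satisfied at once.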

Page 15: Applications on YARN

Page 16: YARN Applications

• Categorizing Applications
  – What does the Application do?
  – Application Lifetime
  – How Applications accept work
  – Language
• Application Lifetime
  – Job submit to complete.
  – Long-running Services
• Job Submissions
  – One job : One Application
  – Multiple jobs per application

Page 17: Language considerations

• Hadoop RPC uses Google Protobuf
  – Protobuf bindings: C/C++, Go, Java, Python…
• Accessing HDFS
  – WebHDFS
  – libhdfs for C
  – Python client by Spotify Labs: Snakebite
• YARN Application Logic
  – ApplicationMaster in Java and containers in any language
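Because WebHDFS is a REST interface, a container written in any language can reach HDFS with plain HTTP. A minimal sketch of building a WebHDFS request URL follows; the host name, port, and file path are placeholders chosen for illustration, and `webhdfs_url` is a hypothetical helper, not part of any Hadoop library.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a WebHDFS REST URL of the form
    http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&...
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation name goes first, followed by any extra query parameters.
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder NameNode address and file path.
url = webhdfs_url("namenode.example.com", 50070, "/user/alice/data.txt", "OPEN")
print(url)
```

The helper only builds the URL; issuing the request and following redirects is left to whatever HTTP client the container language provides.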

Page 18: Tez (App Submission)

• Distributed execution framework: computation is expressed as a DAG
• Takes MapReduce to the next level: a MapReduce job was limited to a Map and/or Reduce stage.

[Diagram: Tez application submission on YARN]
• Tez Client: Job Submission, Monitoring
• Tez AM: DAG execution logic, Task co-ordination, Local Task Scheduling
• YARN: Resource Manager, Node Manager(s), Tasks
• Message labels: Submit DAG, Launch AM, AM Launched, Request Resources, Allocated Resources, Launch Tasks, Tasks Launched, Heartbeat
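To make "computation is expressed as a DAG" concrete, here is a small sketch that derives a valid execution order for a DAG of processing stages using Kahn's algorithm. The vertex names are invented for illustration, and this is not how the Tez AM schedules work internally; it only shows the ordering constraint a DAG imposes.

```python
from collections import deque


def execution_order(dag: dict) -> list:
    """Return one valid vertex order for a DAG given as
    {vertex: [downstream vertices]}, using Kahn's algorithm."""
    indegree = {v: 0 for v in dag}
    for downstream in dag.values():
        for v in downstream:
            indegree[v] = indegree.get(v, 0) + 1
    # Vertices with no upstream dependencies can start immediately.
    ready = deque(sorted(v for v, d in indegree.items() if d == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in dag.get(v, []):
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(indegree):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical four-stage computation: two scans feed a join,
# and the join feeds an aggregation.
dag = {
    "scan_orders": ["join"],
    "scan_users": ["join"],
    "join": ["aggregate"],
    "aggregate": [],
}
order = execution_order(dag)
print(order)
```

Expressed as MapReduce, a pipeline like this would typically need more than one job chained together; as a DAG it is a single unit of work with explicit dependencies.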

Page 19: HOYA (Long Running App)

• On-demand HBase cluster setup
• Share cluster resources: persist and shut down the cluster when not needed
• Dynamically handles node failures
• Allows re-sizing of a running HBase cluster

Page 20: Samza on YARN (Failure Handling App)

• Stream processing system that uses YARN as the execution framework
• Makes use of CGroups support in YARN for CPU isolation
• Uses Kafka as underlying store

[Diagram: Samza on YARN]
• YARN: Resource Manager, Node Manager(s)
• Samza AM
• Three Task Containers, each running several Tasks
• Kafka (Streams)
• Message labels: Container Finished, Get New Containers, Launch Container

Page 21: YARN Eco-system

Powered by YARN
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache S4 – Stream Processing
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative/Interactive applications
• Cloudera Llama
• DataTorrent
• HOYA – HBase on YARN
• RedPoint Data Management

YARN Utilities/Frameworks
• Weave by Continuity
• REEF by Microsoft
• Spring support for Hadoop 2

Page 22: YARN Best Practices

Page 23: Best Practices

• Use provided Client libraries
• Resource Negotiation
  – You may ask, but you may not get what you want immediately.
  – Locality requests may not always be met.
  – Resources like memory/CPU are guaranteed.
• Failure handling
  – Remember, anything can fail (or YARN can pre-empt your containers).
  – AM failures are handled by YARN, but container failures are handled by the application.
• Checkpointing
  – Checkpoint AM state for AM recovery.
  – If tasks are long running, checkpoint task state.
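The failure-handling and checkpointing advice can be sketched together: the application retries work whose container failed, and records finished work so that a restarted AM does not repeat it. Everything here is an illustrative assumption: `run_tasks`, `run_in_container`, the JSON state layout, and the use of a local file as the checkpoint store (a real AM would more likely write to shared storage such as HDFS).

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or an empty state if none exists yet."""
    if not os.path.exists(path):
        return {"done": []}
    with open(path) as f:
        return json.load(f)


def run_tasks(tasks, run_in_container, checkpoint_path, max_attempts=3):
    """Run each task, retrying failed containers, and checkpoint progress."""
    done = set(load_checkpoint(checkpoint_path)["done"])
    for task in tasks:
        if task in done:
            continue  # finished before a previous AM attempt stopped
        for attempt in range(1, max_attempts + 1):
            if run_in_container(task, attempt):
                done.add(task)
                save_checkpoint(checkpoint_path, {"done": sorted(done)})
                break
        else:
            raise RuntimeError(f"{task} failed {max_attempts} times")
    return sorted(done)


calls = []


def flaky(task, attempt):
    # Hypothetical container launcher: task "t2" fails on its first attempt.
    calls.append((task, attempt))
    return not (task == "t2" and attempt == 1)


ckpt = os.path.join(tempfile.mkdtemp(), "am_state.json")
finished = run_tasks(["t1", "t2", "t3"], flaky, ckpt)
print(finished, calls)
```

Calling `run_tasks` a second time with the same checkpoint path launches nothing, which is the behavior a recovering AM wants.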

Page 24: Best Practices

• Cluster Dependencies
  – Try to make zero assumptions on the cluster.
  – Your application bundle should deploy everything required using YARN’s local resources.
• Client-only installs if possible
  – Simplifies cluster deployment and multi-version support
• Securing your Application
  – YARN does not secure communications between the AM and its containers.
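Since securing AM-to-container traffic is left to the application, one possible building block is to authenticate each message with an HMAC over a shared secret. This is a sketch of that one technique, not something YARN provides: the message format is invented, and how the secret reaches the containers is assumed to be handled elsewhere. It gives integrity and authenticity only, not confidentiality.

```python
import hashlib
import hmac
import os


def sign(secret: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag for the message."""
    return hmac.new(secret, message, hashlib.sha256).digest()


def verify(secret: bytes, message: bytes, tag: bytes) -> bool:
    """Check the tag in constant time to avoid timing leaks."""
    return hmac.compare_digest(sign(secret, message), tag)


# Hypothetical per-application secret and status message.
secret = os.urandom(32)
message = b"task-7:status=DONE"
tag = sign(secret, message)

print(verify(secret, message, tag))
print(verify(secret, b"task-7:status=FAILED", tag))
```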

Page 25: Testing/Debugging your Application

• MiniYARNCluster
  – Regression tests
• Unmanaged AM
  – Support to run the AM outside of a YARN cluster for manual testing.
• Logs
  – Log aggregation support to push all logs into HDFS
  – Accessible via CLI, UI.
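For the CLI route, aggregated logs are fetched with `yarn logs -applicationId <application ID>`. A small sketch of assembling that command from a script follows; `yarn_logs_command` is a hypothetical helper, the ID pattern is an assumption based on the usual `application_<timestamp>_<sequence>` shape, and the example ID is made up.

```python
import re

# Assumed shape of a YARN application ID: application_<digits>_<digits>.
APP_ID = re.compile(r"^application_\d+_\d+$")


def yarn_logs_command(app_id: str) -> list:
    """Return the argv list for fetching aggregated logs of one application."""
    if not APP_ID.match(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "logs", "-applicationId", app_id]


cmd = yarn_logs_command("application_1385245142800_0001")
print(" ".join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine where the `yarn` command is installed.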

Page 26: Future work in YARN

• ResourceManager High Availability and Work-preserving restart
  – Work in progress
• Scheduler Enhancements
  – SLA-driven scheduling, gang scheduling
  – Multiple resource types: disk/network/GPUs/affinity
• Rolling upgrades
• Long running services
  – Better support for running services like HBase
  – Discovery of services, upgrades without downtime
• More utilities/libraries for Application Developers
  – Failover/Checkpointing
• http://hadoop.apache.org

Page 27: Questions?