Hadoop & Spark Performance Tuning Using Dr. Elephant

Dr. Elephant

github.com/linkedin/dr-elephant

Akshay Rai

Hadoop Dev Team

Introduction

Scaling Hadoop Infrastructure

Scale and Optimize Hardware

● More users, more jobs, more resources

● Large investment in hardware

● Can’t keep upgrading and adding machines forever to solve the problem

● Some tuning is needed to get things running

Users are more valuable than machines

What do we do?

Improve User Productivity

User Productivity

● Freedom to experiment and run jobs on the cluster

● Build tools to help developers (Hadoop DSL, Resolvers for Pig/Hive)

○ Improve developer lifecycle

○ Also reduce unnecessary resource wastage

The Tuning Problem

How easy is it to tune a job?

● Problems are not obvious

● Critical information is scattered

● Inter-related settings

● Large parameter space

Here’s what we learned!

Expert Intervention

● Not enough support resources available

● Poor coverage

● Difficult to prioritize efforts

● Delays user development

Random Suggestions

Training is not at all easy

● Too many users

● Diverse backgrounds

● Scope is large and evolving

● Other responsibilities are more important

Scaling Productivity is Hard!

Dr. Elephant to the Rescue

What does Dr. Elephant do?

● Automated performance monitoring and tuning tool

● Help every user get the best performance from their jobs

● Highlights common mistakes

● Indicates best practices and tuning tips

● Provides a platform for other performance related tools

● Analyzes a hundred thousand jobs every day

Architecture

Dashboard

Search

Job Page

MapReduce Report

Failed Job

Help Page

Tuning Tips

Awesome Features

Simplified analysis of a flow’s historical executions

● Monitoring performance, resource usage, and other metrics

● Comparing flows against previous executions

● Impact of tuning a specific parameter or changing a line of code

Flow History

Job History

Heuristics

How does a Heuristic work?

● Fetch Counters and Task Data

● Some logic to compute a value

● Compare value against threshold levels
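The three steps above can be sketched in a few lines. This is a minimal illustration, not Dr. Elephant's actual code: the counter names `gc_time_ms` and `cpu_time_ms`, the function names, and the default threshold values are all assumptions made for the example. Only the five severity names come from the deck.

```python
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


def severity_for(value, thresholds):
    """Map a value to a severity using four ascending threshold levels.

    thresholds = (low, moderate, severe, critical). A value below the
    first threshold maps to NONE.
    """
    level = Severity.NONE
    # Pair LOW..CRITICAL with the four limits and keep the highest one reached.
    for candidate, limit in zip(list(Severity)[1:], thresholds):
        if value >= limit:
            level = candidate
    return level


def gc_ratio_heuristic(counters, thresholds=(0.01, 0.02, 0.03, 0.04)):
    """Hypothetical heuristic: share of CPU time spent in garbage collection."""
    # Step 1: fetch counters (here, a plain dict with assumed key names).
    gc_time = counters["gc_time_ms"]
    cpu_time = counters["cpu_time_ms"]
    # Step 2: some logic to compute a value.
    ratio = gc_time / cpu_time
    # Step 3: compare the value against the threshold levels.
    return ratio, severity_for(ratio, thresholds)
```

Calling `gc_ratio_heuristic({"gc_time_ms": 35, "cpu_time_ms": 1000})` would compute a ratio of 0.035 and rate it SEVERE under these assumed thresholds.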

Heuristic Severity

Severity    Description
CRITICAL    The job is in a critical state and must be tuned
SEVERE      There is scope for improvement
MODERATE    There is scope for further improvement
LOW         There is scope for a few minor improvements
NONE        The job is safe. No tuning necessary

Example | Mapper Data Skew

Mapper Skew Problem

● The number of Mappers depends on the number of splits

● Varying split sizes can cause skew in the Mapper input
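One simple way to put a number on this kind of skew is to compare the average input size of the larger half of the tasks with that of the smaller half. The sketch below is only an illustration under that assumption and is not claimed to be the calculation Dr. Elephant performs. The function name `skew_ratio` is hypothetical.

```python
from statistics import mean


def skew_ratio(input_bytes):
    """Ratio of the average input size of the larger half of the tasks
    to that of the smaller half. A ratio of 1.0 means balanced input."""
    if len(input_bytes) < 2:
        return 1.0
    ordered = sorted(input_bytes)
    middle = len(ordered) // 2
    small, large = ordered[:middle], ordered[middle:]
    small_avg = mean(small)
    if small_avg == 0:
        # The smaller half read nothing at all: treat as unbounded skew.
        return float("inf")
    return mean(large) / small_avg
```

With per-mapper input sizes of `[10, 10, 80, 80]`, the larger half averages eight times the smaller half, so the ratio is 8.0. A ratio like this could then be compared against threshold levels to assign a severity.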

Solution to Mapper Skewness

● Each Mapper should process the same amount of data

● Combine the small chunks and feed them to a single Mapper
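The idea of combining small chunks can be sketched as a greedy packing pass: chunks are appended to the current group until adding one more would exceed a target size. This is a simplified illustration, not how Hadoop's input formats are implemented. The function name `combine_splits` and the `target_size` parameter are assumptions for the example.

```python
def combine_splits(chunk_sizes, target_size):
    """Greedily pack consecutive chunks into groups of at most target_size.

    A chunk that is already larger than target_size gets a group to itself.
    Each returned group stands for the input of one Mapper.
    """
    groups = []
    current, current_size = [], 0
    for size in chunk_sizes:
        # Close the current group if this chunk would push it over the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

For chunk sizes `[10, 20, 30, 100, 5, 5]` and a target of 64, the small chunks are merged into `[10, 20, 30]` and `[5, 5]`, while the 100-unit chunk keeps a Mapper of its own. Six Mappers become three.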

Example | Spark Executor Load Balance

[Diagram: a Spark Driver with Executor 1, Executor 2, and Executor 3; an RDD divided into Partition 1, Partition 2, and Partition 3]

Custom Heuristics

Adding a New Heuristic

1. Create a new heuristic and test it.

2. Create a new view for the heuristic. For example, helpMapperSpill.scala.html

3. Add the details of the heuristic in the HeuristicConf.xml file.

    <heuristic>
      <applicationtype>mapreduce</applicationtype>
      <heuristicname>Mapper GC</heuristicname>
      <classname>com.linkedin.dre.mapreduce.heuristics.MapperGC</classname>
      <viewname>views.html.help.mapreduce.helpGC</viewname>
    </heuristic>

4. Run Dr. Elephant. It should now include the new heuristic.

Configuring Heuristics/Threshold levels

    <heuristics>
      <heuristic>
        <applicationtype>mapreduce</applicationtype>
        <heuristicname>Mapper Data Skew</heuristicname>
        <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
        <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
        <params>
          <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
          <deviation_severity>2, 4, 8, 16</deviation_severity>
          <files_severity>1/8, 1/4, 1/2, 1</files_severity>
        </params>
      </heuristic>
    </heuristics>
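To make the shape of the `params` block concrete, the sketch below reads a trimmed copy of the configuration above and turns each comma-separated list into four numeric levels, including the fractional `files_severity` entries. It only shows how such a file could be read with Python's standard library. The helper `load_thresholds` is hypothetical, and Dr. Elephant itself does its own parsing.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# Trimmed copy of the configuration shown above (classname/viewname omitted).
CONF = """
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
"""


def load_thresholds(xml_text):
    """Return {heuristic name: {param name: [level, level, level, level]}}."""
    result = {}
    for heuristic in ET.fromstring(xml_text).iter("heuristic"):
        name = heuristic.findtext("heuristicname")
        levels = {}
        params = heuristic.find("params")
        if params is not None:
            for param in params:
                # Fraction accepts both "16" and "1/8", so one parser covers all lists.
                levels[param.tag] = [
                    float(Fraction(item.strip())) for item in param.text.split(",")
                ]
        result[name] = levels
    return result


thresholds = load_thresholds(CONF)
```

Here `thresholds["Mapper Data Skew"]["files_severity"]` holds `[0.125, 0.25, 0.5, 1.0]`, one value per severity level from lowest to highest.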

Elephagent

Workflow monitoring and reports

● Performance characteristics change

○ Data Growth

○ Data distribution change

○ Hardware change

○ Incremental software change

● Monitor performance on each execution

● Compare behaviour across revisions

● Cost to Serve analysis
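Monitoring each execution and comparing it with earlier ones can be as simple as tracking the percentage change of a metric between consecutive runs. The sketch below is an illustrative assumption about what such a check might look like, not a description of Elephagent. The function names and the 20% default tolerance are made up for the example.

```python
def percent_change(previous, current):
    """Percentage change from previous to current (previous must be non-zero)."""
    return (current - previous) / previous * 100.0


def flag_regressions(runtimes_sec, tolerance_pct=20.0):
    """Return (execution index, percent change) for every execution that ran
    slower than the one before it by more than tolerance_pct."""
    flagged = []
    for i in range(1, len(runtimes_sec)):
        change = percent_change(runtimes_sec[i - 1], runtimes_sec[i])
        if change > tolerance_pct:
            flagged.append((i, round(change, 1)))
    return flagged
```

For runtimes of `[100, 105, 150, 140]` seconds, only the third execution is flagged: it is about 42.9% slower than the one before it, which could point to data growth or a code change worth reviewing.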

Production Reviews | JIRA Bot

● Separate cluster for critical workloads

● Audit before deployment

● Improved accuracy

● Faster turnaround

● Higher throughput

Future Plans

Upcoming

● Job Resource Usage and Wastage

● Job Wait time

● Real time analysis of a job

● Workflow DAG visualization

● Improved Spark heuristics

References

Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Open Source GitHub Link: github.com/linkedin/dr-elephant

Mailing List: dr-elephant-users (https://groups.google.com/forum/#!forum/dr-elephant-users)

Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA

Thank You
