yarn by default (spark on yarn)

YARN by default(Spark on YARN)

Ferran Galí i Reniu@ferrangali

About me

@ferrangali

100 MB/s

2 TB = 3.5 hours

The Big Data problem

100 MB/s

2 TB = 30 min

The Big Data problem

Source: Google data centers

http://www.google.com/about/datacenters/

http://www.google.com/about/datacenters/

HDFS

Node Node Node Node Node Node Node NodeHardware

HDFS

Node Node Node Node Node Node Node Node

HDFS - Hadoop Distributed File System

Hardware

Storage

$> hadoop fs -ls

HDFS

$> hadoop fs -lsFound 2 itemsdrwxr-xr-x - hadoop supergroup 0 2015-06-11 11:27 dir-rw-r--r-- 1 hadoop supergroup 2198927 2015-06-10 17:22 file1.txt

$>

HDFS


$> hadoop fs -ls dir

HDFS


$> hadoop fs -ls dirFound 2 items-rw-r--r-- 1 hadoop supergroup 2198927 2015-06-10 17:22 dir/file2.txt-rw-r--r-- 1 hadoop supergroup 2198927 2015-06-10 17:22 dir/file3.txt

$>

HDFS



$> hadoop fs -cat dir/file3.txt

HDFS



$> hadoop fs -cat dir/file3.txtline1line2line3line4line5

HDFS

MapReduce



Hardware

Storage

MapReduce



Hardware

Storage

Processing



Hardware

Storage

Processing Job

The Big Data problemData Pipeline

Application

Map

Map

Map

Map

Reduce

Reduce

Reduce

MapReduce Job

Split

Write

Write

Write

Split

Split

Split

map(){ // Your code here}

reduce(){ // Your code here}



Hardware

Storage

Processing Job


Application



Hardware

Storage

Processing Job Job


Application

NodeJobTracker

NodeTaskTracker

MapReduce 1.0 Architecture

NodeTaskTracker

NodeTaskTracker

NodeTaskTracker


NodeTaskTracker

Map Map Map Map

Map Map Map Reduce

Reduce Reduce Reduce Reduce

NodeJobTracker

NodeTaskTracker


NodeTaskTracker

NodeTaskTracker

NodeTaskTracker

Application

NodeJobTracker

NodeTaskTracker


NodeTaskTracker

NodeTaskTracker

NodeTaskTracker

Application

Map

NodeJobTracker

NodeTaskTracker


NodeTaskTracker

NodeTaskTracker

NodeTaskTracker

Application

Reduce

Limitations



YARN

Hardware

Storage

Resource Manager

The Big Data problemYARN - Yet Another Resource Negotiator

Processing

Memory

NodeManagerResourceManager

YARN Architecture

NodeManager

x8 x8

x8 x8


YARN Architecture

NodeManager

Applicationx61 core1024MB

x8 x8

x8 x8


YARN Architecture

ApplicationMaster

NodeManager


x8 x8

x8 x8


YARN Architecture

ApplicationMaster

Container

Container

Container

NodeManager

Container

Container


Container

x8 x8

x8 x8


YARN Architecture

ApplicationMaster

Map

Map

Map

NodeManager

Map

Map


Map

x8 x8

x8 x8


YARN Architecture

ApplicationMaster

Reduce

Reduce

Reduce

NodeManager

Reduce

Reduce


Reduce

x8 x8

x8 x8


YARN Architecture

ApplicationMaster

Container

Container

Container

NodeManager

Container

Container

Container


Application 2x42 cores2048MB

x8 x8

x8 x8


YARN Architecture

ApplicationMaster

Container

Container

Container

NodeManager

Container

Container

ApplicationMaster

Container



x8 x8

x8 x8

NodeManager

Container

ResourceManager

YARN Architecture

Container

ApplicationMaster

Container

Container

Container

NodeManager

Container

Container

Container

ApplicationMaster

Container

Container



x8 x8

x8 x8



YARN

Hardware

Storage

Resource Manager

The Big Data problemNew Paradigms

Processing

Application

Batch



YARN

Hardware

Storage

Resource Manager


Processing

Application

In Memory / Streaming



YARN

Hardware

Storage

Resource Manager


Processing ...

Application

Improved Data Pipelines

Map

Reduce

Map

Map

Reduce

Map

Map

Reduce

Map

Improved Data Pipelines

Map

Reduce

Map

Map

Reduce

Map

Reduce

The Big Data problemSpark Job

def main(args: Array[ String]): Unit = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf)

sc.rdd(...).action() sc.stop()}

RDDs


Main

RDDs

Stage Stage Stage

...


Main

RDDs

Stage

Task Task

Task Task

Stage Stage

Task

Task

...


Task

Main

Task Task Task

Task

Task

Task Task

Task Task

Task Task

The Big Data problemSpark Architecture

Executor

Driver

Executor Executor Executor

coordinates stage tasks across executors


Executor

Driver


Main



Executor

Driver


Main

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task

Task Task




YARN

Hardware

Storage

Resource Manager

The Big Data problemSpark on YARN

Processing

The Big Data problemConfiguration

export HADOOP_CONF_DIR="/opt/hadoop/hadoop_install/conf"core-site.xmlhdfs-site.xmlyarn-site.xml

The Big Data problemConfiguration

spark.yarn.jar hdfs:///user/hadoop/libs/spark-assembly.jar# need to upload the file to HDFS

spark.eventLog.enabled truespark.eventLog.dir hdfs:///app-logs/spark/logsspark.yarn.historyServer.address historyserver_host:18080

# needed if you want to analyse finished jobs

The Big Data problemExecutor configuration

spark.executor.cores 4# the number of tasks that will execute each executor

spark.executor.memory 4g# shared memory across all tasks

spark.yarn.executor.memoryOverhead 614m# 15-20% of total executor memory

Executor

Task

Task

Task

Task

The Big Data problemDeployment

$ ./spark-submit--class my.main.class--master {deploy-mode}my-jar.jararg1 arg2 arg3 ...

yarn-client

yarn-cluster

The Big Data problem--master yarn-client

ResourceManager

NodeManager

Client$ ./spark-submit

NodeManager


ResourceManager

NodeManager


Driver

NodeManager


ResourceManager

NodeManager


Container

Container

Container

Driver

NodeManager

Container

Container

Container


ResourceManager

NodeManager


Executor

Executor

Executor

Driver

NodeManager

Executor

Executor

Executor

The Big Data problem--master yarn-cluster

ResourceManager

NodeManager


NodeManager


ResourceManager

NodeManager


Driver

NodeManager

NodeManager


ResourceManagerClient

$ ./spark-submit

Driver

NodeManager


ResourceManager

NodeManager


Container

Driver

Container

NodeManager

Container

Container

Container


ResourceManager

NodeManager


Executor

Driver

Executor

NodeManager

Executor

Executor

Executor

The Big Data problemStatic allocation

spark.executor.instances 10# it will allocate this fixed number of executors

The Big Data problemStatic allocation

The Big Data problemDynamic allocation

spark.shuffle.service.enabled true# also need to install an auxiliary services to the nodemanagers

spark.dynamicAllocation.enabled truespark.dynamicAllocation.initialExecutors 0spark.dynamicAllocation.minExecutors 1spark.dynamicAllocation.maxExecutors 10

# will allocate and release executors as needed

The Big Data problemDynamic allocation



YARN

Hardware

Storage

Resource Manager

The Big Data problemTrovit YARN cluster

Processing

Trovit

Business Intelligence

Search engine

Mailing Push Notifications

Online Media Buying

Questions?

Thank YouFerran Galí i Reniu

@ferrangali

Icons made by Freepik from Flaticon is licensed by CC BY 3.0

http://www.flaticon.com/authors/freepik

http://www.flaticon.com

http://creativecommons.org/licenses/by/3.0/

yarn by default (spark on yarn)

Technology