
Page 1: Building Distributed Processing System from Scratch - Part 2

Distributed Systems from Scratch - Part 2: Handling Third Party Libraries

https://github.com/phatak-dev/distributedsystems

Page 2

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3: Agenda

● Idea
● Motivation
● Architecture of existing big data system
● Function abstraction
● Third party libraries
● Implementing third party libraries
● MySQL task
● Code example

Page 4: Idea

“What does it take to build a distributed processing system like Spark?”

Page 5: Motivation

● The first version of Spark had only 1600 lines of Scala code
● It had all the basic pieces of RDD and the ability to run a distributed system using Mesos
● Recreating the same code with step-by-step understanding
● Ample time in hand

Page 6: Distributed systems from 30,000 ft

[Layer diagram, bottom to top:]
● Distributed Storage (HDFS/S3)
● Distributed Cluster Management (YARN/Mesos)
● Distributed Processing Systems (Spark/MapReduce)
● Data Applications

Page 7: Our distributed system

● Mesos
● Scala function based abstraction
● Scala functions to express logic

Page 8: Function abstraction

● The whole Spark API can be summarized as a Scala function, which can be represented as: () => T
● This Scala function can be parallelized and sent over the network to run on multiple systems using Mesos
● The function is represented as a task inside the framework
● FunctionTask.scala (see the sketch below)
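A minimal sketch of what such a function task can look like, with a shape modeled on the repository's FunctionTask.scala (the actual file may differ):

```scala
// Minimal sketch: a task wrapping a zero-argument Scala function.
// Serializable lets the framework ship the task to executors.
class FunctionTask[T](val body: () => T) extends Serializable {
  def run(): T = body()
}

// Usage: any serializable () => T becomes a unit of distributed work.
val task = new FunctionTask(() => 21 * 2)
println(task.run()) // 42
```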

Page 9: Spark API as distributed function

● The initial API of Spark revolved around the Scala function abstraction for processing, with RDD as the data abstraction
● Every API like map or flatMap is represented as a function task which takes one parameter and returns one value (a sketch follows this list)
● The distribution of the functions was initially done by Mesos, and was later ported to other cluster managers
● This shows how Spark started with functional programming
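To make the connection to the () => T shape concrete, here is an illustrative sketch (not taken from the repository) of how a map over one partition reduces to a zero-argument task by closing over the partition's data and the user function:

```scala
// Illustrative sketch: packaging map over a partition as a () => Seq[U] task.
// `partition` stands in for one executor's slice of the data (assumed name).
def mapTask[T, U](partition: Seq[T], f: T => U): () => Seq[U] =
  () => partition.map(f) // each element flows through the one-in, one-out f

// Usage: the resulting thunk can be wrapped in a FunctionTask and shipped.
val task = mapTask(Seq(1, 2, 3), (x: Int) => x * 2)
println(task()) // List(2, 4, 6)
```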

Page 10: Till now

● Discussion about Mesos and its abstractions
● Hello world code on Mesos
● Defining the Function interface
● Implementing
  ○ Scheduler to run Scala code
  ○ Custom executor for Scala
  ○ Serialize and deserialize a Scala function (sketch below)
● https://www.youtube.com/watch?v=Oy9ToN4O63c
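As a refresher, a minimal sketch of the serialize/deserialize step using plain Java serialization (the previous session's code may differ in detail):

```scala
import java.io._

// Scala function literals are serializable by default, so plain Java
// serialization is enough to ship a () => T across the network.
def serialize[T](task: () => T): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(task)
  out.close()
  bytes.toByteArray
}

def deserialize[T](data: Array[Byte]): () => T = {
  val in = new ObjectInputStream(new ByteArrayInputStream(data))
  in.readObject().asInstanceOf[() => T]
}
```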

Page 11: What a local function can do?

● Access local data; even in Spark, the function normally accesses HDFS-local data
● Access the classes provided by the framework
● Run any logic which can be serialized

What it cannot do?

● Access classes from outside the framework
● Access the results of other functions (shuffle)
● Access lookup data (broadcast)

Page 12: Need of third party libraries

● The ability to add third party libraries to a distributed processing framework is important
● Third party libraries allow us to:
  ○ Connect to third party sources
  ○ Use a library to implement custom logic, like matrix manipulation, inside the function abstraction
  ○ Extend the base framework using a set of libraries, e.g. spark-sql
  ○ Optimize for specific hardware

Page 13: Approaches to third party libraries

● There are two different approaches to distributing third party jars
● UberJar - build all the dependencies together with your application code into a single jar
● The second approach is to distribute the libraries separately and add them to the classpath of the executors
● UberJar suffers from jar size and versioning issues
● So we are going to follow the second approach, which is similar to the one followed in Spark

Page 14: Design for distributing jars

[Diagram: the scheduler/driver runs the scheduler code plus a jar-serving HTTP server; Executor 1 and Executor 2 each download the jars from it over HTTP.]

Page 15: Distributing jars

● Third party jars are distributed over the HTTP protocol across the cluster
● Whenever the scheduler/driver comes up, it starts an HTTP server to serve the jars passed to it by the user
● Whenever executors are created, the scheduler passes them the URI of the HTTP server to connect to
● Executors connect to the jar server and download the jars to their respective machines, then add them to their classpath

Page 16: Code for implementing

● We need multiple changes to our existing code base to support third party jars
● The steps are:
  ○ Implementation of an embedded HTTP server
  ○ Change to the scheduler to start the HTTP server
  ○ Change to the executor to download the jars and add them to the classpath
  ○ A function which uses a third party library

Page 17: Http Server

● We implement an embedded HTTP server using Jetty
● Jetty is a popular HTTP server and J2EE servlet container from the Eclipse organization
● One of Jetty's strengths is that it can be embedded inside another program to provide an HTTP interface to some functionality
● Initial versions of Spark used Jetty for jar distribution; newer versions use Netty
● https://eclipse.org/jetty/
● HttpServer.scala (see the sketch below)
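A minimal sketch of embedding Jetty as a static jar server, assuming Jetty 9.x on the classpath (the repository's HttpServer.scala may be organized differently):

```scala
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Serves every file under baseDir (the user's jars) over HTTP.
// port = 0 asks Jetty to pick any free port.
class JarHttpServer(baseDir: String, port: Int = 0) {
  private val server = new Server(port)
  private val handler = new ResourceHandler
  handler.setResourceBase(baseDir)   // directory holding the jars
  handler.setDirectoriesListed(true) // allow listing the served jars
  server.setHandler(handler)

  def start(): Unit = server.start()
  def stop(): Unit = server.stop()
  def uri: String = server.getURI.toString // advertised to the executors
}
```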

Page 18: Scheduler change

● Once we have the HTTP server, we need to start it when we start our scheduler
● We will use the registered callback for creating our jar server
● As part of starting the jar server, we copy all the jars provided by the user to a location which becomes the base directory for the server
● Once we have the server running, we pass the server URI on to all the executors
● TaskScheduler.scala (see the sketch below)
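A sketch of the scheduler-side change, reusing the JarHttpServer above; the names userJars, jarServer and jarServerUri are assumptions, and the repository's TaskScheduler.scala may differ:

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.mesos.{Scheduler, SchedulerDriver}
import org.apache.mesos.Protos.{FrameworkID, MasterInfo}

// Sketch: start the jar server inside the Mesos registered() callback.
abstract class TaskScheduler(userJars: Seq[String]) extends Scheduler {
  private var jarServer: JarHttpServer = _
  protected var jarServerUri: String = _ // handed to executors at task launch

  // Mesos calls this once the framework is registered with the master.
  override def registered(driver: SchedulerDriver, frameworkId: FrameworkID,
                          masterInfo: MasterInfo): Unit = {
    val baseDir = Files.createTempDirectory("jars")
    // Copy every user-supplied jar into the server's base directory.
    userJars.foreach { jar =>
      val src = Paths.get(jar)
      Files.copy(src, baseDir.resolve(src.getFileName),
        StandardCopyOption.REPLACE_EXISTING)
    }
    jarServer = new JarHttpServer(baseDir.toString)
    jarServer.start()
    jarServerUri = jarServer.uri
  }
}
```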

Page 19: Executor side

● In the executor, we download the jars with calls to the jar server running on the master
● Once we have downloaded the jars, we add them to the classpath using a URLClassLoader
● We use this classloader to run our functions so that they have access to all the jars
● We plug this code into the registered callback of the executor so it runs only once
● TaskExecutor.scala (see the sketch below)
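A sketch of the executor side, again with assumed names (jarServerUri, jarNames); the repository's TaskExecutor.scala may differ:

```scala
import java.io.File
import java.net.{URL, URLClassLoader}
import java.nio.file.{Files, StandardCopyOption}

object JarClassLoaderUtil {
  // Download each jar from the scheduler's jar server into a local directory,
  // then build a URLClassLoader over the downloaded jars.
  def setupClassLoader(jarServerUri: String, jarNames: Seq[String]): ClassLoader = {
    val localUrls = jarNames.map { name =>
      val local = new File(System.getProperty("java.io.tmpdir"), name)
      Files.copy(new URL(s"$jarServerUri/$name").openStream(), local.toPath,
        StandardCopyOption.REPLACE_EXISTING)
      local.toURI.toURL
    }
    // Parent is the framework's own classloader, so framework classes stay visible.
    new URLClassLoader(localUrls.toArray, getClass.getClassLoader)
  }

  // Run a deserialized task with the downloaded jars on the context classpath.
  def runWithJars[T](loader: ClassLoader)(task: () => T): T = {
    Thread.currentThread().setContextClassLoader(loader)
    task()
  }
}
```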

Page 20: MySQL function

● This example is a function which uses the MySQL driver classes to run JDBC queries against a MySQL instance
● We ship the MySQL jar through our jar distribution framework, so it is not part of our application jar
● There is no change to our function API; it is a normal function like the other examples
● MySQLTask.scala (see the sketch below)
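A minimal sketch of such a task; the connection URL, credentials and query are placeholders, and the repository's MySQLTask.scala may differ:

```scala
import java.sql.Driver
import java.util.Properties

object MySQLTask {
  // An ordinary () => T function whose body uses the MySQL JDBC driver.
  // The driver class comes from the jar server, not the application jar, so
  // we load it through the context classloader (set by the executor) and
  // instantiate it directly: DriverManager would not see classes loaded by
  // a child classloader.
  val task: () => List[String] = () => {
    val driver = Class
      .forName("com.mysql.jdbc.Driver", true,
        Thread.currentThread().getContextClassLoader)
      .getDeclaredConstructor().newInstance().asInstanceOf[Driver]
    val props = new Properties()
    props.setProperty("user", "user")         // placeholder credentials
    props.setProperty("password", "password")
    val conn = driver.connect("jdbc:mysql://localhost:3306/test", props)
    try {
      val rs = conn.createStatement().executeQuery("select name from users")
      Iterator.continually(rs).takeWhile(_.next()).map(_.getString(1)).toList
    } finally conn.close()
  }
}
```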