Cooperative Data Exploration with IPython Notebook


TRANSCRIPT

Piotr Lusakowski

Cooperative Data Exploration
with IPython Notebook

Motivation


● Big Data computations require lots of resources

  ○ CPU
  ○ RAM

● Sharing the results is difficult in most current setups

  ○ Precomputed datasets
  ○ Trained models
  ○ Insights

Solution
Created for the Seahorse 1.0 release

● Single Spark application as the backend
  ○ Results of other team members easily accessible in-memory
  ○ No unnecessary duplication of data

● Multiple IPython Notebooks as clients


Challenges

● How to use the SparkContext and SQLContext of an application running on a cluster?

● How to execute Python code on the cluster?

Py4J
A library for Python-Java communication

● “Wraps” JVM-based objects

● Exposes their API in Python

● Internally, it uses custom TCP client/server communication

● In the JVM: a Gateway Server

● On the Python side: a client called the Java Gateway (see the sketch below)
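To make the Gateway Server / Java Gateway pairing concrete, here is a minimal, generic Py4J session on the Python side. This is stock Py4J usage, not Seahorse code; it assumes some JVM process has already started a py4j GatewayServer on the default port, and the entry point is whatever object that application chose to expose.

```python
# Python side of a Py4J connection: a Java Gateway client talking to a
# Gateway Server already running inside some JVM application.
from py4j.java_gateway import JavaGateway, GatewayParameters

# Assumes the JVM side did roughly: new GatewayServer(entryPoint).start()
# and is listening on Py4J's default port.
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))

# Arbitrary JVM classes become callable as if they were Python objects.
random = gateway.jvm.java.util.Random()
print(random.nextInt(100))

# The application can also expose a single "entry point" object whose
# methods act as its Python-facing API.
entry_point = gateway.entry_point
```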

Using an Existing SparkContext

● The Spark application exposes its SparkContext and SQLContext
  ○ It’s actually quite easy, once you know what you’re doing

● The Notebook connects to the Spark application via Py4J on startup (see the sketch below)
  ○ The sc and sqlContext variables are added to the user’s environment
  ○ This setup is completely transparent to the user
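The deck does not show the wiring code, so the following is only a sketch of how a notebook process might wrap contexts that already live in the Spark application's JVM, using PySpark's ability to be constructed around an existing Py4J gateway. The getSparkContext()/getSQLContext() entry-point methods are hypothetical names for whatever the backend exposes, the entry point is assumed to hand back a JavaSparkContext, and the Spark 1.x SQLContext API (Seahorse 1.0 era) is assumed.

```python
# Sketch: attach a notebook to an already-running Spark application.
# getSparkContext()/getSQLContext() are hypothetical entry-point methods.
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

gateway = JavaGateway(
    gateway_parameters=GatewayParameters(port=25333, auto_convert=True))

java_sc = gateway.entry_point.getSparkContext()         # JVM JavaSparkContext
java_sql_context = gateway.entry_point.getSQLContext()  # JVM SQLContext

# Wrap the JVM objects in their PySpark counterparts (Spark 1.x API).
conf = SparkConf(_jvm=gateway.jvm, _jconf=java_sc.getConf())
sc = SparkContext(gateway=gateway, jsc=java_sc, conf=conf)
sqlContext = SQLContext(sc, sqlContext=java_sql_context)

# From here on, sc and sqlContext behave like locally created contexts.
sqlContext.sql("SELECT 1 AS probe").show()
```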

Notebook Architecture Overview


Requirements

● User’s code is executed by kernels - processes spawned by the Notebook Server

● Kernels execute the user’s code on the Notebook Server host

Custom Kernel

● User’s code is executed on the Spark driver

● No assumptions about the driver being visible from the Notebook Server

● Forwarding Kernel

● Executing Kernel

● Message Queue (see the sketch below)
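The deck names the pieces but shows no code, so below is a heavily simplified sketch of the forwarding side only: an IPython wrapper kernel that executes nothing locally and instead ships each execute request over a message queue, then waits for the executing kernel (running next to the Spark driver) to reply. The broker (RabbitMQ via pika), queue names and message format are illustrative assumptions, not Seahorse's actual implementation.

```python
# Sketch of a "Forwarding Kernel": an IPython wrapper kernel that runs no
# code itself, but forwards every execute request over a message queue to
# an executing kernel living next to the Spark driver. Broker choice,
# queue names and message format are assumptions made for illustration.
import json
import uuid

import pika
from ipykernel.kernelbase import Kernel


class ForwardingKernel(Kernel):
    implementation = 'forwarding_kernel'
    implementation_version = '0.1'
    language_info = {'name': 'python', 'mimetype': 'text/x-python',
                     'file_extension': '.py'}
    banner = 'Forwards code to an executing kernel on the Spark driver'

    def __init__(self, **kwargs):
        super(ForwardingKernel, self).__init__(**kwargs)
        self._connection = pika.BlockingConnection(
            pika.ConnectionParameters(host='localhost'))
        self._channel = self._connection.channel()
        self._channel.queue_declare(queue='execute_requests')
        # Exclusive, auto-named queue on which this kernel receives replies.
        reply = self._channel.queue_declare(queue='', exclusive=True)
        self._reply_queue = reply.method.queue

    def do_execute(self, code, silent, store_history=True,
                   user_expressions=None, allow_stdin=False):
        corr_id = str(uuid.uuid4())
        # Forward the code instead of executing it locally.
        self._channel.basic_publish(
            exchange='',
            routing_key='execute_requests',
            properties=pika.BasicProperties(reply_to=self._reply_queue,
                                            correlation_id=corr_id),
            body=json.dumps({'code': code}))

        # Block until the executing kernel publishes the matching reply.
        for _method, props, body in self._channel.consume(self._reply_queue,
                                                          auto_ack=True):
            if props.correlation_id == corr_id:
                result = json.loads(body)
                self._channel.cancel()
                break

        if not silent and result.get('stdout'):
            self.send_response(self.iopub_socket, 'stream',
                               {'name': 'stdout', 'text': result['stdout']})
        return {'status': 'ok', 'execution_count': self.execution_count,
                'payload': [], 'user_expressions': {}}


if __name__ == '__main__':
    from ipykernel.kernelapp import IPKernelApp
    IPKernelApp.launch_instance(kernel_class=ForwardingKernel)
```

The Executing Kernel would be the mirror image: it consumes from execute_requests, runs the code in a namespace that already holds sc and sqlContext, and publishes the captured output to the reply_to queue.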

The Interaction Between Users

● Storage object accessible via Py4J (see the sketch below)
  ○ Each client connected to the Spark application can reuse any entity from the storage
    ■ DataFrames
    ■ Models
    ■ Even code snippets
  ○ Access control
    ■ Sharing with only selected colleagues
    ■ Private storage
  ○ Notifications: “Hey, look, Susan published a new result!”
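The storage itself lives in the Spark application's JVM and is reached through Py4J. Its real API is not shown in the deck, so every method name below (getStorage, register, setPrivate, shareWith, listVisible) is a hypothetical stand-in used only to illustrate the idea; sc and sqlContext are assumed to be the variables injected into the notebook as described earlier.

```python
# Sketch of using a shared, JVM-side storage object from a notebook via
# Py4J. All storage method names are hypothetical stand-ins.
from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                      # the Spark application's gateway
storage = gateway.entry_point.getStorage()   # hypothetical entry-point method

# Entities are registered under a name and an owner; a DataFrame is shared
# by handing over its underlying JVM object.
df = sqlContext.range(1000)                  # sqlContext injected on startup
storage.register("Something Interesting", df._jdf, "john")

# Access control: keep an entity private, or share it with selected colleagues.
storage.setPrivate("Something Interesting", False)
storage.shareWith("Something Interesting", "susan")

# Other clients can see what has been published to them (and could be
# notified: "Hey, look, Susan published a new result!").
print(storage.listVisible("alex"))
```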

Cooperative Data Exploration

● John defines a DataFrame: “Something Interesting”

● Alex explores it

● Susan bases her models on it

● John uses a model shared by Susan (see the sketch below)
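Against the hypothetical storage API sketched above, the scenario could look roughly like this from the three users' notebooks; table and column names are made up, and a trained model would travel the same way, through its underlying JVM object.

```python
# The slide's scenario, written against the hypothetical storage API above.
# All three notebooks are clients of the same Spark application, so nothing
# is recomputed or copied between users.
from pyspark.sql import DataFrame

# John's notebook: define and publish "Something Interesting".
something_interesting = sqlContext.table("events").filter("clicks > 10")
storage.register("Something Interesting", something_interesting._jdf, "john")

# Alex's notebook: explore it straight from the shared in-memory storage.
shared = DataFrame(storage.get("Something Interesting"), sqlContext)
shared.describe().show()

# Susan's notebook: build on it and publish her own result under a new name
# (a model would be registered the same way, via its JVM object).
per_user = shared.groupBy("user_id").count()
storage.register("Susan's result", per_user._jdf, "susan")

# John's notebook: pick up what Susan shared.
susans_result = DataFrame(storage.get("Susan's result"), sqlContext)
susans_result.show()
```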

Thank you!

Piotr Lusakowski
Senior Software Engineer

piotr.lusakowski@deepsense.io
