Ferry - Share and Deploy Big Data Applications with Docker, by James Horey (PyData SV 2014)

Post on 20-Aug-2015

Ferry - Share & Deploy Big Data Applications with Docker

James Horey

• Writing a simple application with Bokeh

• Packaging our application with Docker

• Orchestrating our application with Ferry

Technical material can be found at: https://github.com/jhorey/pydata

Bokeh

U.S. Census

http://api.census.gov/data/2011/acs5?get=DP03_0062E&for=county:*&in=state:06

DP03_0062E: median income; county:*: all counties; state:06: California

Download some data
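The query above can be sketched in Python. This is a minimal sketch, not the talk's actual download script: the function names are hypothetical, the sample payload is made up for illustration, and it assumes the API answers with a JSON array whose first row is the header.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.census.gov/data/2011/acs5"


def build_url(variable="DP03_0062E", state="06"):
    """Build the query URL for one variable across all counties of a state."""
    params = {"get": variable, "for": "county:*", "in": "state:" + state}
    # Keep ':' and '*' literal so the URL matches the one shown above.
    return BASE_URL + "?" + urlencode(params, safe=":*")


def parse_rows(payload):
    """Turn a header-plus-rows JSON array into a list of dicts."""
    rows = json.loads(payload)
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


# Hypothetical sample response; the income values are made up.
SAMPLE = (
    '[["DP03_0062E","state","county"],'
    '["50000","06","001"],["60000","06","003"]]'
)

if __name__ == "__main__":
    print(build_url())
    for row in parse_rows(SAMPLE):
        print(row["county"], int(row["DP03_0062E"]))
```

To fetch live data, the URL from build_url() could be passed to urllib.request.urlopen and the response body handed to parse_rows.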

Let’s install Bokeh

$ pip install bokeh
>> Downloading/unpacking bokeh
>> SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
$ apt-get install python-dev && pip install bokeh
>> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
$ apt-get install g++
$ pip install bokeh

RuntimeError: bokeh sample data directory does not exist, please execute bokeh.sampledata.download()

$ python
>>> import bokeh.sampledata
>>> bokeh.sampledata.download()

A simple application

$ python plot.py Kentucky

Louisville

Let’s share

#!/bin/bash

# Make sure we have 'pip' installed
apt-get install python-pip

# Install packages in right order
apt-get --yes install g++ python-dev
pip install bokeh

# Now download the data
python geography.py data/
python population economic Kentucky data/

# Start the web server
python webserver data/

• Your script didn’t work
• Oh, I was supposed to run this as sudo?
• Ok, it still didn’t work
• I get this funny error
• Oh yeah, I’m running Redhat
• Ok I’m at my desk, just use my computer

Docker

• Encapsulates applications in isolated containers
• Makes it easy and safe to distribute applications
• Easy to get started

Our Dockerfile

• Start from a clean Ubuntu Precise (12.04) image
• Install stuff
• Add our files
• Run this when starting

$ docker build -t ferry/pydata .
$ docker push ferry/pydata

Sharing made simple

$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata

[Diagram: container p1 running on top of the kernel and hardware]

Sharing made simple

$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata
$ docker run -p 8001:8000 --name p2 -d ferry/pydata
$ docker run -p 8002:8000 --name p3 -d ferry/pydata

[Diagram: containers p1, p2, and p3 side by side on the same kernel and hardware]

• Containers share basic kernel and hardware capabilities
• No virtualization
• Containers are isolated
• Access via port forwarding
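Since every container listens on the same internal port, running several copies only requires a distinct host port and name for each. A minimal sketch of generating those commands follows; the function name and its defaults are hypothetical, and it uses the current `--name` spelling of the flag.

```python
import shlex


def docker_run_args(image, count, container_port=8000, first_host_port=8000):
    """Return one `docker run` argument list per container instance."""
    commands = []
    for i in range(count):
        # Each instance gets the next host port, forwarded to the same
        # port inside its own isolated container.
        host_port = first_host_port + i
        commands.append([
            "docker", "run",
            "-p", "{}:{}".format(host_port, container_port),
            "--name", "p{}".format(i + 1),
            "-d", image,
        ])
    return commands


if __name__ == "__main__":
    for args in docker_run_args("ferry/pydata", 3):
        print(" ".join(shlex.quote(a) for a in args))
```

Each argument list could be handed to subprocess.run on a machine where Docker is installed; printing them first makes it easy to check the port mapping before anything is started.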

You can run these commands now!

Cassandra

• Highly scalable and fault-tolerant
• Great for storing streaming data (sensors, messages)

CREATE KEYSPACE census WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE census;

CREATE TABLE acs_economic_data (
    state_cd TEXT,
    state_name TEXT,
    county_cd TEXT,
    county_name TEXT,
    median INT,
    mean INT,
    capita INT,
    PRIMARY KEY(county_cd, state_cd)
);
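Loading records into this table could look roughly like the sketch below. It only prepares the statement and its parameters; the helper name and the example record are hypothetical (the income figures are made up), and it assumes a client that accepts `%s` placeholders, as the DataStax Python driver does for simple statements.

```python
# Column order here must match the order produced by to_insert_params.
INSERT_CQL = (
    "INSERT INTO acs_economic_data "
    "(state_cd, state_name, county_cd, county_name, median, mean, capita) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)


def to_insert_params(record):
    """Order one record's fields to match the column list in INSERT_CQL."""
    return (
        record["state_cd"], record["state_name"],
        record["county_cd"], record["county_name"],
        # The three income columns are declared INT in the table.
        int(record["median"]), int(record["mean"]), int(record["capita"]),
    )


# Hypothetical record; the income values are made up for illustration.
EXAMPLE = {
    "state_cd": "21", "state_name": "Kentucky",
    "county_cd": "111", "county_name": "Jefferson",
    "median": "1000", "mean": "2000", "capita": "3000",
}

if __name__ == "__main__":
    print(INSERT_CQL)
    print(to_insert_params(EXAMPLE))
```

With a connected driver session, the pair would be passed along as session.execute(INSERT_CQL, to_insert_params(EXAMPLE)).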

Orchestration

Web DB

Web + DB

• Simple
• Full control
• More work for you

• Simpler Dockerfile
• More extensible
• How to orchestrate?

• Specify the containers that constitute your application in YAML

• Support for Hadoop, Cassandra, GlusterFS, and OpenMPI

• It’s a little bit like pip for your Docker-based runtime environment

Ferry

http://ferry.opencore.io

Our Application

backend:
   - storage:
        personality: "cassandra"
        instances: 1
connectors:
   - personality: "ferry/pydata-cassandra"
     ports: ["8000:8000"]

# The cassandra-client base comes with the various drivers
# pre-installed.
FROM ferry/cassandra-client
NAME ferry/pydata-cassandra

# Place the start scripts in the events directories so they
# are started when the connector is brought up.
ADD ./scripts/startcas.sh /service/runscripts/start/
ADD ./scripts/restartcas.sh /service/runscripts/restart/
RUN chmod a+x /service/runscripts/start/startcas.sh
RUN chmod a+x /service/runscripts/restart/restartcas.sh

+

Easy to share (again)

$ ferry start cassandra.yml
sa-df8d0aa6
$ ferry ps
UUID         Storage      Compute  Connectors   Status   Base           Time
----         -------      -------  ----------   ------   ----           ----
sa-df8d0aa6  se-54ed4e93           se-a5350a8d  running  cassandra.yml

$ ferry ssh sa-df8d0aa6
root@client-se-a5350a8d:~# ps -eaf | grep python
root 144 1 0 19:49 ? 00:00:00 python /home/ferry/pydata/bokeh/webserver.py /home/ferry/pydata/data

What’s it doing?

$ ferry start cassandra.yml

[Diagram: a Web connector and two Cassandra (C*) containers]

root@client-se-a5350a8d:~# env | grep BACK
BACKEND_STORAGE_TYPE=cassandra
BACKEND_STORAGE_IP=10.1.0.12

Generate Config

What’s it doing?

$ ferry start yarn

[Diagram: a Client container, two YARN (Y) containers, and two GlusterFS (G) containers]

root@client-se-b597cb21:~# env | grep BACK
BACKEND_STORAGE_TYPE=gluster
BACKEND_STORAGE_IP=10.1.0.18
BACKEND_COMPUTE_TYPE=yarn
BACKEND_COMPUTE_IP=10.1.0.15
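An application running in the connector can find its backends by reading these variables. The sketch below is one way to do that, assuming only the four BACKEND_* names shown in the sessions above; the function name is hypothetical, and Ferry may well set other variables that are not covered here.

```python
import os


def backend_settings(environ=None):
    """Collect the BACKEND_* variables into a dict keyed by role."""
    environ = os.environ if environ is None else environ
    settings = {}
    for role in ("STORAGE", "COMPUTE"):
        kind = environ.get("BACKEND_{}_TYPE".format(role))
        ip = environ.get("BACKEND_{}_IP".format(role))
        # A role is reported only when both its type and its IP are set.
        if kind and ip:
            settings[role.lower()] = {"type": kind, "ip": ip}
    return settings


if __name__ == "__main__":
    # Inside a connector this reads the real environment; elsewhere it
    # prints an empty dict.
    print(backend_settings())
```

Passing a plain dict instead of the real environment makes the function easy to try outside a container, for example backend_settings({"BACKEND_STORAGE_TYPE": "cassandra", "BACKEND_STORAGE_IP": "10.1.0.12"}).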

What’s it doing?

$ ferry stop sa-c6cbb572

[Diagram: the Client, YARN (Y), and GlusterFS (G) containers]

Next steps

$ ferry share sa-df8d0aa6

[Diagram: the same stack (w, c*, c*) shown on three separate machines]

Next steps

$ ferry deploy sa-df8d0aa6

[Diagram: the stack (w, c*, c*) moving from one machine onto several machines; labels: VPC, EC2, S3]

• Even simple applications can be complicated to install and run

• Docker helps quite a bit with this

• Ferry helps build out big data applications

Thank you!

James [email protected]

Ferry ferry.opencore.io @open_core_io