James Horey (opencore.io): Ferry - Share and Deploy Big Data Applications with Docker
DESCRIPTION
Ferry is a Python-based, open-source tool to help developers share and run big data applications. Users can provision Hadoop, Cassandra, GlusterFS, and Open MPI clusters locally on their machine using YAML, and afterwards distribute their applications via Dockerfiles. These capabilities are useful for data scientists experimenting with big data technologies, developers who need an accessible big data development environment, or developers simply interested in sharing their big data applications. In this presentation, I’ll introduce you to Docker, show you how to create a simple big data application in Ferry, and discuss ways the Python community can contribute to the open-source project. I’ll also discuss future directions for Ferry with a focus on better application sharing and operational deployments.
TRANSCRIPT
Ferry - Share & Deploy Big Data Applications with Docker
James Horey
• Writing a simple application with Bokeh
• Packaging our application with Docker
• Orchestrating our application with Ferry
Technical material can be found at: https://github.com/jhorey/pydata
Bokeh
U.S. Census
http://api.census.gov/data/2011/acs5?get=DP03_0062E&for=county:*&in=state:06
Median income (DP03_0062E), all counties (county:*), California (state:06)
Download some data
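A minimal sketch of the download step, assuming the API answers with a JSON array of arrays whose first row is the column header. The function names and the sample payload are hypothetical; the sample values are placeholders, not real census figures.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.census.gov/data/2011/acs5"


def build_url(variable="DP03_0062E", state="06"):
    """Build the query URL for one variable across all counties of a state."""
    params = [("get", variable), ("for", "county:*"), ("in", "state:" + state)]
    # Keep ':' and '*' literal so the URL matches the form shown on the slide.
    return BASE_URL + "?" + urlencode(params, safe=":*")


def parse_rows(payload):
    """Turn an array-of-arrays JSON response into a list of dicts."""
    rows = json.loads(payload)
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


# Placeholder payload for illustration only; the values are not real data.
SAMPLE = '[["DP03_0062E","state","county"],["11111","06","001"],["22222","06","003"]]'

if __name__ == "__main__":
    print(build_url())
    for row in parse_rows(SAMPLE):
        print(row["county"], row["DP03_0062E"])
```

To fetch live data, the URL from `build_url()` could be passed to `urllib.request.urlopen` and the decoded body handed to `parse_rows`.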
Let’s install Bokeh
$ pip install bokeh
>> Downloading/unpacking bokeh
>> SystemError: Cannot compile 'Python.h'. Perhaps you need to install python-dev|python-devel.
$ apt-get install python-dev && pip install bokeh
>> gcc: error trying to exec 'cc1plus': execvp: No such file or directory
$ apt-get install g++
$ pip install bokeh
RuntimeError: bokeh sample data directory does not exist, please execute bokeh.sampledata.download()
$ python
>>> import bokeh.sampledata
>>> bokeh.sampledata.download()
A simple application
$ python plot.py Kentucky
Louisville
Let’s share
#!/bin/bash

# Make sure we have 'pip' installed
apt-get install python-pip

# Install packages in the right order
apt-get --yes install g++ python-dev
pip install bokeh

# Now download the data
python geography.py data/
python population economic Kentucky data/

# Start the web server
python webserver data/
• Your script didn’t work
• Oh, I was supposed to run this as sudo?
• Ok, it still didn’t work
• I get this funny error
• Oh yeah, I’m running Red Hat
• Ok I’m at my desk, just use my computer
Docker
• Encapsulates applications in isolated containers
• Makes it easy and safe to distribute applications
• Easy to get started
Our Dockerfile
Start from a clean Precise image
Install stuff
Add our files
Run this when starting
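The Dockerfile itself did not survive in the transcript. As a rough sketch of the four steps listed above, the following Python renders Dockerfile text using the standard `FROM`, `RUN`, `ADD`, and `CMD` instructions. The `render_dockerfile` helper, the `ubuntu:precise` tag, and the `/app` paths are assumptions for illustration, not the contents of the original slide.

```python
import json


def render_dockerfile(base_image, packages, pip_packages, app_dir, command):
    """Render Dockerfile text following the four steps on the slide."""
    lines = [
        # 1. Start from a clean base image.
        "FROM " + base_image,
        # 2. Install system packages, then Python packages.
        "RUN apt-get update && apt-get --yes install " + " ".join(packages),
        "RUN pip install " + " ".join(pip_packages),
        # 3. Add our files to the image.
        "ADD . " + app_dir,
        # 4. Run this when the container starts (exec form of CMD).
        "CMD " + json.dumps(command),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Image tag and paths are hypothetical; package names come from the
    # install script shown earlier in the talk.
    print(render_dockerfile(
        "ubuntu:precise",
        ["python-pip", "g++", "python-dev"],
        ["bokeh"],
        "/app",
        ["python", "/app/webserver.py", "/app/data"],
    ))
```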
$ docker build -t ferry/pydata .
$ docker push ferry/pydata
Sharing made simple
$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata
[Diagram: container p1 running on top of the shared kernel and hardware]
Sharing made simple
$ docker pull ferry/pydata
$ docker run -p 8000:8000 --name p1 -d ferry/pydata
$ docker run -p 8001:8000 --name p2 -d ferry/pydata
$ docker run -p 8002:8000 --name p3 -d ferry/pydata
[Diagram: containers p1, p2, and p3 running on top of the shared kernel and hardware]
• Containers share basic kernel and hardware capabilities
• No virtualization
• Containers are isolated
• Access via port forwarding
You can run these commands now!
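To make the port-forwarding pattern concrete, here is a small sketch that generates the `docker run` argument lists for several containers, each mapping a different host port to the same container port. The `docker_run_commands` helper is hypothetical, and it uses the `--name` and `-d` flag spellings accepted by current Docker releases.

```python
import shlex


def docker_run_commands(image, count, first_host_port=8000, container_port=8000):
    """Return one `docker run` argument list per container instance."""
    commands = []
    for i in range(count):
        # Each instance gets its own host port; the container port is fixed.
        host_port = first_host_port + i
        commands.append([
            "docker", "run",
            "-p", "{}:{}".format(host_port, container_port),
            "--name", "p{}".format(i + 1),
            "-d", image,
        ])
    return commands


if __name__ == "__main__":
    for cmd in docker_run_commands("ferry/pydata", 3):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each list could be handed to `subprocess.run` as-is; building lists rather than strings avoids shell quoting problems.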
Cassandra
• Highly scalable and fault-tolerant
• Great for storing streaming data (sensors, messages)
CREATE KEYSPACE census WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

USE census;

CREATE TABLE acs_economic_data (
    state_cd TEXT,
    state_name TEXT,
    county_cd TEXT,
    county_name TEXT,
    median INT,
    mean INT,
    capita INT,
    PRIMARY KEY (county_cd, state_cd)
);
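A sketch of how one record might be prepared for insertion into this table. The helper names are hypothetical, and the talk does not say which client library the application uses; the `%s` placeholders assume something like the DataStax Python driver, where the pair would be passed as `session.execute(query, params)`.

```python
# Column order of the acs_economic_data table defined above.
COLUMNS = ("state_cd", "state_name", "county_cd", "county_name",
           "median", "mean", "capita")
INT_COLUMNS = {"median", "mean", "capita"}


def insert_statement(table="acs_economic_data"):
    """Build a parameterized INSERT covering every column of the table."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    return "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(COLUMNS), placeholders)


def row_params(record):
    """Order a record's values to match COLUMNS, casting INT columns."""
    return tuple(int(record[c]) if c in INT_COLUMNS else record[c]
                 for c in COLUMNS)


if __name__ == "__main__":
    # Placeholder record for illustration only; not real census figures.
    record = {"state_cd": "06", "state_name": "California",
              "county_cd": "001", "county_name": "Example County",
              "median": "1", "mean": "2", "capita": "3"}
    print(insert_statement())
    print(row_params(record))
```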
Orchestration
Web + DB (one container)
• Simple
• Full control
• More work for you
Web and DB (separate containers)
• Simpler Dockerfile
• More extensible
• How to orchestrate?
Ferry
http://ferry.opencore.io
• Specify the containers that constitute your application in YAML
• Support for Hadoop, Cassandra, GlusterFS, and Open MPI
• It’s a little bit like pip for your Docker-based runtime environment
Our Application
backend:
   - storage:
        personality: "cassandra"
        instances: 1
connectors:
   - personality: "ferry/pydata-cassandra"
     ports: ["8000:8000"]
# The cassandra-client base comes with the various drivers
# pre-installed.
FROM ferry/cassandra-client
NAME ferry/pydata-cassandra

# Place the start scripts in the events directories so they
# are started when the connector is brought up.
ADD ./scripts/startcas.sh /service/runscripts/start/
ADD ./scripts/restartcas.sh /service/runscripts/restart/
RUN chmod a+x /service/runscripts/start/startcas.sh
RUN chmod a+x /service/runscripts/restart/restartcas.sh
Easy to share (again)
$ ferry start cassandra.yml
sa-df8d0aa6
$ ferry ps
UUID         Storage      Compute  Connectors   Status   Base           Time
----         -------      -------  ----------   ------   ----           ----
sa-df8d0aa6  se-54ed4e93           se-a5350a8d  running  cassandra.yml
$ ferry ssh sa-df8d0aa6
root@client-se-a5350a8d:~# ps -eaf | grep python
root 144 1 0 19:49 ? 00:00:00 python /home/ferry/pydata/bokeh/webserver.py /home/ferry/pydata/data
What’s it doing?
$ ferry start cassandra.yml
[Diagram: a web connector alongside Cassandra (C*) containers]
root@client-se-a5350a8d:~# env | grep BACK
BACKEND_STORAGE_TYPE=cassandra
BACKEND_STORAGE_IP=10.1.0.12
Generate config
What’s it doing?
$ ferry start yarn
[Diagram: a client alongside YARN (Y) compute containers and GlusterFS (G) storage containers]
root@client-se-b597cb21:~# env | grep BACK
BACKEND_STORAGE_TYPE=gluster
BACKEND_STORAGE_IP=10.1.0.18
BACKEND_COMPUTE_TYPE=yarn
BACKEND_COMPUTE_IP=10.1.0.15
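A connector's start script can discover its backends from these environment variables. The sketch below groups them by role; only the four variable names shown in the `env` output come from the talk, and the general `BACKEND_<ROLE>_<FIELD>` pattern and the `backend_settings` helper are assumptions.

```python
import os


def backend_settings(environ=None):
    """Collect BACKEND_<ROLE>_<FIELD> variables into {role: {field: value}}."""
    environ = os.environ if environ is None else environ
    settings = {}
    for key, value in environ.items():
        parts = key.split("_")
        # Only accept names of the exact three-part form seen in the talk.
        if len(parts) == 3 and parts[0] == "BACKEND":
            role, field = parts[1].lower(), parts[2].lower()
            settings.setdefault(role, {})[field] = value
    return settings


if __name__ == "__main__":
    example = {
        "BACKEND_STORAGE_TYPE": "gluster",
        "BACKEND_STORAGE_IP": "10.1.0.18",
        "BACKEND_COMPUTE_TYPE": "yarn",
        "BACKEND_COMPUTE_IP": "10.1.0.15",
    }
    for role, fields in sorted(backend_settings(example).items()):
        print(role, fields["type"], fields["ip"])
```

Passing a plain dict instead of reading `os.environ` directly keeps the function easy to test outside a container.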
What’s it doing?
$ ferry stop sa-c6cbb572
[Diagram: the client, YARN (Y), and GlusterFS (G) containers]
Next steps
$ ferry share sa-df8d0aa6
[Diagram: the web (w) and Cassandra (c*) containers shown on three separate hardware hosts]
Next steps
$ ferry deploy sa-df8d0aa6
[Diagram: the web (w) and Cassandra (c*) containers on a single host, then spread across multiple hosts; labels: VPC, EC2, S3]
• Even simple applications can be complicated to install and run
• Docker helps quite a bit with this
• Ferry helps build out big data applications