
Page 1: Lessons Learned Running Hadoop and Spark in Docker Containers

Lessons Learned Running Hadoop and Spark in Docker Containers

Thomas Phelan
Chief Architect, BlueData
@tapbluedata

September 29, 2016

Page 2: Lessons Learned Running Hadoop and Spark in Docker Containers

Outline

• Docker Containers and Big Data

• Hadoop and Spark on Docker: Challenges

• How We Did It: Lessons Learned

• Key Takeaways

• Q & A

Page 3: Lessons Learned Running Hadoop and Spark in Docker Containers

A Better Way to Deploy Big Data

Traditional Approach (separate IT-built clusters for Manufacturing, Sales, R&D, Services):

• < 30% utilization
• Weeks to build each cluster
• Duplication of data
• Management complexity
• Painful, complex upgrades

A New Approach: Hadoop and Spark on Docker (shared by Manufacturing, Sales, R&D, Services, with BI/Analytics tools):

• > 90% utilization
• No duplication of data
• Simplified management
• Multi-tenant
• Self-service, on-demand clusters
• Simple, instant upgrades

Page 4: Lessons Learned Running Hadoop and Spark in Docker Containers

Deploying Multiple Big Data Clusters

Data scientists want flexibility:

• Different versions of Hadoop, Spark, et al.
• Different sets of tools

IT wants control:

• Multi-tenancy
- Data security
- Network isolation

Page 5: Lessons Learned Running Hadoop and Spark in Docker Containers

Containers = the Future of Big Data

Infrastructure

• Agility and elasticity

• Standardized environments (dev, test, prod)

• Portability (on-premises and cloud)

• Higher resource utilization

Applications

• Fool-proof packaging (configs, libraries, driver versions, etc.)

• Repeatable builds and orchestration

• Faster app dev cycles

Page 6: Lessons Learned Running Hadoop and Spark in Docker Containers

The Journey to Big Data on Docker

Start with a clear goal in sight. Begin with your Docker toolbox of a single container and basic networking and storage.

So you want to run Hadoop and Spark on Docker in a multi-tenant enterprise deployment?

Beware … there is trouble ahead.

Page 7: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Pitfalls

Navigate the river of container managers:

• Swarm?
• Kubernetes?
• AWS ECS?
• Mesos?

Cross the desert of storage configurations:

• Overlay files?
• Flocker?
• Convoy?

Traverse the tightrope of network configurations:

• Docker Networking? Calico
• Kubernetes Networking? Flannel, Weave Net

Page 8: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Challenges

Pass through the jungle of software compatibility. Tame the lion of performance. Trip down the staircase of deployment mistakes.

Finally you get to the top!

Page 9: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Next Steps?

But for deployment in the enterprise, you are not even close to being done …

You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches.

Page 10: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Quagmire

You realize it’s time to get some help!

Page 11: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Run Hadoop/Spark distros and applications unmodified
- Deploy all services that typically run on a single bare-metal host in a single container
• Multi-tenancy support is key
- Network and storage security

Page 12: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Images built to “auto-configure” themselves at time of instantiation
- Not all instances of a single image run the same set of services when instantiated
• Master vs. worker cluster nodes (see the sketch below)
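As a hedged illustration of this auto-configure pattern, an entrypoint can branch on a role variable injected at instantiation time. The NODE_ROLE variable and the service names below are assumptions for the sketch, not BlueData's actual mechanism:

#!/bin/bash
# Illustrative entrypoint: the same image starts different services
# depending on a role injected at container start.
# NODE_ROLE and the service names are hypothetical examples.
case "${NODE_ROLE:-worker}" in
  master)
    # Master node: cluster-coordination services
    service hadoop-hdfs-namenode start
    service hadoop-yarn-resourcemanager start
    ;;
  worker)
    # Worker nodes: per-node data and compute services
    service hadoop-hdfs-datanode start
    service hadoop-yarn-nodemanager start
    ;;
esac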

Page 13: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Maintain the promise of containers

- Keep them as stateless as possible

- Container storage is always ephemeral

- Persistent storage is external to the container

Page 14: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Implementation

Resource Utilization

• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
• Noisy neighbors

Network

• Connect containers across hosts
• Persistence of IP address across container restart
• Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
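To make the shares-vs-cores distinction concrete, a minimal sketch (the image name is a placeholder): --cpu-shares sets a relative weight, a soft limit that makes CPU over-provisioning safe, while --memory is a hard cap, which is why memory should not be over-provisioned.

# Soft CPU limit via relative shares; hard memory cap.
# (--cpuset-cpus would instead pin the container to specific cores.)
docker run -d \
  --cpu-shares=512 \
  --memory=64g \
  my-spark-worker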

Page 15: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Network Architecture

[Diagram: a container orchestrator with DHCP/DNS services; on each host, containers (Resource Manager, Node Manager, Spark Master, Spark Worker, Zeppelin) attach to per-tenant networks through an OVS bridge on the host NIC, with VxLAN tunnels carrying tenant traffic between hosts.]
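A rough sketch of the tenant-isolation idea with Open vSwitch; the bridge and port names, the remote host IP, and the VNI key are illustrative, not the actual deployment:

# Create a per-tenant OVS bridge on each host.
ovs-vsctl add-br br-tenant1
# Add a VxLAN tunnel port to the peer host; a distinct key (VNI)
# per tenant keeps tenant traffic separated on the wire.
ovs-vsctl add-port br-tenant1 vxlan-t1 -- \
  set interface vxlan-t1 type=vxlan \
  options:remote_ip=192.168.10.2 options:key=100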

Page 16: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Implementation

Storage

• Tweak the default size of a container’s /root
- Resizing of storage inside an existing container is tricky
• Mount a logical volume on /data
- No use of overlay file systems
• DataTap (version-independent, HDFS-compliant) connectivity to external storage

TIP: Mounting block devices into a container does not support symbolic links; a name like /dev/sdb will not work reliably, since the PCI device it maps to can change across host reboots.

Image Management

• Utilize Docker’s image repository

TIP: Docker images can get large. Use “docker squash” to save on size.
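To make the "/data on a logical volume, no overlay file systems" point concrete, a sketch along these lines (volume group, size, and names are illustrative):

# Carve a dedicated logical volume for one container's data.
lvcreate -L 200G -n c1_data vg_data
mkfs.ext4 /dev/vg_data/c1_data
mkdir -p /mnt/c1_data
mount /dev/vg_data/c1_data /mnt/c1_data
# Bind-mount it into the container at /data, bypassing the
# container's ephemeral overlay storage entirely.
docker run -d -v /mnt/c1_data:/data my-hadoop-image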

Page 17: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Security Considerations

• Security is essential since containers and host share one kernel

- Non-privileged containers

• Achieved through layered set of capabilities

• Different capabilities provide different levels of isolation and protection

• Add “capabilities” to a container based on what operations are permitted
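A minimal sketch of this layered approach, starting from zero privileges and adding back only what the workload needs (the specific capabilities chosen here are illustrative, not BlueData's actual list):

# Drop everything, then grant a small, explicit capability set.
docker run -d \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=SETUID \
  --cap-add=SETGID \
  my-hadoop-image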

Page 18: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Sample Dockerfile

# Spark-1.5.2 Docker image for RHEL/CentOS 6.x
FROM centos:centos6

# Download and extract Spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/

# Download and extract Scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar -xz -C /usr/lib/scala/

# Install Zeppelin (fetched from an internal build server)
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz | tar -xz -C /usr/lib/zeppelin

# Clean up package caches and temp files to keep the image small
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*

# Add the service-configuration script, make it executable, and run it
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
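Building the image then follows the standard flow (the tag name is illustrative):

docker build -t bluedata/spark:1.5.2 .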

Page 19: Lessons Learned Running Hadoop and Spark in Docker Containers

BlueData Application Image (.bin file)

[Diagram: an application .bin file packages a Docker image (built from a CentOS or a RHEL Dockerfile) together with an appconfig bundle (conf, init.d start script), the <app> logo file, and an <app>.wb file. The bdwb command assembles clusterconfig, image, role, appconfig, catalog, service, etc. Inputs are the sources (Dockerfile, logo .png, init.d) plus the runtime software bits; alternatively, extract an existing .bin and modify it to create a new one.]

Page 20: Lessons Learned Running Hadoop and Spark in Docker Containers

Multi-Host

[Screenshot: 4 containers on 2 different hosts, using 1 VLAN and 4 persistent IPs]

Page 21: Lessons Learned Running Hadoop and Spark in Docker Containers

Different Services in Each Container

[Screenshot: master services vs. worker services]

Page 22: Lessons Learned Running Hadoop and Spark in Docker Containers

Container Storage from Host

[Diagram: container storage mapped to host storage]

Page 23: Lessons Learned Running Hadoop and Spark in Docker Containers

Performance Testing: Spark

• Spark 1.x on YARN
• HiBench TeraSort
- Data sizes: 100GB, 500GB, 1TB
• 10-node physical/virtual cluster
• 36 cores and 112GB memory per node
• 2TB HDFS storage per node (SSDs)
• 800GB ephemeral storage

Page 24: Lessons Learned Running Hadoop and Spark in Docker Containers

Spark on Docker: Performance

[Chart: throughput in MB/s]

Page 25: Lessons Learned Running Hadoop and Spark in Docker Containers

Performance Testing: Hadoop

Page 26: Lessons Learned Running Hadoop and Spark in Docker Containers

Hadoop on Docker: Performance

[Chart: containers (BlueData EPIC with 1 virtual node per host) vs. bare-metal]

Page 27: Lessons Learned Running Hadoop and Spark in Docker Containers

“Dockerized” Hadoop Clusters

[Screenshot: 4 Docker containers running multiple Hadoop versions; different Hadoop versions on the same set of physical hosts]

Page 28: Lessons Learned Running Hadoop and Spark in Docker Containers

“Dockerized” Spark Standalone

[Screenshot: Spark standalone with a Zeppelin notebook; 5 fully managed Docker containers with persistent IP addresses]

Page 29: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• All apps can be “Dockerized”, including Hadoop & Spark
- Traditional bare-metal approach to Big Data is rigid and inflexible
- Containers (e.g. Docker) provide a more flexible & agile model
- Faster app dev cycles for Big Data app developers, data scientists, & engineers

Page 30: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• There are unique Big Data pitfalls & challenges with Docker

- For enterprise deployments, you will need to overcome these and more:

Docker base images include Big Data libraries and jar files

Container orchestration, including networking and storage

Resource-aware runtime environment, including CPU and RAM

Page 31: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• There are unique Big Data pitfalls & challenges with Docker
- More:
Access to containers secured with SSH keypair or PAM module (LDAP/AD)
Fast access to external storage
Management agents in Docker images
Runtime injection of resource and configuration information

Page 32: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• “Do It Yourself” will be costly and time-consuming
- Be prepared to tackle the infrastructure & plumbing challenges
- In the enterprise, the business value is in the applications / data science
• There are other options …
- Public cloud: AWS
- BlueData: turnkey solution

Page 34: Lessons Learned Running Hadoop and Spark in Docker Containers

Thank You

Contact info:

@tapbluedata

[email protected]

www.bluedata.com

FREE LITE VERSION: www.bluedata.com/free