
Page 1: Lessons Learned Running Hadoop and Spark in Docker Containers

Lessons Learned Running Hadoop and Spark in Docker Containers

Thomas Phelan
Chief Architect, BlueData
@tapbluedata

September 29, 2016

Page 2: Lessons Learned Running Hadoop and Spark in Docker Containers

Outline

• Docker Containers and Big Data

• Hadoop and Spark on Docker: Challenges

• How We Did It: Lessons Learned

• Key Takeaways

• Q & A

Page 3: Lessons Learned Running Hadoop and Spark in Docker Containers

A Better Way to Deploy Big Data

Traditional Approach (separate IT-built clusters for Manufacturing, Sales, R&D, Services):

• < 30% utilization
• Weeks to build each cluster
• Duplication of data
• Management complexity
• Painful, complex upgrades

A New Approach: Hadoop and Spark on Docker (shared by Manufacturing, Sales, R&D, Services, with BI/Analytics tools):

• > 90% utilization
• No duplication of data
• Simplified management
• Multi-tenant
• Self-service, on-demand clusters
• Simple, instant upgrades

Page 4: Lessons Learned Running Hadoop and Spark in Docker Containers

Deploying Multiple Big Data Clusters

Data scientists want flexibility:

• Different versions of Hadoop, Spark, et al.
• Different sets of tools

IT wants control:

• Multi-tenancy
- Data security
- Network isolation

Page 5: Lessons Learned Running Hadoop and Spark in Docker Containers

Containers = the Future of Big Data

Infrastructure

• Agility and elasticity

• Standardized environments (dev, test, prod)

• Portability (on-premises and cloud)

• Higher resource utilization

Applications

• Fool-proof packaging (configs, libraries, driver versions, etc.)

• Repeatable builds and orchestration

• Faster app dev cycles

Page 6: Lessons Learned Running Hadoop and Spark in Docker Containers

The Journey to Big Data on Docker

Start with a clear goal in sight. Begin with your Docker toolbox of a single container and basic networking and storage.

So you want to run Hadoop and Spark on Docker in a multi-tenant enterprise deployment?

Beware … there is trouble ahead.

Page 7: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Pitfalls

Navigate the river of container managers:

• Swarm?
• Kubernetes?
• AWS ECS?
• Mesos?

Cross the desert of storage configurations:

• Overlay files?
• Flocker?
• Convoy?

Traverse the tightrope of network configurations:

• Docker Networking? Calico
• Kubernetes Networking? Flannel, Weave Net

Page 8: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Challenges

Pass through the jungle of software compatibility. Tame the lion of performance. Trip down the staircase of deployment mistakes.

Finally you get to the top!

Page 9: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Next Steps?

But for deployment in the enterprise, you are not even close to being done …

You still have to climb past: high availability, backup/recovery, security, multi-host, multi-container, upgrades and patches.

Page 10: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Quagmire

You realize it’s time to get some help!

Page 11: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Run Hadoop/Spark distros and applications unmodified
- Deploy all services that typically run on a single bare-metal host in a single container
• Multi-tenancy support is key
- Network and storage security

Page 12: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Images built to “auto-configure” themselves at time of instantiation
- Not all instances of a single image run the same set of services when instantiated
• Master vs. worker cluster nodes (see the sketch below)
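As a hedged illustration of this auto-configure pattern, an entrypoint can branch on a role variable injected at instantiation time. The NODE_ROLE variable and the service names below are assumptions for the sketch, not BlueData's actual mechanism:

#!/bin/bash
# Illustrative entrypoint: the same image starts different services
# depending on a role injected at container start.
# NODE_ROLE and the service names are hypothetical examples.
case "${NODE_ROLE:-worker}" in
  master)
    # Master node: cluster-coordination services
    service hadoop-hdfs-namenode start
    service hadoop-yarn-resourcemanager start
    ;;
  worker)
    # Worker nodes: per-node data and compute services
    service hadoop-hdfs-datanode start
    service hadoop-yarn-nodemanager start
    ;;
esac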

Page 13: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Design Decisions

• Maintain the promise of containers

- Keep them as stateless as possible

- Container storage is always ephemeral

- Persistent storage is external to the container

Page 14: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Implementation

Resource Utilization

• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
• Noisy neighbors

Network

• Connect containers across hosts
• Persistence of IP address across container restart
• Deploy VLANs and VxLAN tunnels for tenant-level traffic isolation
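To make the shares-vs-cores distinction concrete, a minimal sketch (the image name is a placeholder): --cpu-shares sets a relative weight, a soft limit that makes CPU over-provisioning safe, while --memory is a hard cap, which is why memory should not be over-provisioned.

# Soft CPU limit via relative shares; hard memory cap.
# (--cpuset-cpus would instead pin the container to specific cores.)
docker run -d \
  --cpu-shares=512 \
  --memory=64g \
  my-spark-worker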

Page 15: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Network Architecture

[Diagram: a container orchestrator with DHCP/DNS services; on each host, containers (Resource Manager, Node Manager, Spark Master, Spark Worker, Zeppelin) attach to per-tenant networks through an OVS bridge on the host NIC, with VxLAN tunnels carrying tenant traffic between hosts.]
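A rough sketch of the tenant-isolation idea with Open vSwitch; the bridge and port names, the remote host IP, and the VNI key are illustrative, not the actual deployment:

# Create a per-tenant OVS bridge on each host.
ovs-vsctl add-br br-tenant1
# Add a VxLAN tunnel port to the peer host; a distinct key (VNI)
# per tenant keeps tenant traffic separated on the wire.
ovs-vsctl add-port br-tenant1 vxlan-t1 -- \
  set interface vxlan-t1 type=vxlan \
  options:remote_ip=192.168.10.2 options:key=100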

Page 16: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Implementation

Storage

• Tweak the default size of a container’s /root
- Resizing of storage inside an existing container is tricky
• Mount a logical volume on /data
- No use of overlay file systems
• DataTap (version-independent, HDFS-compliant) connectivity to external storage

TIP: Mounting block devices into a container does not support symbolic links; a name like /dev/sdb will not work reliably, since the PCI device it maps to can change across host reboots.

Image Management

• Utilize Docker’s image repository

TIP: Docker images can get large. Use “docker squash” to save on size.
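To make the "/data on a logical volume, no overlay file systems" point concrete, a sketch along these lines (volume group, size, and names are illustrative):

# Carve a dedicated logical volume for one container's data.
lvcreate -L 200G -n c1_data vg_data
mkfs.ext4 /dev/vg_data/c1_data
mkdir -p /mnt/c1_data
mount /dev/vg_data/c1_data /mnt/c1_data
# Bind-mount it into the container at /data, bypassing the
# container's ephemeral overlay storage entirely.
docker run -d -v /mnt/c1_data:/data my-hadoop-image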

Page 17: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Security Considerations

• Security is essential since containers and host share one kernel

- Non-privileged containers

• Achieved through layered set of capabilities

• Different capabilities provide different levels of isolation and protection

• Add “capabilities” to a container based on what operations are permitted
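A minimal sketch of this layered approach, starting from zero privileges and adding back only what the workload needs (the specific capabilities chosen here are illustrative, not BlueData's actual list):

# Drop everything, then grant a small, explicit capability set.
docker run -d \
  --cap-drop=ALL \
  --cap-add=CHOWN \
  --cap-add=SETUID \
  --cap-add=SETGID \
  my-hadoop-image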

Page 18: Lessons Learned Running Hadoop and Spark in Docker Containers

How We Did It: Sample Dockerfile

# Spark-1.5.2 Docker image for RHEL/CentOS 6.x
FROM centos:centos6

# Download and extract Spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/

# Download and extract Scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar -xz -C /usr/lib/scala/

# Install Zeppelin (fetched from an internal build server)
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz | tar -xz -C /usr/lib/zeppelin

# Clean up package caches and temp files to keep the image small
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*

# Add the service-configuration script, make it executable, and run it
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
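Building the image then follows the standard flow (the tag name is illustrative):

docker build -t bluedata/spark:1.5.2 .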

Page 19: Lessons Learned Running Hadoop and Spark in Docker Containers

BlueData Application Image (.bin file)

[Diagram: an application .bin file packages a Docker image (built from a CentOS or a RHEL Dockerfile) together with an appconfig bundle (conf, init.d start script), the <app> logo file, and an <app>.wb file. The bdwb command assembles clusterconfig, image, role, appconfig, catalog, service, etc. Inputs are the sources (Dockerfile, logo .png, init.d) plus the runtime software bits; alternatively, extract an existing .bin and modify it to create a new one.]

Page 20: Lessons Learned Running Hadoop and Spark in Docker Containers

Multi-Host

[Screenshot: 4 containers on 2 different hosts, using 1 VLAN and 4 persistent IPs]

Page 21: Lessons Learned Running Hadoop and Spark in Docker Containers

Different Services in Each Container

[Screenshot: master services vs. worker services]

Page 22: Lessons Learned Running Hadoop and Spark in Docker Containers

Container Storage from Host

[Diagram: container storage mapped to host storage]

Page 23: Lessons Learned Running Hadoop and Spark in Docker Containers

Performance Testing: Spark

• Spark 1.x on YARN
• HiBench TeraSort
- Data sizes: 100GB, 500GB, 1TB
• 10-node physical/virtual cluster
• 36 cores and 112GB memory per node
• 2TB HDFS storage per node (SSDs)
• 800GB ephemeral storage

Page 24: Lessons Learned Running Hadoop and Spark in Docker Containers

Spark on Docker: Performance

[Chart: throughput in MB/s]

Page 25: Lessons Learned Running Hadoop and Spark in Docker Containers

Performance Testing: Hadoop

Page 26: Lessons Learned Running Hadoop and Spark in Docker Containers

Hadoop on Docker: Performance

[Chart: containers (BlueData EPIC with 1 virtual node per host) vs. bare-metal]

Page 27: Lessons Learned Running Hadoop and Spark in Docker Containers

“Dockerized” Hadoop Clusters

[Screenshot: 4 Docker containers running multiple Hadoop versions; different Hadoop versions on the same set of physical hosts]

Page 28: Lessons Learned Running Hadoop and Spark in Docker Containers

“Dockerized” Spark Standalone

[Screenshot: Spark standalone with a Zeppelin notebook; 5 fully managed Docker containers with persistent IP addresses]

Page 29: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• All apps can be “Dockerized”, including Hadoop & Spark
- Traditional bare-metal approach to Big Data is rigid and inflexible
- Containers (e.g. Docker) provide a more flexible & agile model
- Faster app dev cycles for Big Data app developers, data scientists, & engineers

Page 30: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• There are unique Big Data pitfalls & challenges with Docker

- For enterprise deployments, you will need to overcome these and more:

Docker base images include Big Data libraries and jar files

Container orchestration, including networking and storage

Resource-aware runtime environment, including CPU and RAM

Page 31: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• There are unique Big Data pitfalls & challenges with Docker
- More:
Access to containers secured with SSH keypair or PAM module (LDAP/AD)
Fast access to external storage
Management agents in Docker images
Runtime injection of resource and configuration information

Page 32: Lessons Learned Running Hadoop and Spark in Docker Containers

Big Data on Docker: Key Takeaways

• “Do It Yourself” will be costly and time-consuming
- Be prepared to tackle the infrastructure & plumbing challenges
- In the enterprise, the business value is in the applications / data science
• There are other options …
- Public cloud: AWS
- BlueData: turnkey solution

Page 34: Lessons Learned Running Hadoop and Spark in Docker Containers

Thank You

Contact info:

@tapbluedata

[email protected]

www.bluedata.com

FREE LITE VERSION: www.bluedata.com/free