Hadoop Cluster on Docker Containers
"What Works and What Doesn't"
By: Pranav Joshi, ME-HPC, GTU PG School

Posted: 17-Aug-2015

Page 1: Hadoop Cluster on Docker Containers

Hadoop Cluster on Docker Containers: "What Works and What Doesn't"

By: Pranav Joshi, ME-HPC, GTU PG School

Page 2: Hadoop Cluster on Docker Containers

Content

● Introduction to Hadoop and Docker
● Why Hadoop on Docker?
● Job Configuration
● OpenStack Sahara
● Handling Hadoop Single Points of Failure
● Validating the Prototype
● Performance Test
● Conclusion
● References

Page 3: Hadoop Cluster on Docker Containers

Introduction to Hadoop

● Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

● The major components of Apache Hadoop are:

– Hadoop Common: The common utilities that support the other Hadoop modules.

– Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

– Hadoop YARN: A framework for job scheduling and cluster resource management.

– Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
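To make the MapReduce model above concrete, here is a minimal word-count sketch in plain Python. It only mimics the map and reduce phases in a single process; a real job would run these functions in parallel across YARN containers, with a shuffle phase between them.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts per word (shuffle and reduce collapsed
    # into one dictionary for this single-process sketch).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["hadoop runs on docker",
                                 "docker runs hadoop"]))
```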

Page 4: Hadoop Cluster on Docker Containers

Introduction to Docker Container

● Docker allows you to package an application with all of its dependencies into a standardized unit for software development.

● It is an open-source program that enables a Linux application and its dependencies to be packaged as a container.

● Containers include the application and all of its dependencies, but share the kernel with other containers.
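As a concrete (hypothetical) illustration, a Dockerfile declares an application and all of its dependencies in one place; the names and versions below are placeholders, not a specific application:

```dockerfile
# Hypothetical example: package a Java application with its dependency.
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y openjdk-7-jre
COPY myapp.jar /opt/myapp/myapp.jar
CMD ["java", "-jar", "/opt/myapp/myapp.jar"]
```

Building this file once yields an image that runs identically on any Linux host with Docker installed, since the container ships its userland but shares the host kernel.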

Page 5: Hadoop Cluster on Docker Containers

Why Docker?

● Lightweight, Portable

● Build once, Run anywhere

● VM-like environments, without the overhead of a VM

● Isolated containers

● Automated and scripted

Page 6: Hadoop Cluster on Docker Containers

Separating out simple tasks

Page 7: Hadoop Cluster on Docker Containers

Container vs. VMs

Page 8: Hadoop Cluster on Docker Containers

Job Configuration

● YARN's ApplicationMaster asks the NodeManager to launch containers through the LinuxContainerExecutor

● Docker can be used not only for fine-grained performance isolation, but also for delivering software packages
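For reference, switching the NodeManager to the LinuxContainerExecutor in Hadoop 2.x is a configuration change along these lines (a sketch only; property values should be checked against the Hadoop version in use):

```xml
<!-- yarn-site.xml: launch containers via the LinuxContainerExecutor -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- delegate resource isolation to cgroups -->
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
```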

Page 9: Hadoop Cluster on Docker Containers

OpenStack Sahara

Page 10: Hadoop Cluster on Docker Containers

Design and Implementation

● Implementation:

– Using a Dockerfile, our solution creates an image with Java, ssh and some basic packages installed, and sets up the image to use the Hadoop build from a folder shared with the host.

– When an instance is created from the image, it starts the ssh daemon by default, allowing further runtime configuration through this channel.
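A sketch of the image described above (base image, package names and paths are assumptions, not the authors' actual files):

```dockerfile
# Hypothetical base image for a Hadoop node: Java plus sshd, with the
# Hadoop build mounted from the host at runtime rather than baked in.
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y openjdk-7-jdk openssh-server && \
    mkdir /var/run/sshd
# Start sshd in the foreground by default so the node can be
# configured at runtime over ssh.
CMD ["/usr/sbin/sshd", "-D"]
```

An instance would then be started with the shared Hadoop build mounted from the host, e.g. `docker run -d -v /opt/hadoop:/opt/hadoop hadoop-node` (paths and image name illustrative).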

● Management:

– The cluster-management library offers an even more abstract API, allowing the client to list and create clusters; start, stop, and get details of a container; and start services in a specific container.
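The shape of such a management API can be sketched as a thin Python wrapper over the docker CLI. This is an illustration of the idea only, not the prototype's actual code: class and method names are invented, and commands are built as argument lists that could be handed to `subprocess.run()`.

```python
class HadoopDockerCluster:
    """Hypothetical sketch: track the named containers of a cluster
    and build the docker commands that would manage them."""

    def __init__(self, image="hadoop-base"):
        self.image = image          # image name is an assumption
        self.containers = []        # names of created containers

    def create(self, name):
        """Return the docker command that would create one node."""
        self.containers.append(name)
        return ["docker", "run", "-d", "--name", name, self.image]

    def start_service(self, name, service):
        """Return the command starting a Hadoop daemon inside a node."""
        return ["docker", "exec", name,
                "hadoop-daemon.sh", "start", service]

    def list_cluster(self):
        """List the nodes known to this cluster."""
        return list(self.containers)
```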

Page 11: Hadoop Cluster on Docker Containers

Hadoop and Fault Tolerance

● HDFS allows the replication of the NameNode (through passive replication), but a failure at the level of the JobTracker forces a job to be restarted.

● On Hadoop 2.x, part of the job-management responsibility is transferred to the ApplicationMaster, which becomes a task manager.

● The loss of the ResourceManager does not block the execution of a job; it only prevents new jobs from being submitted. However, the loss of an ApplicationMaster forces the restart of the job, just as on Hadoop 1.x.

Page 12: Hadoop Cluster on Docker Containers

Handling Hadoop Single Points of Failure

● Fast recovery in the case of a failure
● Small impact on performance
● Adapt to the capacity and context of the nodes

Page 13: Hadoop Cluster on Docker Containers

Validating the Prototype

● Using the Docker-Hadoop dashboard allowed us to analyze different failure scenarios, including:

– Crash of the JobTracker node: we kill the JobTracker to force a new node to resume the JobTracker role.

– Restart of an old JobTracker: we investigate the impacts of the return of an old JobTracker node. Two possibilities are investigated:

● The returning node was simply disconnected from the network and still thinks it is the JobTracker.

● The returning node has restarted and has lost all its status, but is still on the top of Zookeeper's list.

– Heartbeat tuning: a heartbeat that is too lazy slows down the reaction to failures and may lead to some of the situations in the previous item, while an overly intensive heartbeat may negatively impact overall performance.
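The failover behaviour tested above can be sketched as a minimal, self-contained simulation, assuming (as the scenarios suggest) that candidate JobTracker nodes are kept in a ZooKeeper-style ordered list and the head of the list holds the role. Names and structure are illustrative, not the prototype's actual code.

```python
class FailoverList:
    """Hypothetical model of ordered leader candidates; the head of
    the list is the current JobTracker."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def leader(self):
        return self.nodes[0] if self.nodes else None

    def crash(self, node):
        # Node dies (missed heartbeats): remove it from the list.
        self.nodes.remove(node)

    def rejoin(self, node):
        # A restarted node has lost its state and must re-register
        # at the BACK of the list, never reclaim the head.
        self.nodes.append(node)

jt = FailoverList(["node-a", "node-b", "node-c"])
jt.crash("node-a")      # the JobTracker is killed
jt.rejoin("node-a")     # the old JobTracker returns with no state
```

The dangerous case in the slides is precisely when `rejoin` does not behave this way and the returning node is still at the top of the list despite having lost its state.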

Page 14: Hadoop Cluster on Docker Containers

Performance Test

Page 15: Hadoop Cluster on Docker Containers

Performance Test

Execution time analysis when using different numbers of TaskTrackers

Page 16: Hadoop Cluster on Docker Containers

Conclusion

● This presentation explored the use of container-based virtualization to develop a prototyping environment for MapReduce applications.

● The use of Docker-Hadoop allowed us to improve the development speed of our Hadoop solution, as the developers could test their code directly on their own computers.

Page 17: Hadoop Cluster on Docker Containers

References

● IEEE Paper 1

– Title: Efficient Prototyping of Fault Tolerant Map-Reduce Applications with Docker-Hadoop

– Authors: Luiz Angelo Steffenel, Javier Rey, Matias Cogorno and Sergio Nesmachnow, France

– Publication: 2015 IEEE International Conference on Cloud Engineering

● IEEE Paper 2

– Title: Finding the Big Data Sweet Spot: Towards Automatically Recommending Configurations for Hadoop Clusters on Docker Containers

– Authors: Rui Zhang, Min Li* and Dean Hildebrand, IBM Research Almaden; *IBM T.J. Watson Research Center

– Publication: 2015 IEEE International Conference on Cloud Engineering

Page 18: Hadoop Cluster on Docker Containers

Thank You