Hadoop Cluster on Docker Containers


Post on 17-Aug-2015






  1. Hadoop Cluster on Docker Containers: What Works and What Doesn't. By: Pranav Joshi, ME-HPC, GTU PG School
  2. Content: Introduction to Hadoop and Docker; Why Hadoop on Docker?; Job Configuration; OpenStack Sahara; Handling Hadoop Single Point of Failure; Validating the Prototype; Performance Test; Conclusion; References
  3. Introduction to Hadoop: Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Its major components are: Hadoop Common, the common utilities that support the other Hadoop modules; Hadoop Distributed File System (HDFS), a distributed file system that provides high-throughput access to application data; Hadoop YARN, a framework for job scheduling and cluster resource management; and Hadoop MapReduce, a YARN-based system for parallel processing of large data sets.
  4. Introduction to Docker Containers: Docker is an open-source program that packages a Linux application with all of its dependencies into a container, a standardized unit for software development. Containers include the application and all of its dependencies, but share the kernel with other containers.
  5. Why Docker? Lightweight and portable; build once, run anywhere; a VM without the overhead of a VM; isolated containers; automated and scripted.
  6. Separating out simple tasks
  7. Containers vs. VMs
  8. Job Configuration: YARN's ApplicationMaster asks the NodeManager to launch containers (LinuxContainerExecutor). Docker can be used not only for fine-grained performance isolation, but also for delivering software packages.
  9. OpenStack Sahara
  10. Design and Implementation. Implementation: using a Dockerfile, our solution creates an image with Java, ssh and some basic packages installed, and sets up the image to use the Hadoop build in a folder shared with the host. When an instance is created from the image, it starts the ssh daemon by default to allow further runtime configuration through this channel. Management: a cluster-managing library offers an even more abstract API that lets the client list and create clusters; start, stop and get the details of a container; and start a service in a specific container.
  11. Hadoop and Fault Tolerance: HDFS allows the replication of the NameNode (through passive replication), but a failure at the level of the JobTracker forces a job to be restarted. On Hadoop 2.x, part of the job management responsibility is transferred to the ApplicationMaster, which becomes a task manager. The loss of the ResourceManager does not block the execution of a job; it only prevents new jobs from being submitted. However, the loss of an ApplicationMaster forces the restart of the job, just like on Hadoop 1.x.
  12. Handling Hadoop Single Points of Failure: fast recovery in the case of a failure; small impact on performance; adaptation to the capacity and context of the nodes.
  13. Validating the Prototype: the Docker-Hadoop dashboard allowed us to analyze different failure scenarios, including: (a) Crash of the JobTracker node: we kill the JobTracker to force a new node to take over the JobTracker role. (b) Restart of an old JobTracker: we investigate the impact of the return of an old JobTracker node. Two possibilities are investigated: the returning node was simply disconnected from the network and still thinks it is the JobTracker; or the returning node has restarted and has lost all its state, but is still at the top of ZooKeeper's list. (c) Heartbeat tuning: a heartbeat that is too lazy slows down the reaction to failures and may lead to some of the situations in the previous item, while an overly intensive heartbeat may hurt overall performance.
  14. Performance Test
  15. Performance Test: execution time analysis when using different numbers of TaskTrackers
  16. Conclusion: this presentation explored the use of container-based virtual machines to develop a prototyping environment for MapReduce applications. The use of Docker-Hadoop improved the development speed of our Hadoop solution, as the developers could test their code directly on their own computers.
  17. References: (1) Luiz Angelo Steffenel, Javier Rey, Matias Cogorno and Sergio Nesmachnow (France), "Efficient Prototyping of Fault Tolerant Map-Reduce Applications with Docker-Hadoop", 2015 IEEE International Conference on Cloud Engineering. (2) Rui Zhang, Min Li* and Dean Hildebrand (IBM Research and Almaden; *IBM T.J. Watson Research Center), "Finding the Big Data Sweet Spot: Towards Automatically Recommending Configurations for Hadoop Clusters on Docker Containers", 2015 IEEE International Conference on Cloud Engineering.
  18. Thank You
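
Slide 3 describes Hadoop MapReduce as a system for parallel processing of large data sets. As a rough illustration of the programming model only, the sketch below runs the map, shuffle and reduce phases of a word count in a single Python process. It does not use Hadoop; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over the input lines."""
    emitted = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(emitted)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop on Docker", "docker hadoop hadoop"]))
```

On a real cluster the map and reduce calls would run as separate tasks on different nodes; here they are plain function calls so the data flow is easy to follow.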
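
Slide 10 mentions a cluster-managing library that can list and create clusters, start, stop and inspect containers, and start a service in a container. The slides do not show that library, so the following is only a minimal sketch of what such an API might look like. It builds `docker` command-line argument lists without executing them. The class name `ClusterPlan`, the image name `hadoop-base`, and the paths `/opt/hadoop` and `/hadoop` are assumptions for illustration. It also uses `docker exec` to start a service, whereas the slides describe runtime configuration over ssh.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterPlan:
    """Builds docker CLI argument lists for a small cluster (nothing is executed)."""
    name: str
    workers: int
    image: str = "hadoop-base"             # hypothetical image built from the Dockerfile
    host_hadoop_dir: str = "/opt/hadoop"   # hypothetical host folder with the Hadoop build
    container_hadoop_dir: str = "/hadoop"  # hypothetical mount point inside the container

    def container_names(self) -> List[str]:
        """One master plus the requested number of workers."""
        workers = [f"{self.name}-worker{i}" for i in range(1, self.workers + 1)]
        return [f"{self.name}-master"] + workers

    def create_commands(self) -> List[List[str]]:
        """One 'docker run' per container, sharing the Hadoop build as a volume."""
        volume = f"{self.host_hadoop_dir}:{self.container_hadoop_dir}"
        return [
            ["docker", "run", "-d", "--name", n, "--hostname", n, "-v", volume, self.image]
            for n in self.container_names()
        ]

    def list_command(self) -> List[str]:
        """List all containers whose name contains the cluster name."""
        return ["docker", "ps", "-a", "--filter", f"name={self.name}"]

    def start_command(self, container: str) -> List[str]:
        return ["docker", "start", container]

    def stop_command(self, container: str) -> List[str]:
        return ["docker", "stop", container]

    def details_command(self, container: str) -> List[str]:
        return ["docker", "inspect", container]

    def service_command(self, container: str, service_args: List[str]) -> List[str]:
        """Run a command inside a container (docker exec stands in for the ssh channel)."""
        return ["docker", "exec", container, *service_args]


if __name__ == "__main__":
    plan = ClusterPlan(name="demo", workers=2)
    for command in plan.create_commands():
        print(" ".join(command))
```

Returning argument lists rather than running them keeps the sketch testable without a Docker daemon; a caller could pass each list to `subprocess.run` once the image and paths are replaced by real ones.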
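
Slide 13 notes that a lazy heartbeat slows the reaction to failures while an intensive one costs performance. The sketch below illustrates that trade-off with a simple timeout-based failure detector. It is not the mechanism used by Hadoop or Docker-Hadoop; `HeartbeatMonitor`, `missed_allowed` and `messages_per_second` are hypothetical names, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Suspects a node once it has been silent for more than interval * missed_allowed."""
    interval_s: float          # how often each node is expected to send a heartbeat
    missed_allowed: int = 3    # heartbeats that may be missed before suspicion (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    @property
    def timeout_s(self) -> float:
        """Silence longer than this marks a node as suspected."""
        return self.interval_s * self.missed_allowed

    def beat(self, node: str, now: float) -> None:
        """Record a heartbeat received from a node at time 'now' (seconds)."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> List[str]:
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


def messages_per_second(nodes: int, interval_s: float) -> float:
    """Heartbeat traffic the monitor must handle for a given interval."""
    return nodes / interval_s


if __name__ == "__main__":
    for interval in (0.5, 10.0):
        monitor = HeartbeatMonitor(interval_s=interval)
        print(f"interval={interval}s  detection after >{monitor.timeout_s}s  "
              f"load={messages_per_second(100, interval)} msg/s for 100 nodes")
```

A shorter interval lowers the detection timeout but raises message load proportionally, which is the tension the heartbeat tuning scenario is meant to explore.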