jvm bypass for effcient hadoop shuffling
TRANSCRIPT
Presented By Under the Guidance of
PRATHVIRAJ Mrs. SHANTHI
1CR11CS401 Assistant professor
Dept of CSE Dept of CSE
CMRIT CMRIT
1CMRIT 2013-14
Abstract
Introduction
Existing system
proposed system
System Architecture
Implementation
Conclusion
References
2CMRIT 2013-14
Hadoop employs Java-based network transport stack on top of the Java Virtual Machine (JVM) for its data shuffling and merging purposes.
JVM-Bypass Shuffling (JBS) for Hadoop
JBS helps Hadoop data shuffling by avoiding Java based transport protocols, removing the overhead and limitations of the JVM.
3CMRIT 2013-14
MapReduce is a popular programming model that provides a simple and scalable parallel data processing framework.
Hadoop as an open-source implementation of Map Reduce adopted by Yahoo,Facebook
data shuffling
But data shuffling results in great volumes of network traffic, reduces efficiency of data analytics application
4CMRIT 2013-14
JVM introduces significant overhead in managing Java objects results in shrinkage in memory available to hadoop
High speed networks, such as InfiniBandprovide Remote Direct Memory Access(RDMA) that is capable of up to 56Gbps bandwidth, and low CPU utilization.
5CMRIT 2013-14
Hadoop exposes two simple interfaces:map
reduce
Its runtime system consists of four major components: JobTracker, TaskTracker, MapTask, and ReduceTask.
Dominant source of network traffic:5% of large jobs can consume more than 98%network bandwidth
6CMRIT 2013-14
CMRIT 2013-14 7
Execution time
Cpu utilization
Network traffic
Overhead in managing java objects
CMRIT 2013-14 8
No change to existing user interface (map and reduce functions)
Bypass the JVM from the critical path of intermediate data shuffling
Portable layer on top of any network transport protocol
9CMRIT 2013-14
10CMRIT 2013-14
Asynchronous network operations
Increases locality of disk access
Reduces average delay of requests
11CMRIT 2013-14
Consolidates network fetching requests from all Reduce Tasks on a single node
Number of network connections, which is no longer the total amount of MOF Copiers from all Reduce Tasks
Reduces the resource requirements creating and sustaining many network channels
associated memory to buffer data
12CMRIT 2013-14
Connection Establishment for RDMA and RoCE
TCP/IP-Based Communication
CMRIT 2013-14 13
CMRIT 2013-14 14
Currently only Reliable Connection (RC)service provided by RDMA-capable interconnects is supported
512 active connections at a time
connections are torn down based on the LRU when exceeded after threshold
15CMRIT 2013-14
Event-driven model and multiple threads
On the client side,◦ one dedicated thread to prepare connection requests
◦ data threads request connection to remote servers
On the server side◦ one thread is listening for client connection requests
andaccepts them
Both client and server◦ Use epoll interface to monitor and detect events from
concurrent connections
◦ rely on their data threads to perform the network communication for data transfer.
CMRIT 2013-14 16
Simply switching to the high-performance interconnects cannot effectively boost Hadoop’s performance
Overhead imposed by JVM on Hadoopintermediate data shuffling identified
JVM-Bypass Shuffling (JBS) avoids JVM in the critical path of Hadoop data shuffling
portable library to leverage both conventional TCP/IP protocol and high-performance RDMA protocol
reduce the execution time of Hadoop jobs by up to66.3% a
lower the CPU utilization by 48.1%
17CMRIT 2013-14
[1] Apache Hadoop Project. http://hadoop.apache.org/.
[2] Plugin for Generic Shuffle Service. https://issues.apache.org/jira/browse/MAPREDUCE-4049.
[3] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar. Tarazu: optimizing map reduce on heterogeneous clusters. In Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’12, pages 61–74, New York, NY, USA, 2012. ACM.
18CMRIT 2013-14
THANK YOU
19CMRIT 2013-14