JVM Bypass for Efficient Hadoop Shuffling

Presented by: PRATHVIRAJ (1CR11CS401), Dept of CSE, CMRIT

Under the guidance of: Mrs. SHANTHI, Assistant Professor, Dept of CSE, CMRIT

CMRIT 2013-14

Abstract

Introduction

Existing system

Proposed system

System Architecture

Implementation

Conclusion

References

Hadoop employs a Java-based network transport stack on top of the Java Virtual Machine (JVM) for data shuffling and merging.

JVM-Bypass Shuffling (JBS) for Hadoop

JBS speeds up Hadoop data shuffling by avoiding Java-based transport protocols, removing the overhead and limitations of the JVM.

MapReduce is a popular programming model that provides a simple and scalable parallel data processing framework.

Hadoop is an open-source implementation of MapReduce, adopted by Yahoo and Facebook.

Data shuffling, however, generates great volumes of network traffic and reduces the efficiency of data analytics applications.

The JVM introduces significant overhead in managing Java objects, which shrinks the memory available to Hadoop.

High-speed networks such as InfiniBand provide Remote Direct Memory Access (RDMA), which offers up to 56 Gbps of bandwidth with low CPU utilization.

Hadoop exposes two simple interfaces: map and reduce.

Its runtime system consists of four major components: JobTracker, TaskTracker, MapTask, and ReduceTask.

Dominant source of network traffic: 5% of large jobs can consume more than 98% of network bandwidth.
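To make the map and reduce interfaces and the shuffle step between them concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the "shuffle" is only an in-memory grouping by key, whereas in a real cluster that step moves map output across the network.

```python
from collections import defaultdict


def map_fn(line):
    """Hypothetical user-supplied map: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    """Hypothetical user-supplied reduce: sum all counts seen for one key."""
    return key, sum(values)


def run_job(lines):
    # Map phase: each input record yields intermediate (key, value) pairs.
    intermediate = [pair for line in lines for pair in map_fn(line)]

    # Shuffle phase: group intermediate values by key. In a cluster this is
    # the step that transfers map output to the reducers over the network.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["a b a", "b c"])
```

The user only writes `map_fn` and `reduce_fn`; everything in `run_job` stands in for what the runtime does on the user's behalf, which is why the shuffle can be replaced without touching user code.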

Execution time

CPU utilization

Network traffic

Overhead in managing Java objects

No change to existing user interface (map and reduce functions)

Bypass the JVM from the critical path of intermediate data shuffling

Portable layer on top of any network transport protocol

Asynchronous network operations

Increases locality of disk access

Reduces average delay of requests

Consolidates network fetch requests from all ReduceTasks on a single node

Reduces the number of network connections, which is no longer the total number of MOF Copiers across all ReduceTasks

Reduces the resources required to create and sustain many network channels, along with the associated memory to buffer data
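The effect of consolidation on the connection count can be sketched as follows. This is an illustrative model only, not JBS code: the `consolidate` function, the request tuples, and the host and segment names are all assumptions made for the example.

```python
from collections import defaultdict


def consolidate(fetch_requests):
    """Group (reduce_task, remote_host, segment) requests by remote host.

    Hypothetical helper: one entry per remote host models one shared
    network channel used by every ReduceTask on this node.
    """
    per_host = defaultdict(list)
    for reduce_task, remote_host, segment in fetch_requests:
        per_host[remote_host].append((reduce_task, segment))
    return dict(per_host)


# Two ReduceTasks on the same node each fetch from the same two remote nodes.
requests = [
    ("reduce-0", "node-a", "mof-1"),
    ("reduce-0", "node-b", "mof-2"),
    ("reduce-1", "node-a", "mof-1"),
    ("reduce-1", "node-b", "mof-2"),
]

channels = consolidate(requests)

# Without consolidation: one connection per (ReduceTask, remote host) pair.
unconsolidated = len({(task, host) for task, host, _ in requests})
# With consolidation: one connection per remote host.
consolidated = len(channels)
```

In this toy case the count drops from 4 to 2; the saving grows with the number of ReduceTasks per node, since the consolidated count depends only on the number of remote hosts.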

Connection Establishment for RDMA and RoCE

TCP/IP-Based Communication

Currently, only the Reliable Connection (RC) service provided by RDMA-capable interconnects is supported

Up to 512 active connections at a time

When this threshold is exceeded, connections are torn down in least-recently-used (LRU) order
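The connection limit with LRU teardown can be sketched as a small cache. This is a minimal illustration under stated assumptions, not the JBS implementation: the `ConnectionCache` class is hypothetical, connections are modeled as plain strings, and only the default of 512 comes from the slide.

```python
from collections import OrderedDict


class ConnectionCache:
    """Hypothetical cache that keeps at most `max_active` connections."""

    def __init__(self, max_active=512):
        self.max_active = max_active
        self._conns = OrderedDict()  # oldest use first, newest use last
        self.closed = []             # hosts whose connections were torn down

    def get(self, host):
        if host in self._conns:
            # Reuse: mark this connection as most recently used.
            self._conns.move_to_end(host)
            return self._conns[host]
        # Miss: "establish" a new connection (a placeholder string here).
        conn = f"conn->{host}"
        self._conns[host] = conn
        if len(self._conns) > self.max_active:
            # Threshold exceeded: tear down the least recently used one.
            victim, _ = self._conns.popitem(last=False)
            self.closed.append(victim)
        return conn

    def active(self):
        return list(self._conns)


cache = ConnectionCache(max_active=2)  # tiny limit to make eviction visible
for host in ["a", "b", "a", "c"]:
    cache.get(host)
```

After this sequence, `b` is the connection torn down: `a` was touched again before `c` arrived, so `b` was the least recently used.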

Event-driven model and multiple threads

On the client side
◦ one dedicated thread prepares connection requests
◦ data threads request connections to remote servers

On the server side
◦ one thread listens for client connection requests and accepts them

Both client and server
◦ use the epoll interface to monitor and detect events from concurrent connections
◦ rely on their data threads to perform the network communication for data transfer
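The event-driven pattern described above can be sketched with Python's standard `selectors` module, whose `DefaultSelector` is backed by epoll on Linux. This is a single-threaded illustration only: the slides do not show JBS's own code, and the `serve_once` function, the loopback address, and the payload are assumptions made for the example. The separate listener and data threads from the slide are collapsed into one loop here for brevity.

```python
import selectors
import socket


def serve_once(payload=b"segment-0"):
    """Accept one loopback client and read everything it sends."""
    sel = selectors.DefaultSelector()  # uses epoll where available

    # Server side: a listening socket registered for "readable" events,
    # which for a listener means a client is waiting to be accepted.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ, data="accept")

    # Client side: connect, send the payload, and close.
    client = socket.create_connection(listener.getsockname())
    client.sendall(payload)
    client.close()

    received = bytearray()
    done = False
    while not done:
        events = sel.select(timeout=5)
        if not events:
            break  # nothing happened within the timeout; give up
        for key, _mask in events:
            if key.data == "accept":
                conn, _addr = key.fileobj.accept()
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, data="read")
            else:
                chunk = key.fileobj.recv(4096)
                if chunk:
                    received.extend(chunk)
                else:
                    # Empty read: the peer closed the connection.
                    sel.unregister(key.fileobj)
                    key.fileobj.close()
                    done = True

    sel.unregister(listener)
    listener.close()
    sel.close()
    return bytes(received)
```

One selector watches both the listener and every accepted connection, so a single loop can service many concurrent connections without dedicating a blocked thread to each.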

Simply switching to the high-performance interconnects cannot effectively boost Hadoop’s performance

Identified the overhead imposed by the JVM on Hadoop intermediate data shuffling

JVM-Bypass Shuffling (JBS) avoids JVM in the critical path of Hadoop data shuffling

JBS is a portable library that leverages both the conventional TCP/IP protocol and the high-performance RDMA protocol

JBS reduces the execution time of Hadoop jobs by up to 66.3% and lowers CPU utilization by 48.1%

[1] Apache Hadoop Project. http://hadoop.apache.org/.

[2] Plugin for Generic Shuffle Service. https://issues.apache.org/jira/browse/MAPREDUCE-4049.

[3] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar. Tarazu: Optimizing MapReduce on Heterogeneous Clusters. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '12), pages 61–74, New York, NY, USA, 2012. ACM.

THANK YOU
