
Page 1: Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing

Garhan Attebury (1), Andrew Baranovski (2), Ken Bloom (1), Brian Bockelman (1), Dorian Kcira (3), James Letts (4), Tanya Levshina (2), Carl Lundestedt (1), Terrence Martin (4), Will Maier (5), Haifeng Pi (4), Abhishek Rana (4), Igor Sfiligoi (4), Alexander Sim (6), Michael Thomas (3), Frank Wuerthwein (4)

1. University of Nebraska Lincoln  2. Fermi National Accelerator Laboratory  3. California Institute of Technology  4. University of California, San Diego  5. University of Wisconsin Madison  6. Lawrence Berkeley National Laboratory

On Behalf of Open Science Grid (OSG) Storage Hadoop Community

Page 2: Storage, a Critical Component of the Grid

• Grid computing is data-intensive and CPU-intensive, which requires:
  – A scalable management system for bookkeeping and discovering data
  – Reliable and fast tools for distributing and replicating data
  – Efficient procedures for processing and extracting data
  – Advanced techniques for analyzing and storing data in parallel

• A scalable, dynamic, efficient, and easy-to-maintain storage system is on the critical path to the success of grid computing:
  – Meet various data access needs at both the organization and the individual level
  – Maximize CPU usage and efficiency
  – Fit into sophisticated VO policies (e.g. data security, user privileges)
  – Survive "unexpected" usage of the storage system
  – Minimize the cost of ownership
  – Easy to expand, reconfigure, and commission/decommission as requirements change

Page 3: A Case Study: Requirements for a Storage Element (SE) at the Compact Muon Solenoid (CMS)

• Have a credible support model that meets the reliability, availability, and security expectations consistent with the computing infrastructure

• Demonstrate the ability to interface with the existing global data transfer system and the transfer technologies of SRM tools and FTS, as well as the ability to interface with the CMS software locally through ROOT

• Well-defined and reliable behavior for recovery from the failure of any hardware components.

• Well-defined and reliable method of replicating files to protect against the loss of any individual hardware system

• Well-defined and reliable procedure for decommissioning hardware without data loss

• Well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE

• Well-defined interfaces to monitoring systems
• Capable of delivering at least 1 MB/s per batch slot for CMS applications; capable of writing files from the WAN at a rate of at least 125 MB/s while simultaneously writing data from the local farm at an average rate of 20 MB/s
• Job failures due to failure to open a file or deliver data products from the storage system should occur at a level of less than 1 in 10^5

Page 4: Hadoop Distributed File System (HDFS)

• An open-source project hosted by Apache (http://hadoop.apache.org), used by Yahoo! for its search engine at a multi-petabyte scale

• Design goals
  – Reduce the impact of hardware failure
  – Streaming data access
  – Handle large datasets
  – Simple coherency model
  – Portability across heterogeneous platforms

• A scalable distributed cluster file system (a minimal usage sketch follows below)
  – The namespace and image of the whole file system are maintained in a single machine's memory: the NameNode
  – Files are split into blocks and stored across the cluster on DataNodes
  – File blocks can be replicated; the loss of one DataNode can be recovered from the replica blocks on other DataNodes
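
A minimal sketch of how a client writes and reads a file through the Hadoop FileSystem Java API; the NameNode address and file path here are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative NameNode address; a real site would set this in core-site.xml
        conf.set("fs.default.name", "hdfs://namenode.example.org:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; the client streams blocks to DataNodes chosen by the NameNode
        Path path = new Path("/store/user/example.dat");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("hello HDFS");
        out.close();

        // Read it back; block locations are resolved through the NameNode
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
        fs.close();
    }
}
```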

Page 5: Important Components of an HDFS-based SE

• FUSE / Fuse-DFS
  – FUSE is a Linux kernel module that allows file systems to be implemented in userspace; Fuse-DFS uses it to present a POSIX-like interface to HDFS
  – Important for software applications accessing data in the local SE (see the sketch at the end of this page)

• Globus GridFTP
  – Provides WAN transfers between two SEs, or between an SE and a worker node (WN)
  – A special plugin is needed to assemble asynchronously transferred packets for sequential writing to HDFS when multiple streams are used

• BeStMan
  – Provides the SRM interface to HDFS
  – Plugins can be developed to select GridFTP servers according to the status of the GridFTP servers

A number of software bugs and integration issues have been solved over the last 12 months to bring all the components together and make a production-quality SE.
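
Once HDFS is mounted through Fuse-DFS, an application needs no HDFS-specific code at all; a minimal sketch, assuming an illustrative mount point of /mnt/hadoop:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FuseReadExample {
    public static void main(String[] args) throws IOException {
        // Illustrative path: HDFS mounted through Fuse-DFS looks like a local
        // file system, so applications (e.g. CMS software reading through ROOT)
        // use ordinary POSIX file I/O.
        BufferedReader reader = new BufferedReader(
                new FileReader("/mnt/hadoop/store/user/example.dat"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}
```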

Page 6: HDFS SE Architecture for Scientific Computing

[Architecture diagram] The SE is built from the following node types:

• Worker nodes, each optionally also running a DataNode and a GridFTP server, with FUSE and the Hadoop client installed
• Dedicated GridFTP nodes with FUSE and the Hadoop client
• Dedicated DataNodes with the Hadoop client
• The NameNode, plus a secondary NameNode
• A BeStMan node with FUSE and the Hadoop client
• GUMS, providing proxy-to-user mapping

Page 7: HDFS-based SE at CMS Tier-2

• Three CMS Tier-2 sites, Nebraska, Caltech, and UCSD, have currently deployed an HDFS-based SE
  – On average 6-12 months of operational experience, with increasing total disk space
  – Currently around 100 DataNodes and 300 to 500 TB at each site
  – Successfully serves the CMS collaboration, with up to thousands of grid users and hundreds of local users accessing the datasets in HDFS
  – Successfully serves the data operations and Monte Carlo production run by CMS

• Benefits the new SE brings to these sites
  – Reliability: file loss is stopped by the solid file replication scheme run by HDFS
  – Simple deployment: most of the deployment procedure is streamlined, with fewer commands to be run by the administrators
  – Easy operation: a stable system, little effort for system/file recovery, and less than 30 minutes per day for daily operation and user support
  – Proven scalability in supporting a large number of simultaneous read/write operations, with high throughput in serving data to grid jobs running at the site

Page 8: Highlights of the Operational Performance of the HDFS SE

• Stably delivers ~3 MB/s to applications in the cluster while the cluster is fully loaded with jobs
  – Sufficient for CMS applications' I/O requirements, with high CPU efficiency
  – CMS applications are IOPS-limited, not bandwidth-limited

• The HDFS NameNode serves 2500 user requests per second
  – Sufficient for a cluster with thousands of cores running I/O-intensive jobs

• Sustained WAN transfer rate of 400 MB/s
  – Sufficient for CMS Tier-2 data operations (dataset transfers and stage-out of user analysis jobs)

• BeStMan simultaneously processes thousands of client requests
  – Sustained endpoint processing rate of 50 Hz
  – Sufficient for high-rate transfers of gigabyte-sized files and for uncontrolled, chaotic user jobs

• Extremely low observed file corruption rate
  – A benefit of the robust and fast file replication of HDFS

• Decommissioning a DataNode takes < 1 hour, the NameNode restarts in 1 minute, and checking the file system image (from NameNode memory) takes 10 seconds
  – Fast and efficient for operations

• Survived various stress tests involving HDFS, BeStMan, GridFTP, ...

Page 9: Data Transfer to HDFS-SE

Page 10: NameNode Operation Count

Page 11: Processing Rate at SRM Endpoint

Page 12: Monitoring and Routine Tests

• Integration with the general grid monitoring infrastructure
  – Nagios, Ganglia, MonALISA
  – CPU, memory, and network statistics for the NameNode, the DataNodes, and the whole system

• HDFS monitoring (see the sketch at the end of this page)
  – Hadoop web service, Hadoop Chronicle, Jconsole

• Status of the file system and users
  – Logs of the NameNode, DataNodes, GridFTP, and BeStMan
  – Reviewed as part of the daily tasks and debugging activities

• Regular low-stress tests performed by the CMS VO
  – Test analysis jobs, load tests of file transfers
  – Part of the daily commissioning of the site involves local and remote I/O of the SE

• Intentional failures in various parts of the SE, with demonstrated recovery mechanisms
  – Documented recovery procedures
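
Because the NameNode is a JVM process, the same metrics that Jconsole displays can also be read over JMX. A minimal sketch using the standard javax.management API; the host, port, and the assumption that remote JMX has been enabled on the NameNode are all illustrative:

```java
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NameNodeJmxProbe {
    public static void main(String[] args) throws Exception {
        // Illustrative endpoint; requires the NameNode JVM to be started with
        // remote JMX enabled (e.g. -Dcom.sun.management.jmxremote.port=8004).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://namenode.example.org:8004/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();

        // List every registered MBean; the NameNode publishes its metrics
        // beans here, alongside the standard JVM memory and thread beans.
        Set<ObjectName> names = mbs.queryNames(null, null);
        for (ObjectName name : names) {
            System.out.println(name);
        }
        connector.close();
    }
}
```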

Page 13: Load Test Between Two HDFS-SEs

Page 14: Data Security and Integrity

• Security concerns
  – HDFS
    • No encryption or strong authentication between client and server; HDFS must only be exposed to a secure internal network
    • In practice, a firewall or NAT is needed to properly isolate HDFS from direct "public" access
    • The latest HDFS implements access tokens; a transition to Kerberos-based components is expected in 2010
  – Grid components (GridFTP and BeStMan)
    • Use standard GSI security with VOMS extensions

• Data integrity and consistency of the file system (see the sketch at the end of this page)
  – HDFS checksums for blocks of data
  – Command-line tools to check blocks, directories, and files
  – HDFS keeps multiple journals and file system images
  – The NameNode periodically requests the entire block report from all DataNodes
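
The checksums HDFS maintains are also exposed to clients: the `hadoop fsck` command checks them from the shell, and a file-level checksum can be read programmatically. A minimal sketch, with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // HDFS computes CRC32 checksums per block as data is written; the
        // aggregated file checksum can be compared between replicas or sites.
        Path path = new Path("/store/user/example.dat");  // illustrative path
        FileChecksum checksum = fs.getFileChecksum(path);
        System.out.println(path + ": " + checksum);
        fs.close();
    }
}
```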

Page 15: A Combined Release Infrastructure at OSG and CMS

• The various upstream open-source projects provide all the necessary packages
  – HDFS, FUSE, BeStMan, GridFTP plugins, BeStMan plugins, ...

• All software components needed to deploy the Hadoop-based SE are packaged as RPMs
  – With add-on configuration and scripts, so a site can install with minimal changes according to its conditions and requirements

• Consistency checks and validation are done at selected sites by HDFS-SE experts before the formal release via OSG
  – A testbed for common platforms and scalability tests

• Development in 2010
  – Release procedure to be fully integrated into the standard OSG distribution, the Virtual Data Toolkit (VDT)
  – Possibility of some intersection with external commercial packagers, e.g., using selected RPMs from Cloudera

Page 16: Site-Specific Optimization

Various optimizations can be done at each site based on its usage patterns and local hardware conditions (a tuning sketch follows the list below):

• Block size for files
• Number of file replicas
• Architecture of the GridFTP server deployment
  – A few high-performance GridFTP servers vs. many GridFTP servers running on the worker nodes
• Memory allocation at the worker node (WN) for GridFTP, applications, ...
• Selection of GridFTP servers
  – Real-time-monitoring-based GridFTP selection using CPU and memory usage vs. randomly picking a live GridFTP server
• Data access with MapReduce
  – A special case for data processing
• Rack awareness
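
Block size and replica count, for example, can be set cluster-wide in hdfs-site.xml or overridden per file through the FileSystem API; a minimal sketch, with illustrative values and path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TuningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/store/user/large-dataset.dat");  // illustrative
        // Per-file overrides: 2 replicas and a 128 MB block size, chosen here
        // purely for illustration; the right values depend on the site.
        short replication = 2;
        long blockSize = 128L * 1024 * 1024;
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        FSDataOutputStream out =
                fs.create(path, true, bufferSize, replication, blockSize);
        out.write(new byte[]{42});
        out.close();

        // The replica count of an existing file can also be changed later.
        fs.setReplication(path, (short) 3);
        fs.close();
    }
}
```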

Page 17: Summary of Our Experience

• A Hadoop-based storage solution is established and functioning at the CMS Tier-2 level, as an example of data- and CPU-intensive HPC
  – Flexible in its architecture, involving various grid components
  – Scalable and stable
  – Seamlessly interfaced with various grid middleware

• Lower costs in deployment, maintenance, and required hardware
  – Significantly reduces manpower and increases QoS
  – Easy to adapt to existing/new hardware and changing requirements
  – Standard release for the whole community
  – Experts available to help solve technical problems

• VOs and grid sites benefit from the reliable HDFS file replication and distribution scheme
  – High data security and integrity
  – Excellent I/O performance for CPU- and data-intensive grid applications
  – Less administrator intervention

HDFS has been shown to integrate seamlessly into a grid storage solution for a Virtual Organization (VO) or grid site.

Page 18: Roadmap for the Near Future

• Deployment in a variety of scientific computing projects, experiments, and institutions
  – As an integrated storage element solution
  – As a storage file system

• Benchmarking performance for HPC with data- and CPU-intensive grid computing
  – Scalability, stability, usability
  – Integration and efficiency with other tools

• Organization
  – Seamless integration between the scientific user community and the HDFS development community
  – Consolidation of scientific releases and technical support

• New development and contributions from the scientific community
  – Funding proposals based on HDFS infrastructure and technology
  – Improvement in I/O capacity and full integration as a critical component of the Storage Element
  – Operational optimization at different scales of data and compute infrastructure