myHadoop 0.30
DESCRIPTION
Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure: how to install it, spin up a Hadoop cluster, and use the new features. myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop) and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz).

TRANSCRIPT
SAN DIEGO SUPERCOMPUTER CENTER
myHadoop 0.30 and Beyond

Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center
Hadoop and HPC

PROBLEM: domain scientists aren't using Hadoop
- a toy for computer scientists and salespeople
- Java is not high-performance
- don't want to learn computer science and Java

SOLUTION: make Hadoop easier for HPC users
- use existing HPC clusters and software
- use Perl/Python/C/C++/Fortran instead of Java
- make starting Hadoop as easy as possible
Compute: Traditional vs. Data-Intensive

Traditional HPC
- CPU-bound problems
- Solution: OpenMP- and MPI-based parallelism

Data-Intensive
- I/O-bound problems
- Solution: map/reduce-based parallelism
Architecture for Both Workloads

PROs
- High-speed interconnect
- Complementary object storage
- Fast CPUs, RAM
- Less faulty

CONs
- Nodes aren't storage-rich
- Transferring data between HDFS and object storage*

* unless using Lustre, S3, etc. backends
Add Data Analysis to Existing Compute Infrastructure
myHadoop – 3-step Install

1. Download Apache Hadoop 1.x and myHadoop 0.30:

   ```
   $ wget http://apache.cs.utah.edu/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
   $ wget glennklockwood.com/files/myhadoop-0.30.tar.gz
   ```

2. Unpack both Hadoop and myHadoop:

   ```
   $ tar zxvf hadoop-1.2.1-bin.tar.gz
   $ tar zxvf myhadoop-0.30.tar.gz
   ```

3. Apply the myHadoop patch to Hadoop:

   ```
   $ cd hadoop-1.2.1/conf
   $ patch < ../myhadoop-0.30/myhadoop-1.2.1.patch
   ```
myHadoop – 3-step Cluster

1. Set a few environment variables:

   ```
   $ module load hadoop    # sets HADOOP_HOME, JAVA_HOME, and PATH
   $ export HADOOP_CONF_DIR=$HOME/mycluster.conf
   ```

2. Run myhadoop-configure.sh to set up Hadoop:

   ```
   $ myhadoop-configure.sh -s /scratch/$USER/$PBS_JOBID
   ```

3. Start the cluster with Hadoop's start-all.sh:

   ```
   $ start-all.sh
   ```
Easy Wordcount in Python

mapper.py:

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )
```

reducer.py:

```python
#!/usr/bin/env python
import sys

last_key = None
running_tot = 0
for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)
    if last_key == this_key:
        running_tot += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_tot) )
        running_tot = value
        last_key = this_key

if last_key == this_key:
    print( "%s\t%d" % (last_key, running_tot) )
```

https://github.com/glennklockwood/hpchadoop/tree/master/wordcount.py
Easy Wordcount in Python
```
$ hadoop dfs -put ./input.txt mobydick.txt
$ hadoop jar \
    /opt/hadoop/contrib/streaming/hadoop-streaming-1.1.1.jar \
    -mapper "python $PWD/mapper.py" \
    -reducer "python $PWD/reducer.py" \
    -input mobydick.txt \
    -output output
$ hadoop dfs -cat output/part-* > ./output.txt
```
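Because Hadoop streaming just pipes text records through stdin/stdout with a sort between the two stages, the wordcount logic can be sanity-checked without a cluster. A minimal sketch in plain Python, mirroring the mapper and reducer logic above (inlined as functions rather than shelling out to the scripts):

```python
def run_mapper(text):
    # Mapper stage: emit one "word\t1" record per word.
    records = []
    for line in text.splitlines():
        for key in line.strip().split():
            records.append("%s\t%d" % (key, 1))
    return records

def run_reducer(sorted_records):
    # Reducer stage: sum consecutive counts per key.  Like the real
    # reducer, this assumes its input is sorted, so all records for
    # one key arrive together (Hadoop's shuffle guarantees this).
    results, last_key, running_tot = [], None, 0
    for record in sorted_records:
        this_key, value = record.split("\t", 1)
        if last_key == this_key:
            running_tot += int(value)
        else:
            if last_key is not None:
                results.append((last_key, running_tot))
            last_key, running_tot = this_key, int(value)
    if last_key is not None:
        results.append((last_key, running_tot))
    return results

# map -> shuffle (sorted) -> reduce, all in-process
counts = dict(run_reducer(sorted(run_mapper("the quick fox\nthe lazy dog the end"))))
print(counts["the"])   # 3
```

A real run simply replaces `sorted(...)` with Hadoop's shuffle and the function calls with the streaming-jar invocation shown above.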
Data-Intensive Performance Scaling

- 8-node Hadoop cluster on Gordon
- 8 GB VCF (Variant Call Format) file
- 9x speedup using 2 mappers/node
Data-Intensive Performance Scaling

- 7.7 GB of chemical evolution data
- 270 MB/s processing rate
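As a quick sanity check on those numbers (assuming 1 GB = 1024 MB here), the quoted rate implies the whole dataset is processed in about half a minute:

```python
data_mb = 7.7 * 1024        # 7.7 GB of chemical evolution data, in MB
rate_mb_per_s = 270         # aggregate processing rate from the slide
seconds = data_mb / rate_mb_per_s
print(round(seconds, 1))    # about 29.2 s end to end
```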
Advanced Features – Usability

- System-wide default configurations (myhadoop-0.30/conf/myhadoop.conf):
  - MH_SCRATCH_DIR – specify the location of node-local storage for all users
  - MH_IPOIB_TRANSFORM – specify a regex to transform node hostnames into IP-over-InfiniBand hostnames
- Users can remain totally ignorant of scratch disks and InfiniBand
- Literally define HADOOP_CONF_DIR and run myhadoop-configure.sh with no parameters – myHadoop figures out everything else
Advanced Features – Usability

- Parallel filesystem support:
  - HDFS on Lustre via myHadoop persistent mode (-p)
  - Direct Lustre support (IDH)
  - No performance loss at smaller scales for HDFS on Lustre
- Resource managers supported in a unified framework:
  - Torque 2.x and 4.x – tested on SDSC Gordon
  - SLURM 2.6 – tested on TACC Stampede
  - Grid Engine
  - Can support LSF, PBSpro, Condor easily (testbeds needed)
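Much of that unification comes down to discovering the node list each scheduler hands a job. The environment variables below are the schedulers' real ones, but the dispatch logic is an illustrative sketch, not myHadoop's actual code:

```python
import os

def get_node_list():
    """Return the hostnames allocated to this batch job, scheduler-agnostically."""
    if "PBS_NODEFILE" in os.environ:       # Torque/PBS writes one host per slot
        with open(os.environ["PBS_NODEFILE"]) as f:
            return sorted({line.strip() for line in f if line.strip()})
    if "SLURM_NODELIST" in os.environ:     # SLURM exports a (possibly compressed) list
        # Real code would expand ranges like "gcn-[12-14]" (e.g. via
        # `scontrol show hostnames`); assume a plain comma-separated list here.
        return os.environ["SLURM_NODELIST"].split(",")
    raise RuntimeError("no supported resource manager detected")

# Fabricated SLURM environment, just to exercise the sketch:
os.environ.pop("PBS_NODEFILE", None)
os.environ["SLURM_NODELIST"] = "gcn-14-12,gcn-14-13"
print(get_node_list())   # ['gcn-14-12', 'gcn-14-13']
```

Once the node list is known, the rest (picking a namenode, writing the slaves file) is scheduler-independent, which is what makes adding LSF or Condor mostly a matter of having a testbed.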
New Features

- Perl reimplementation for expandability
- Scalability improvements:
  - Separate namenode, secondary namenode, and jobtrackers
  - Automatic switchover based on cluster-size best practices
- Hadoop 2.0 / YARN support (halfway there)
- Spark integration