Jan 2013 HUG: DistCp v2 (HUG, 20130116)
DESCRIPTION
TRANSCRIPT
Me
› Yahoo!: HCatalog, Hive, GDM
› Firmware Engineer at Hewlett Packard
› Fluent Hindi
› Gold medal at the nationals, last year.
04/08/2023 Yahoo! Presentation, Confidential
Prelude
“Legacy” DistCp
Inter-cluster file copy using Map/Reduce Command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target
Algo:
1. Enumerate source paths: for (FileStatus f : FileSystem.globStatus(sourcePath)) { recurse(f); }
2. Write the resulting file-list to ~/_distCp_WIP_201301161600/file.list
3. InputSplit calculation:
   1. Divide paths into 'm' groups ("splits"), one per mapper
   2. Total file size in each split is roughly equal to the others.
4. Launch MR job:
   1. Each Map task copies the files specified in its InputSplit.
Source-path: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/tools/org/apache/hadoop/tools/DistCp.java
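Step 3 above can be sketched in a few lines of plain Java. This is illustrative only: the greedy "largest file into the currently smallest group" heuristic is one simple way to get roughly equal split sizes, not necessarily the exact logic in the DistCp.java linked above, and all names here are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of InputSplit calculation: divide files into 'm'
// groups ("splits") whose total sizes come out roughly equal.
public class SplitSketch {

    // fileSizes: size in bytes of each file to copy; m: number of mappers.
    static List<List<Long>> makeSplits(List<Long> fileSizes, int m) {
        List<List<Long>> splits = new ArrayList<>();
        long[] totals = new long[m];
        for (int i = 0; i < m; i++) splits.add(new ArrayList<>());

        List<Long> sorted = new ArrayList<>(fileSizes);
        sorted.sort(Collections.reverseOrder());       // biggest files first
        for (long size : sorted) {
            int smallest = 0;                          // group with fewest bytes so far
            for (int i = 1; i < m; i++)
                if (totals[i] < totals[smallest]) smallest = i;
            splits.get(smallest).add(size);
            totals[smallest] += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six files, two mappers: each split ends up with 1200 bytes total.
        List<List<Long>> splits =
            makeSplits(Arrays.asList(900L, 800L, 300L, 200L, 100L, 100L), 2);
        for (List<Long> split : splits) {
            long total = 0;
            for (long s : split) total += s;
            System.out.println(split + " total=" + total);
        }
    }
}
```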
The (Unfortunate) Beginning
Data Management on Y! Clusters
Grid Data Management (GDM)
› Life-cycle management for data on Hadoop clusters
› 1+ petabytes per day.
Facets:
1. Acquisition: Warehouse -> Cluster
2. Replication: Cluster -> Cluster
3. "Retention": Eviction of old data
4. Workflow tracking, metrics, monitoring, user-dashboards, configuration management, etc.

GDM Replication:
1. Use DistCp!
2. Don't re-implement an MR job for file-copy.
A marriage doomed to fail
Poor programmatic use:
› DistCp.main(new String[] {"-m", "20", "hftp://source_nn1:50070/source", "hdfs://target_nn1:8020/target"});
› Blocking call
› Equal-size copy-distribution: can't be overridden.
Long setup-times:
› Optimization: file.list contains only files that are changed/absent on the target
› Compare checksums
› E.g. experiment with a 200 GB dataset: 14 minutes of setup time.
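That file.list optimization amounts to a filter over the source listing. A rough sketch in plain Java (hypothetical names, not the actual DistCp source): include a source file only if it is absent on the target, or its size or checksum differs.

```java
import java.util.*;

// Illustrative model of the setup-time optimization: decide which source
// files actually need copying by comparing size and checksum with the target.
public class CopyFilter {
    static class Meta {
        final long size; final String checksum;
        Meta(long size, String checksum) { this.size = size; this.checksum = checksum; }
    }

    // Returns the source paths that must go into file.list.
    static List<String> filesToCopy(Map<String, Meta> source, Map<String, Meta> target) {
        List<String> toCopy = new ArrayList<>();
        for (Map.Entry<String, Meta> e : source.entrySet()) {
            Meta t = target.get(e.getKey());
            // Copy if absent on the target, or if size/checksum mismatch.
            if (t == null || t.size != e.getValue().size
                    || !t.checksum.equals(e.getValue().checksum)) {
                toCopy.add(e.getKey());
            }
        }
        return toCopy;
    }
}
```

Fetching and comparing checksums for every file is exactly what made setup slow; the 200 GB experiment above spent 14 minutes here before the MR job even launched.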
Atomic commit:
› Example: Oozie workflow launch on data availability
› Premature consumption
› Workarounds: _DONE_ markers:
  • Name-node pressure
  • Hacks to ignore them at the source
Others
DistCp Redux (available in Hadoop 0.23/2.0)
Changes
(Almost) identical command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US/ \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target/
Reduced setup-times:
› Postpone everything to the MR job
› E.g. experiment with a 200 GB dataset:
  • Old: 14 minutes of setup time
  • New: 7 seconds
Improved copy-times:
› Large-dataset copy test: time cut from 17 hours to 7 hours.
Atomic commit:
hadoop distcp -atomic -tmp /home/mithunr/tmp /source /target
Improved programmatic use:
› options = new DistCpOptions(srcPaths, destPath).preserve(BLOCKSIZE).setBlocking(false);
› Job job = new DistCp(hadoopConf, options).execute();
Others:
› Bandwidth throttling, asynchronous mode, configurable copy-strategies.
Cost of copy
Copy-time is directly proportional to file-size
› (All else being equal)
Long-tailed MR jobs
› Copy twenty 2 GB files between clusters. Why does one take longer than the rest?
› Hint: Sometimes a file is slow initially, then speeds up after a "block boundary".
Are data-nodes equivalent?
› Slower hard-drives
› Failing NICs
› Misconfiguration
Take a closer look at the command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US/ \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target/
Reads over hdfs://
[Diagram: a User Program's DFS Client reading from five Data Nodes over hdfs://]
Reads over hftp://
[Diagram: a User Program's DFS Client reading from five Data Nodes over hftp://]
Long-tails
/datasets/search/US/20130101/part-00000.gz
/datasets/search/US/20130101/part-00001.gz
/datasets/search/US/20130101/part-00002.gz
/datasets/search/US/20130101/part-00003.gz
/datasets/search/US/20130101/part-00004.gz
/datasets/search/US/20130101/part-00005.gz
/datasets/search/US/20130101/part-00006.gz
/datasets/search/US/20130101/part-00007.gz
/datasets/search/US/20130101/part-00008.gz
/datasets/search/US/20130101/part-00009.gz
...
Input Split #1
STUCK!
SPLIT!
Mitigation
Break static binding between InputSplits and Mappers
E.g. Consider DistCp of N files with 10 mappers:
1. Don’t create 10 InputSplits. Create 20 instead.
2. Store each InputSplit as a separate file.
1. hdfs://home/mithunr/_distcp_20130116/work-pool/
3. Mapper consumes one InputSplit and checks for more.
4. Mappers quit when no more InputSplits are left.
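The four steps above can be modeled in a few lines of plain Java, with threads standing in for map tasks and an in-memory queue standing in for the HDFS work-pool directory. All names here are hypothetical; this is a sketch of the scheme, not DistCp's DynamicInputFormat itself.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Model of the work-pool scheme: splits sit in a shared pool, and each
// "mapper" keeps pulling splits until the pool is empty, then quits.
public class WorkPoolSketch {

    // Returns the total number of splits processed across all mappers.
    static int runMappers(Queue<String> workPool, int mappers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(mappers);
        AtomicInteger processed = new AtomicInteger();
        for (int m = 0; m < mappers; m++) {
            pool.submit(() -> {
                // Consume one split, then check for more; quit when none are left.
                while (workPool.poll() != null) {
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> workPool = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 20; i++) workPool.add("split-" + i);  // 20 splits, 10 mappers
        System.out.println("splits processed: " + runMappers(workPool, 10));
    }
}
```

A fast mapper simply pulls more splits than a stuck one, which is what kills the long tail.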
Single file per InputSplit?
› NameNode pressure.
DynamicInputFormat
› Separate library
Perf:
› Worst-case is no worse than UniformSizeInputFormat
› Best-case: 17 hours -> 7 hours.
Future
Block-level parallelism
› Stream blocks individually
› Stitch at the end: metadata
Yarn
› Master-worker paradigm
_DONE_