Jan 2013 HUG: DistCp v2 (HUG, 20130116)
DESCRIPTION
TRANSCRIPT
Me
› Yahoo!: HCatalog, Hive, GDM
› Firmware Engineer at Hewlett Packard
› Fluent Hindi
› Gold medal at the nationals, last year.
04/08/2023 Yahoo! Presentation, Confidential
Prelude
“Legacy” DistCp
Inter-cluster file copy using Map/Reduce Command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target
Algo:
1. Enumerate source paths: for (FileStatus f : FileSystem.globStatus(sourcePath)) { recurse(f); }
2. Write the resulting file-list to ~/_distCp_WIP_201301161600/file.list
3. InputSplit calculation:
   1. Divide paths into 'm' groups ("splits"), one per mapper
   2. Total file size in each split is roughly equal to the others.
4. Launch MR job:
   1. Each Map task copies the files specified in its InputSplit.
Source-path: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0/src/tools/org/apache/hadoop/tools/DistCp.java
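Step 3 above can be sketched in a few lines of plain Java. This is illustrative only: the greedy "largest file into the currently smallest group" heuristic is one simple way to get roughly equal split sizes, not necessarily the exact logic in the DistCp.java linked above, and all names here are hypothetical.

```java
import java.util.*;

// Hypothetical sketch of InputSplit calculation: divide files into 'm'
// groups ("splits") whose total sizes come out roughly equal.
public class SplitSketch {

    // fileSizes: size in bytes of each file to copy; m: number of mappers.
    static List<List<Long>> makeSplits(List<Long> fileSizes, int m) {
        List<List<Long>> splits = new ArrayList<>();
        long[] totals = new long[m];
        for (int i = 0; i < m; i++) splits.add(new ArrayList<>());

        List<Long> sorted = new ArrayList<>(fileSizes);
        sorted.sort(Collections.reverseOrder());       // biggest files first
        for (long size : sorted) {
            int smallest = 0;                          // group with fewest bytes so far
            for (int i = 1; i < m; i++)
                if (totals[i] < totals[smallest]) smallest = i;
            splits.get(smallest).add(size);
            totals[smallest] += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six files, two mappers: each split ends up with 1200 bytes total.
        List<List<Long>> splits =
            makeSplits(Arrays.asList(900L, 800L, 300L, 200L, 100L, 100L), 2);
        for (List<Long> split : splits) {
            long total = 0;
            for (long s : split) total += s;
            System.out.println(split + " total=" + total);
        }
    }
}
```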
The (Unfortunate) Beginning
Data Management on Y! Clusters
Grid Data Management (GDM)
› Life-cycle management for data on Hadoop clusters
› 1+ petabytes per day.
Facets:
1. Acquisition: Warehouse -> Cluster
2. Replication: Cluster -> Cluster
3. "Retention": Eviction of old data
4. Workflow tracking, metrics, monitoring, user-dashboards, configuration management, etc.

GDM Replication:
1. Use DistCp!
2. Don't re-implement an MR job for file-copy.
A marriage doomed to fail
Poor programmatic use:
› DistCp.main(new String[] {"-m", "20", "hftp://source_nn1:50070/source", "hdfs://target_nn1:8020/target"});
› Blocking call
› Equal-size copy-distribution: can't be overridden.
Long setup-times:
› Optimization: file.list contains only files that are changed/absent on the target
› Compare checksums
› E.g. experiment with a 200 GB dataset: 14 minutes of setup time.
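That file.list optimization amounts to a filter over the source listing. A rough sketch in plain Java (hypothetical names, not the actual DistCp source): include a source file only if it is absent on the target, or its size or checksum differs.

```java
import java.util.*;

// Illustrative model of the setup-time optimization: decide which source
// files actually need copying by comparing size and checksum with the target.
public class CopyFilter {
    static class Meta {
        final long size; final String checksum;
        Meta(long size, String checksum) { this.size = size; this.checksum = checksum; }
    }

    // Returns the source paths that must go into file.list.
    static List<String> filesToCopy(Map<String, Meta> source, Map<String, Meta> target) {
        List<String> toCopy = new ArrayList<>();
        for (Map.Entry<String, Meta> e : source.entrySet()) {
            Meta t = target.get(e.getKey());
            // Copy if absent on the target, or if size/checksum mismatch.
            if (t == null || t.size != e.getValue().size
                    || !t.checksum.equals(e.getValue().checksum)) {
                toCopy.add(e.getKey());
            }
        }
        return toCopy;
    }
}
```

Fetching and comparing checksums for every file is exactly what made setup slow; the 200 GB experiment above spent 14 minutes here before the MR job even launched.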
Atomic commit:
› Example: Oozie workflow launch on data availability
› Premature consumption
› Workarounds: _DONE_ markers:
  • Name-node pressure
  • Hacks to ignore them at the source
Others
DistCp Redux (available in Hadoop 0.23/2.0)
Changes
(Almost) identical command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US/ \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target/
Reduced setup-times:
› Postpone everything to the MR job
› E.g. experiment with a 200 GB dataset:
  • Old: 14 minutes of setup time
  • New: 7 seconds
Improved copy-times:
› Large-dataset copy test: time cut from 17 hours to 7 hours.
Atomic commit:
hadoop distcp -atomic -tmp /home/mithunr/tmp /source /target
Improved programmatic use:
› options = new DistCpOptions(srcPaths, destPath).preserve(BLOCKSIZE).setBlocking(false);
› Job job = new DistCp(hadoopConf, options).execute();
Others:
› Bandwidth throttling, asynchronous mode, configurable copy-strategies.
Cost of copy
Copy-time is directly proportional to file-size
› (All else being equal)
Long-tailed MR jobs
› Copy twenty 2 GB files between clusters. Why does one take longer than the rest?
› Hint: Sometimes a file is slow initially, then speeds up after a "block boundary".
Are data-nodes equivalent?
› Slower hard-drives
› Failing NICs
› Misconfiguration
Take a closer look at the command-line:
hadoop distcp -m 20 \
hftp://source_nn:50070/datasets/search/20120523/US/ \
hftp://source_nn:50070/datasets/search/20120523/UK \
hdfs://target_nn:8020/home/mithunr/target/
Reads over hdfs://
[Diagram: a User Program's DFS Client reading from five Data Nodes over hdfs://]
Reads over hftp://
[Diagram: a User Program's DFS Client reading from five Data Nodes over hftp://]
Long-tails
/datasets/search/US/20130101/part-00000.gz
/datasets/search/US/20130101/part-00001.gz
/datasets/search/US/20130101/part-00002.gz
/datasets/search/US/20130101/part-00003.gz
/datasets/search/US/20130101/part-00004.gz
/datasets/search/US/20130101/part-00005.gz
/datasets/search/US/20130101/part-00006.gz
/datasets/search/US/20130101/part-00007.gz
/datasets/search/US/20130101/part-00008.gz
/datasets/search/US/20130101/part-00009.gz
...
Input Split #1
STUCK!
SPLIT!
Mitigation
Break static binding between InputSplits and Mappers
E.g. Consider DistCp of N files with 10 mappers:
1. Don’t create 10 InputSplits. Create 20 instead.
2. Store each InputSplit as a separate file.
1. hdfs://home/mithunr/_distcp_20130116/work-pool/
3. Mapper consumes one InputSplit and checks for more.
4. Mappers quit when no more InputSplits are left.
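The four steps above can be modeled in a few lines of plain Java, with threads standing in for map tasks and an in-memory queue standing in for the HDFS work-pool directory. All names here are hypothetical; this is a sketch of the scheme, not DistCp's DynamicInputFormat itself.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Model of the work-pool scheme: splits sit in a shared pool, and each
// "mapper" keeps pulling splits until the pool is empty, then quits.
public class WorkPoolSketch {

    // Returns the total number of splits processed across all mappers.
    static int runMappers(Queue<String> workPool, int mappers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(mappers);
        AtomicInteger processed = new AtomicInteger();
        for (int m = 0; m < mappers; m++) {
            pool.submit(() -> {
                // Consume one split, then check for more; quit when none are left.
                while (workPool.poll() != null) {
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> workPool = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 20; i++) workPool.add("split-" + i);  // 20 splits, 10 mappers
        System.out.println("splits processed: " + runMappers(workPool, 10));
    }
}
```

A fast mapper simply pulls more splits than a stuck one, which is what kills the long tail.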
Single file per InputSplit?
› NameNode pressure.
DynamicInputFormat
› Separate library
Perf:
› Worst-case is no worse than UniformSizeInputFormat
› Best-case: 17 hours -> 7 hours.
Future
Block-level parallelism
› Stream blocks individually
› Stitch at the end: metadata
Yarn
› Master-worker paradigm
_DONE_