Next Generation of Apache Hadoop MapReduce, Owen O’Malley, @owen_omalley

Upload: huguk

Post on 11-May-2015


TRANSCRIPT

Page 1: Next Generation of Hadoop MapReduce

Next Generation of Apache Hadoop MapReduce

Owen O’Malley

@owen_omalley

Page 2: Next Generation of Hadoop MapReduce

What is Hadoop?

A framework for storing and processing big data on lots of commodity machines.

- Up to 4,000 machines in a cluster

- Up to 20 PB in a cluster

Open Source Apache project

High reliability done in software

- Automated failover for data and computation

Implemented in Java

Primary data analysis platform at Yahoo!

- 40,000+ machines running Hadoop

Page 3: Next Generation of Hadoop MapReduce

What is Hadoop?

HDFS – Distributed File System

- Combines cluster’s local storage into a single namespace.

- All data is replicated to multiple machines.

- Provides locality information to clients

MapReduce

- Batch computation framework

- Tasks re-executed on failure

- User code wrapped around a distributed sort

- Optimizes for data locality of input
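The "user code wrapped around a distributed sort" idea can be illustrated with a minimal single-process sketch. This is not Hadoop's API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and an in-memory sort stands in for the distributed sort that the framework performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """User map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(word, counts):
    """User reduce function: sum the counts collected for one word."""
    return word, sum(counts)


def run_job(lines, mapper, reducer):
    """Toy framework: map every record, sort by key, reduce each key group."""
    # Map phase: apply the user's mapper to every input record.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Framework step: sort by key so that equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: hand each key and its grouped values to the reducer.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


result = run_job(["big data", "Big clusters", "data data"], map_words, reduce_counts)
print(result)  # [('big', 2), ('clusters', 1), ('data', 3)]
```

Only the two small functions are user code; everything in `run_job` is the part a framework would own, which is why re-executing a failed task is safe as long as the user functions are deterministic.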

Page 4: Next Generation of Hadoop MapReduce

Case Study: Yahoo Front Page

Personalized for each visitor

Result: twice the engagement

- +160% clicks vs. one size fits all

- +79% clicks vs. randomly selected

- +43% clicks vs. editor selected

Modules shown: Recommended links, News Interests, Top Searches

Page 5: Next Generation of Hadoop MapReduce

Hadoop MapReduce Today

JobTracker

- Manages cluster resources and job scheduling

TaskTracker

- Per-node agent

- Manages tasks

Page 6: Next Generation of Hadoop MapReduce

Current Limitations

Scalability

- Maximum Cluster size – 4,000 nodes

- Maximum concurrent tasks – 40,000

- Coarse synchronization in JobTracker

Single point of failure

- Failure kills all queued and running jobs

- Jobs need to be re-submitted by users

Restart is very tricky due to complex state

Hard partition of resources into map and reduce slots

Page 7: Next Generation of Hadoop MapReduce

Current Limitations

Lacks support for alternate paradigms

- Iterative applications implemented using MapReduce are 10x slower.

- Users use MapReduce to run arbitrary code

- Example: K-Means, PageRank

Lack of wire-compatible protocols

- Client and cluster must be of same version

- Applications and workflows cannot migrate to different clusters

Page 8: Next Generation of Hadoop MapReduce

MapReduce Requirements for 2011

Reliability

Availability

Scalability

- Clusters of 6,000 machines

- Each machine with 16 cores, 48 GB RAM, 24 TB disks

- 100,000 concurrent tasks

- 10,000 concurrent jobs

Wire Compatibility

Agility & Evolution – Ability for customers to control upgrades to the grid software stack.
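A quick back-of-the-envelope check of what the scalability targets above add up to. The constants come from the slide; the sketch assumes "24 TB disks" means 24 TB of disk per machine, uses decimal units, and reports raw capacity before replication.

```python
# Figures taken from the 2011 requirements slide.
MACHINES = 6_000
CORES_PER_MACHINE = 16
RAM_GB_PER_MACHINE = 48
DISK_TB_PER_MACHINE = 24  # assumption: total disk per machine
TARGET_TASKS = 100_000

total_cores = MACHINES * CORES_PER_MACHINE              # 96,000 cores
total_ram_tb = MACHINES * RAM_GB_PER_MACHINE / 1_000    # 288 TB of RAM
total_disk_pb = MACHINES * DISK_TB_PER_MACHINE / 1_000  # 144 PB raw disk
tasks_per_machine = TARGET_TASKS / MACHINES             # about 16.7
tasks_per_core = TARGET_TASKS / total_cores             # about 1.04

print(total_cores, total_ram_tb, total_disk_pb)
print(round(tasks_per_machine, 1), round(tasks_per_core, 2))
```

Under these assumptions, the target of 100,000 concurrent tasks works out to roughly one task per core across the cluster.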

Page 9: Next Generation of Hadoop MapReduce

MapReduce – Design Focus

Split up the two major functions of JobTracker

- Cluster resource management

- Application life-cycle management

MapReduce becomes user-land library

Page 10: Next Generation of Hadoop MapReduce

Architecture

Page 11: Next Generation of Hadoop MapReduce

Architecture

Resource Manager

- Global resource scheduler

- Hierarchical queues

Node Manager

- Per-machine agent

- Manages the life-cycle of containers

- Container resource monitoring

Application Master

- Per-application

- Manages application scheduling and task execution

- E.g. MapReduce Application Master
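The division of labour between the three components can be sketched as a toy model. The class and method names below are hypothetical and do not reflect the real Hadoop interfaces; memory is the only resource tracked, and placement is simple first-fit with no queues or locality.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-machine agent: tracks free memory and the containers it runs."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mem_mb):
        # Reserve the memory and record the container on this node.
        self.free_mb -= mem_mb
        container = (app_id, self.name, mem_mb)
        self.containers.append(container)
        return container


class ResourceManager:
    """Global scheduler: decides which node satisfies a request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_mb):
        # First-fit placement; hierarchical queues are left out of this sketch.
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                return node.launch(app_id, mem_mb)
        return None  # no node can fit a container of this size


class ApplicationMaster:
    """Per-application: requests containers and tracks its own tasks."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_count, mem_mb):
        granted = []
        for _ in range(task_count):
            container = self.rm.allocate(self.app_id, mem_mb)
            if container is None:
                break  # a real master would wait and retry
            granted.append(container)
        return granted


nodes = [NodeManager("node1", 4096), NodeManager("node2", 2048)]
rm = ResourceManager(nodes)
am = ApplicationMaster("wordcount", rm)
granted = am.run_tasks(task_count=4, mem_mb=2048)
print(granted)  # three containers granted; the fourth request does not fit
```

The point of the split is visible in the code: the `ResourceManager` knows nothing about tasks, and all job-specific bookkeeping lives in the per-application master.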

Page 12: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Scalability

- Application life-cycle management is very expensive

- Partition resource management and application life-cycle management

- Application management is distributed

- Hardware trends

  • Machines are getting bigger and faster

  • Moving toward twelve 2 TB disks instead of four 1 TB disks

  • Enables more tasks per machine

Page 13: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Availability

- Application Master

  • Optional failover via application-specific checkpoint

  • MapReduce applications pick up where they left off

- Resource Manager

  • No single point of failure - failover via ZooKeeper

  • Application Masters are restarted automatically
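One way to picture "pick up where they left off" is a master that records completed tasks in a checkpoint and skips them after a restart. This is a minimal sketch under assumptions: `CheckpointingMaster`, the JSON checkpoint file, and the `fail_after` crash switch are all hypothetical, and the slides do not describe the actual checkpoint format.

```python
import json
import os
import tempfile


class CheckpointingMaster:
    """Toy application master that persists the set of finished tasks."""

    def __init__(self, checkpoint_path, tasks):
        self.checkpoint_path = checkpoint_path
        self.tasks = list(tasks)
        self.done = self._load()

    def _load(self):
        # A restarted master starts from whatever the last one saved.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                return set(json.load(f))
        return set()

    def _save(self):
        with open(self.checkpoint_path, "w") as f:
            json.dump(sorted(self.done), f)

    def run(self, fail_after=None):
        executed = []
        for task in self.tasks:
            if task in self.done:
                continue  # finished before the restart, so skip it
            if fail_after is not None and len(executed) == fail_after:
                raise RuntimeError("simulated Application Master crash")
            executed.append(task)
            self.done.add(task)
            self._save()  # checkpoint after every completed task
        return executed


path = os.path.join(tempfile.mkdtemp(), "am-checkpoint.json")
tasks = ["map-0", "map-1", "map-2", "reduce-0"]

try:
    CheckpointingMaster(path, tasks).run(fail_after=2)
except RuntimeError:
    pass  # the first master died after finishing two tasks

resumed = CheckpointingMaster(path, tasks).run()
print(resumed)  # ['map-2', 'reduce-0']
```

Because the checkpoint is application-specific, a different kind of application could store something else entirely, or opt out and simply start over.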

Page 14: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Wire Compatibility

- Protocols are wire-compatible

- Old clients can talk to new servers

- Evolution toward rolling upgrades
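The slides do not say how wire compatibility is achieved. One common approach, shown here purely as an illustration, is for a server to supply defaults for fields an old client does not send and to ignore fields it does not recognise. The message layout and the `decode_submit_request` function are hypothetical, and JSON is used only to keep the sketch self-contained.

```python
import json

# Fields added in later protocol versions, with the defaults a server
# applies when an older client omits them (hypothetical field names).
DEFAULTS = {"queue": "default", "priority": 0}
REQUIRED = ("job_name",)


def decode_submit_request(raw):
    """Decode a request, tolerating both missing and unknown fields."""
    msg = json.loads(raw)
    for name in REQUIRED:
        if name not in msg:
            raise ValueError(f"missing required field: {name}")
    known = set(REQUIRED) | set(DEFAULTS)
    request = dict(DEFAULTS)
    # Keep only fields this server version understands; drop the rest.
    request.update({k: v for k, v in msg.items() if k in known})
    return request


old_client = json.dumps({"job_name": "wordcount"})
new_client = json.dumps({"job_name": "wordcount", "priority": 5, "future_flag": True})

old_request = decode_submit_request(old_client)
new_request = decode_submit_request(new_client)
print(old_request)
print(new_request)
```

With this kind of tolerance on both sides, servers can be upgraded one at a time while clients of mixed versions keep working, which is the precondition for rolling upgrades.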

Page 15: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Innovation and Agility

- MapReduce now becomes a user-land library

- Multiple versions of MapReduce can run in the same cluster (à la Apache Pig)

• Faster deployment cycles for improvements

- Customers upgrade MapReduce versions on their schedule

- Users can use customized MapReduce versions without affecting everyone!

Page 16: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Utilization

- Generic resource model

  • Memory

  • CPU

  • Disk b/w

  • Network b/w

- Remove fixed partition of map and reduce slots
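A small numeric sketch of why removing the fixed map/reduce partition helps utilization. The node size, slot counts, and task size are made-up illustrative values, and only memory is modelled; CPU, disk, and network bandwidth are left out.

```python
def running_with_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Fixed partition: map tasks may only use map slots, reduces only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def running_with_generic(total_mb, task_mb, pending_maps, pending_reduces):
    """Generic model: any task may use any free memory on the node."""
    capacity = total_mb // task_mb
    return min(capacity, pending_maps + pending_reduces)


# Hypothetical node: 8 GB of memory, previously split into 4 map and
# 4 reduce slots of 1 GB each. The job is still in its map phase.
slots = running_with_slots(4, 4, pending_maps=8, pending_reduces=0)
generic = running_with_generic(8192, 1024, pending_maps=8, pending_reduces=0)
print(slots, generic)  # 4 8
```

In this example the slot model leaves half of the node idle during the map phase, while the generic model lets the pending map tasks use all of it.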

Page 17: Next Generation of Hadoop MapReduce

Improvements vis-à-vis current MapReduce

Support for programming paradigms other than MapReduce

- MPI

- Master-Worker

- Machine Learning and Iterative processing

- Enabled by paradigm-specific Application Master

- All can run on the same Hadoop cluster

Page 18: Next Generation of Hadoop MapReduce

Summary

Takes Hadoop to the next level

- Scale-out even further

- High availability

- Cluster Utilization

- Support for paradigms other than MapReduce