

Page 1: Hadoop - University of Wisconsin–Madisonpages.cs.wisc.edu/~akella/CS838/F12/notes/Hadoop-Yizheng.pdf · 2012-09-21 · Hadoop • What is Apache Hadoop – A framework (open‐source

Hadoop

Yizheng (Ethan) Chen

Advisor: Prof. Aditya Akella


Outline

Hadoop

Yarn (NextGen Hadoop)


Hadoop

• What is Apache Hadoop
– A framework (open‐source software) for reliable, scalable, distributed computing

• Hadoop MapReduce
– A system for parallel processing of large data sets

• Hadoop Distributed File System (HDFS™)
– A distributed file system that provides high‐throughput access to application data
– Similar to GFS
– http://hadoop.apache.org/

• Why Hadoop?


Hadoop Job Execution


Word Count

Input split 1: Hello World Bye World
Input split 2: Hello Hadoop Goodbye Hadoop

MapTask 1
– map: Hello 1, World 1, Bye 1, World 1
– sort: Bye 1, Hello 1, World 1, World 1
– combiner (local aggregation): Bye 1, Hello 1, World 2

MapTask 2
– map: Hello 1, Hadoop 1, Goodbye 1, Hadoop 1
– sort: Goodbye 1, Hadoop 1, Hadoop 1, Hello 1
– combiner (local aggregation): Goodbye 1, Hadoop 2, Hello 1

shuffle

ReduceTask 1 (keys: A‐G)
– from MapTask 1 output: Bye 1
– from MapTask 2 output: Goodbye 1
– merge/sort: Bye 1, Goodbye 1
– HDFS part0: Bye 1, Goodbye 1

ReduceTask 2 (keys: H‐Z)
– from MapTask 1 output: Hello 1, World 2
– from MapTask 2 output: Hadoop 2, Hello 1
– merge/sort: Hadoop 2, Hello 1, Hello 1, World 2
– HDFS part1: Hadoop 2, Hello 2, World 2
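The word count dataflow on this slide can be imitated in a few lines of plain Python. This is only a single-process sketch of the stages (map, sort, combiner, shuffle by key range, merge/sort, reduce), not Hadoop's API; the function names `map_task`, `combine`, `partition`, `reduce_task`, and `word_count` are made up for illustration.

```python
from itertools import groupby


def map_task(split):
    """Emit a (word, 1) pair for every word in the input split."""
    return [(word, 1) for word in split.split()]


def combine(pairs):
    """Sort pairs by key and sum the counts of equal keys (local aggregation)."""
    ordered = sorted(pairs)
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


def partition(key):
    """Route keys A-G to reducer 0 and keys H-Z to reducer 1, as on the slide."""
    return 0 if key[0].upper() <= "G" else 1


def reduce_task(pairs):
    """Merge/sort the pairs received from all map tasks and sum per key."""
    return combine(pairs)  # for word count, reduce is the same sort-and-sum


def word_count(splits):
    buckets = [[], []]  # one bucket per reduce task
    for split in splits:
        # Each map task sorts and combines its own output before the shuffle.
        for key, count in combine(map_task(split)):
            buckets[partition(key)].append((key, count))  # shuffle
    return [reduce_task(bucket) for bucket in buckets]


parts = word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
for i, part in enumerate(parts):
    print(f"part{i}:", part)
```

Running it on the two input splits from the slide yields the same two partitions as the figure: `Bye 1, Goodbye 1` for part0 and `Hadoop 2, Hello 2, World 2` for part1.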


Hadoop MapReduce


Hadoop Schedulers

• A pluggable framework for job scheduling algorithms, available since Hadoop 0.19
– FIFO
– Fair Scheduler (Facebook)
– Capacity Scheduler (Yahoo!)


FIFO scheduler

• Originally optimized for large batch jobs (web index construction)
• FIFO order + priority queues
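One way to picture "FIFO order + priority queues" is a heap keyed on (priority, submission order). The sketch below is a toy model under that assumption, not Hadoop's scheduler code; the class name `FifoScheduler`, its methods, and the numeric priority mapping are hypothetical.

```python
import heapq
import itertools

# Illustrative priority levels (assumed mapping); a lower number is served first.
PRIORITY = {"VERY_HIGH": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3, "VERY_LOW": 4}


class FifoScheduler:
    """Toy job queue: highest priority first, FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # submission counter breaks priority ties

    def submit(self, job_name, priority="NORMAL"):
        heapq.heappush(self._heap, (PRIORITY[priority], next(self._seq), job_name))

    def next_job(self):
        """Return the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = FifoScheduler()
sched.submit("index-build")            # NORMAL, submitted first
sched.submit("adhoc-query", "HIGH")    # jumps ahead of NORMAL jobs
sched.submit("log-scan")               # NORMAL, submitted after index-build

order = [sched.next_job() for _ in range(3)]
print(order)
```

The high-priority job runs first, and the two normal-priority jobs keep their submission order.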


Fair Scheduler


Capacity Scheduler


HDFS Architecture


Hadoop Ecosystem

HBase: BigTable‐like
Hive: Data summarization and ad hoc querying
Pig: A high‐level data‐flow language and execution framework for parallel computation
HCatalog: Table and storage management service (table abstraction of data)
Zookeeper: A high‐performance coordination service for distributed applications


Hadoop and Hadoop‐derived Distributions

https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know

* Cloudera Distribution Including Apache Hadoop (CDH)
* Greenplum HD (EMC)
* Hortonworks Data Platform (Yarn)
* MapR


Outline

Hadoop

Yarn (NextGen Hadoop)


Yarn (NextGen Hadoop)

Split up the two major functionalities of the JobTracker:
* Resource management
* Job scheduling/monitoring

ResourceManager:
* Scheduler: allocates resources to the various running applications (pluggable policy)
* ApplicationsManager: accepts job submissions and launches the first container for the ApplicationMaster


Resource Allocation

• The resource request understood by the Scheduler is of the form:

<priority, (hostname/rackname/*), capability, #containers>

• Scheduler API
– There is a single API between the Scheduler and the ApplicationMaster:

(List<Container> newContainers, List<ContainerStatus> containerStatuses) allocate(List<ResourceRequest> ask, List<Container> release)
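The request tuple and the single `allocate` call can be modelled roughly as follows. This is a toy Python sketch, not the YARN API: the `ToyScheduler` class, the field names, the "RELEASED" state string, the first-fit placement, the choice of MB as the capability unit, and the rule that a lower priority number is served first are all assumptions made for illustration. Rack names are not modelled; a location is either a hostname or `*`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    priority: int        # assumed: lower number is served first
    location: str        # hostname, or "*" for any host (racks not modelled)
    capability: int      # assumed unit: memory in MB per container
    num_containers: int


@dataclass(frozen=True)
class Container:
    container_id: int
    host: str
    capability: int


@dataclass(frozen=True)
class ContainerStatus:
    container_id: int
    state: str


class ToyScheduler:
    """First-fit scheduler exposing one allocate(ask, release) call."""

    def __init__(self, free_mb_per_host):
        self.free = dict(free_mb_per_host)
        self.live = {}
        self._next_id = 0

    def allocate(self, ask, release):
        statuses = []
        for c in release:  # return released capacity before granting new asks
            if self.live.pop(c.container_id, None) is not None:
                self.free[c.host] += c.capability
                statuses.append(ContainerStatus(c.container_id, "RELEASED"))
        new_containers = []
        for req in sorted(ask, key=lambda r: r.priority):
            for _ in range(req.num_containers):
                host = self._pick_host(req)
                if host is None:
                    break  # no capacity left for this request
                self.free[host] -= req.capability
                c = Container(self._next_id, host, req.capability)
                self._next_id += 1
                self.live[c.container_id] = c
                new_containers.append(c)
        return new_containers, statuses

    def _pick_host(self, req):
        candidates = list(self.free) if req.location == "*" else [req.location]
        for host in candidates:
            if self.free.get(host, 0) >= req.capability:
                return host
        return None


sched = ToyScheduler({"node1": 2048, "node2": 1024})
new, statuses = sched.allocate([ResourceRequest(1, "*", 1024, 3)], [])
print([c.host for c in new])

_, released = sched.allocate([], [new[0]])
print(released, sched.free)
```

The first call asks for three 1024 MB containers anywhere in the cluster; the second call releases one of them and asks for nothing, which shows how the same call carries both new requests and releases.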


Compact: O(cluster size)


HDFS Federation

Benefits:
* Namespace Scalability
* Performance
* Isolation
