Sky Agile Horizons Hadoop at Sky

Upload: amanda-clark

Post on 12-Jan-2016


Page 1: Sky Agile Horizons Hadoop at Sky. What is Hadoop? - Reliable, Scalable, Distributed Where did it come from? - Community + Yahoo! Where is it now? - Apache

Sky Agile Horizons: Hadoop at Sky

Page 2: Overview

• What is Hadoop?

- Reliable, Scalable, Distributed

• Where did it come from?

- Community + Yahoo!

• Where is it now?

- Apache Software Foundation

• Why is it called “Hadoop”?

Page 3: Who is using it?

To name just a few…

Page 4: Is it “production” ready?

This screengrab is from one of the Hadoop clusters at Facebook (May 2010)

Page 5: So, what does it give you?

Page 6: Just two things...

• Distributed Filesystem (HDFS)

- Name Node

- Data Node(s)

• Distributed Processing Infrastructure

- Job Tracker

- Task Tracker(s)

Page 7: HDFS - Overview

• Blocks

- 64 MB chunks (configurable)

• WORM (Write once, read many)

- NO EDITS

- NO APPENDS

• Replication

- 3 copies

- direct
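
The block and replication figures above can be turned into a quick sizing calculation. This is a minimal sketch, not part of the original deck: the function name `hdfs_footprint` is made up for illustration, and it assumes the final block of a file only takes up the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 64  # default chunk size quoted on the slide (configurable)
REPLICATION = 3     # copies of each block quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw MB stored across the cluster).

    Hypothetical helper for illustration only.
    """
    if file_size_mb < 0:
        raise ValueError("file size must not be negative")
    # A file is cut into fixed-size chunks; the last chunk may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different Data Nodes.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


print(hdfs_footprint(1000))  # (16, 3000)
```

So a 1000 MB file becomes 16 blocks and uses roughly 3000 MB of raw disk across the cluster.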

Page 8: HDFS - Read

[Diagram: Client, Name Node and Data Nodes, with control / monitoring links. Blocks 1 to 4 each appear on three Data Nodes.]

1. Get Metadata (Client to Name Node)

2. Fetch Blocks (Client to Data Nodes)
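
The two read steps can be sketched as a toy in-memory model. Everything here (the `NameNode` and `DataNode` classes, `read_file`, the node names and the file path) is hypothetical and only mirrors the shape of the diagram. It is not the Hadoop client API.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where copies live."""

    def __init__(self):
        self.files = {}  # path -> list of (block_id, [data node names])

    def get_metadata(self, path):
        return self.files[path]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes
        self.alive = True

    def read_block(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[block_id]


def read_file(path, name_node, data_nodes):
    data = b""
    # Step 1: ask the Name Node for the block list and replica locations.
    for block_id, locations in name_node.get_metadata(path):
        # Step 2: fetch each block straight from a Data Node,
        # falling back to the next replica if one is unreachable.
        for node_name in locations:
            try:
                data += data_nodes[node_name].read_block(block_id)
                break
            except ConnectionError:
                continue
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


# Illustrative cluster: four Data Nodes, one file of two blocks, three copies each.
nodes = {n: DataNode(n) for n in ("dn1", "dn2", "dn3", "dn4")}
nn = NameNode()
nn.files["/demo.txt"] = [(1, ["dn1", "dn2", "dn3"]), (2, ["dn2", "dn3", "dn4"])]
chunks = {1: b"hello ", 2: b"world"}
for block_id, locations in nn.files["/demo.txt"]:
    for node_name in locations:
        nodes[node_name].blocks[block_id] = chunks[block_id]

nodes["dn1"].alive = False  # losing one node does not lose the file
print(read_file("/demo.txt", nn, nodes))  # b'hello world'
```

The point of the design is that file data never passes through the Name Node. It only answers the metadata question, which is why replicated blocks let a read survive a failed Data Node.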

Page 9: HDFS - Write

[Diagram: Client holding blocks 1 to 3, Name Node, and Data Nodes holding copies of blocks 1, 2 and 3, with control / monitoring links.]

1. Create Metadata (Client to Name Node)

2. Put Blocks (Client to Data Nodes)
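
The write path can be sketched the same way. This is a simplified illustration under stated assumptions: the `NameNode` class, `write_file` and the round-robin placement are invented for the example, and the client loop that stores every copy stands in for whatever replication mechanism the cluster really uses.

```python
import itertools


class NameNode:
    """Toy metadata service: allocates block ids and picks target Data Nodes."""

    def __init__(self, node_names, replication=3):
        self.node_names = list(node_names)
        self.replication = replication
        self.files = {}  # path -> list of (block_id, [target node names])
        self._next_block = itertools.count(1)
        self._offset = 0

    def create(self, path, n_blocks):
        if path in self.files:
            # Write once, read many: no edits and no appends.
            raise FileExistsError(path)
        layout = []
        for _ in range(n_blocks):
            # Round-robin placement is an assumption made for this sketch.
            targets = [
                self.node_names[(self._offset + i) % len(self.node_names)]
                for i in range(self.replication)
            ]
            self._offset += 1
            layout.append((next(self._next_block), targets))
        self.files[path] = layout
        return layout


def write_file(path, data, name_node, stores, block_size):
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Step 1: create the metadata on the Name Node.
    layout = name_node.create(path, len(chunks))
    # Step 2: put each block on the Data Nodes chosen for it.
    for (block_id, targets), chunk in zip(layout, chunks):
        for node in targets:
            stores[node][block_id] = chunk
    return layout


stores = {n: {} for n in ("dn1", "dn2", "dn3", "dn4")}  # node name -> {block_id: bytes}
nn = NameNode(stores)
layout = write_file("/demo.txt", b"abcdefghij", nn, stores, block_size=4)
print(layout)
```

With a 4-byte block size the 10-byte payload becomes three blocks, each stored on three of the four nodes. A second `write_file` call for the same path raises `FileExistsError`, which is the WORM rule in miniature.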

Page 10: Distributed Processing

• Slots

- X mapper slots, Y reducer slots (per node)

• Jobs

- Queued

- Prioritised

• Tasks

- Data-aware
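
Slots, a prioritised job queue and data-aware tasks fit together roughly as sketched below. This is an illustrative model, not the Job Tracker's real algorithm: the `schedule` function, the job names, the block ids and the node names are all hypothetical, and only mapper slots are modelled.

```python
import heapq


def schedule(jobs, free_slots, block_locations):
    """Assign map tasks to nodes, preferring nodes that already hold the data.

    jobs: list of (priority, job_name, [block ids]); a lower number runs first.
    free_slots: dict of node -> free mapper slots.
    block_locations: dict of block id -> nodes holding a copy.
    Returns a list of (job_name, block_id, node, is_local).
    """
    queue = list(jobs)
    heapq.heapify(queue)      # jobs are queued and prioritised
    slots = dict(free_slots)  # work on a copy, leave the caller's dict alone
    assignments = []
    while queue:
        _, job, blocks = heapq.heappop(queue)
        for block in blocks:
            # Data-aware: first look for a free slot next to a copy of the block.
            local = [n for n in block_locations.get(block, []) if slots.get(n, 0) > 0]
            candidates = local or [n for n, free in slots.items() if free > 0]
            if not candidates:
                return assignments  # every slot is busy; remaining tasks wait
            node = max(candidates, key=lambda n: slots[n])
            slots[node] -= 1
            assignments.append((job, block, node, bool(local)))
    return assignments


jobs = [(2, "nightly-report", ["b1", "b2"]), (1, "urgent-fix", ["b3"])]
free_slots = {"node-a": 1, "node-b": 2}
block_locations = {"b1": ["node-a"], "b2": ["node-a"], "b3": ["node-a", "node-b"]}

for assignment in schedule(jobs, free_slots, block_locations):
    print(assignment)
```

In this run the higher-priority job is placed first, two of the three tasks run on a node that holds their block, and the last task falls back to a remote node because the local slot is already taken.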

Page 11: Distributed Processing

[Diagram: Client, Job Tracker and Task Trackers, with control / monitoring links. Five Task Trackers are shown, each with four map (M) slots and two reduce (R) slots.]

1. Set up Job (Client to Job Tracker)

Page 12: Implementation

• Two modes of operation

- Standalone: Name Node, Data Node, Job Tracker, Task Tracker

- Master / Slaves: the Master runs the Name Node and Job Tracker; each Slave (six shown) runs a Data Node and a Task Tracker

Page 13: Building upon the basics

Page 14: Sub-projects

• Map/Reduce – divide & conquer

• Pig – SQL-like “Pig Latin”

• HBase – column-based database

• Hive – data-warehousing (SQL-like queries)

• Mahout – distributed algorithms

Page 15: Map/Reduce

• Java-based

- Key,Value input, Key,Value output(s)

• Intended for low-level / bespoke work

[Diagram: Start, six map (M) tasks and five reduce (R) tasks, End.]
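
The key/value contract can be illustrated with a word count run entirely in memory. Real Hadoop jobs are written against the Java API. This sketch uses Python only to keep the example short, and `run_job`, `map_words` and `reduce_counts` are names made up for it.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: (line offset, line text) in, zero or more (word, 1) pairs out."""
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: (word, [1, 1, ...]) in, (word, total) out."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step
    results = []
    for key in sorted(groups):  # reducers see keys in sorted order
        results.extend(reducer(key, groups[key]))
    return results


records = [(0, "the quick brown fox"), (20, "the lazy dog")]
print(run_job(records, map_words, reduce_counts))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

On a cluster the framework runs many copies of the map function in parallel, one per block, and routes each key to a single reducer. The two user-supplied functions keep the same shape.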

Page 16: Hive

• SQL-like syntax, Map/Reduce under the hood

• Client-only software

[Diagram: a Query passes through a chain of map (M) and reduce (R) stages, four M R pairs shown, and produces Results.]
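
To make "Map/Reduce under the hood" concrete, here is one way a SQL-like query could break down into chained map/reduce stages. It is a hand-written sketch and does not show how Hive actually plans queries. The query, the `viewing` table, the column names and every function below are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical query this sketch imitates:
QUERY = "SELECT channel, COUNT(*) AS views FROM viewing GROUP BY channel ORDER BY views DESC"


def map_reduce(records, mapper, reducer):
    """One in-memory map/reduce stage: map, group by key, reduce in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


# Stage 1 covers GROUP BY channel with COUNT(*).
def group_map(row):
    yield row["channel"], 1


def group_reduce(channel, ones):
    yield channel, sum(ones)


# Stage 2 covers ORDER BY views DESC, using the sorted shuffle key to do the ordering.
def order_map(pair):
    channel, views = pair
    yield -views, channel


def order_reduce(negative_views, channels):
    for channel in sorted(channels):
        yield channel, -negative_views


rows = [{"channel": c} for c in ("news", "sport", "sport", "movies", "sport", "news")]
stage1 = map_reduce(rows, group_map, group_reduce)
results = map_reduce(stage1, order_map, order_reduce)
print(results)  # [('sport', 3), ('news', 2), ('movies', 1)]
```

The output of one stage becomes the input of the next, which matches the chain of M R pairs between Query and Results in the diagram.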

Page 17: Live Demo

Page 18: Lastly, a word of warning...

• It’s not a magic bullet…

• If the tools you need don’t exist…

• Approach is everything…

• Hadoop is *just* the framework

Page 19: Thank you!

Questions?

http://cotdp.com/hadoop.html

- Soft-copy of this presentation

- VM image available to download

- Example code is on GitHub