An Intro to Hadoop

Hadoop: Divide and conquer gigantic data | Matthew McCullough, Ambient Ideas, LLC

Upload: matthew-mccullough

Post on 01-Nov-2014


Category:

Technology



DESCRIPTION

Moore’s law has finally hit the wall, and CPU speeds have actually decreased in the last few years. The industry is reacting with hardware that has an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to easily take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is considered the absolute gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is currently being leveraged in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook and Hulu, just to name a few. In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the incredible auto-replicating, redundant and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk’s conclusion, you’ll feel equipped to take on any massive data set and processing your employer can throw at you with absolute ease.

TRANSCRIPT

Page 1: An Intro to Hadoop

Hadoop: Divide and conquer gigantic data

© Matthew McCullough, Ambient Ideas, LLC

Page 2: An Intro to Hadoop

Talk Metadata

Twitter: @matthewmccull #HadoopIntro

Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

Page 3: An Intro to Hadoop
Page 4: An Intro to Hadoop

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004

Page 5: An Intro to Hadoop

Page 6: An Intro to Hadoop

Page 7: An Intro to Hadoop

Page 8: An Intro to Hadoop

MapReduce history

A programming model and implementation for processing and generating large data sets

Page 9: An Intro to Hadoop
Page 10: An Intro to Hadoop

MapReduce implementation

Founded by …

Open source at …

Origins

Page 11: An Intro to Hadoop

Today

Version …
Many companies contributing
…s of companies using

Page 12: An Intro to Hadoop

Today

Page 13: An Intro to Hadoop

Today

Page 14: An Intro to Hadoop

Why Hadoop?

Page 15: An Intro to Hadoop

$74.85

Page 16: An Intro to Hadoop

$74.85

4gb

Page 17: An Intro to Hadoop

$74.85

1tb

4gb

Page 18: An Intro to Hadoop

vs

Page 19: An Intro to Hadoop

vs $10,000

$1,000

Page 20: An Intro to Hadoop

vs

Page 21: An Intro to Hadoop

vs

Buy our way out of Failure

Failure is inevitable

Go Cheap

Page 22: An Intro to Hadoop

Bzzzt!

Crrrkt!

Sproinnnng!

Page 23: An Intro to Hadoop

RDBMS

Structured

OLTP

Page 24: An Intro to Hadoop

Log files

RDBMS

Structured

Unstructured

OLTP

Images

Relations spanning DBs

Documents

Page 25: An Intro to Hadoop

NOSQL

Page 26: An Intro to Hadoop

NOSQL

Death of the RDBMS is a lie

Page 27: An Intro to Hadoop

NOSQL

Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes

Page 28: An Intro to Hadoop

Applications

Page 29: An Intro to Hadoop

Applications

Protein folding (pharmaceuticals)

Page 30: An Intro to Hadoop

Applications

Protein folding (pharmaceuticals)
Search engines

Page 31: An Intro to Hadoop

Applications

Protein folding (pharmaceuticals)
Search engines
Sorting

Page 32: An Intro to Hadoop

Applications

Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)

Page 33: An Intro to Hadoop

Applications

Page 34: An Intro to Hadoop

Applications

Price search

Page 35: An Intro to Hadoop

Applications

Price search
Steganography

Page 36: An Intro to Hadoop

Applications

Price search
Steganography
Analytics

Page 37: An Intro to Hadoop

Applications

Price search
Steganography
Analytics
Primes, code breaking

Page 38: An Intro to Hadoop

Particle Physics

Page 39: An Intro to Hadoop

Particle Physics

Large Hadron Collider

Page 40: An Intro to Hadoop

Particle Physics

Large Hadron Collider
… petabytes of data per year

Page 41: An Intro to Hadoop

Financial Trends

Page 42: An Intro to Hadoop

Financial Trends

Daily trade performance analysis

Page 43: An Intro to Hadoop

Financial Trends

Daily trade performance analysis
Market trending

Page 44: An Intro to Hadoop

Financial Trends

Daily trade performance analysis
Market trending

Uses employee desktops during off hours

Page 45: An Intro to Hadoop

Financial Trends

Daily trade performance analysis
Market trending

Uses employee desktops during off hours
Fiscally responsible, economical

Page 46: An Intro to Hadoop

Contextual Ads

Page 47: An Intro to Hadoop

Contextual Ads

Shopper (Customer …) looking at Product X
Customer … also bought Y
Customer … also bought Y
Customer … did not buy
Shopper is told that …% of other customers who bought X also bought Y
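The diagram above is a co-occurrence recommendation: of the customers who bought product X, what fraction also bought product Y? A minimal sketch of that calculation, using hypothetical purchase data (the customer names and numbers here are illustrative, not from the talk):

```python
from collections import Counter

# Hypothetical purchase histories: customer -> set of products bought.
purchases = {
    "customer_a": {"X"},
    "customer_b": {"X", "Y"},
    "customer_c": {"X", "Y"},
    "customer_d": {"Z"},
}

def also_bought(product, purchases):
    """Of the customers who bought `product`, what % also bought each other product?"""
    buyers = [items for items in purchases.values() if product in items]
    counts = Counter(p for items in buyers for p in items if p != product)
    return {p: round(100 * n / len(buyers)) for p, n in counts.items()}

print(also_bought("X", purchases))  # {'Y': 67}
```

At web scale this same counting is exactly the kind of embarrassingly parallel aggregation that MapReduce handles: map over purchase records emitting co-purchase pairs, then reduce by product to get the percentages.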

Page 48: An Intro to Hadoop

Hadoop Family

Page 49: An Intro to Hadoop

Pig, Hive, Core, Common

Chukwa, HBase, HDFS, ZooKeeper

Hadoop Components

Page 50: An Intro to Hadoop

the Players

Page 51: An Intro to Hadoop

the PlayAs

Page 52: An Intro to Hadoop

ZooKeeper, Hive

Chukwa, Common, HBase

HDFS

the PlayAs

Page 53: An Intro to Hadoop

MapReduce

Page 54: An Intro to Hadoop

MapReduce: map, then… um… reduce

Page 55: An Intro to Hadoop

The process

Page 56: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map

Page 57: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)

Page 58: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key

Page 59: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group

Page 60: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)

Page 61: An Intro to Hadoop

Start

At four years old I acted out
Glad I am not four years old
Yet I still act like someone four

Page 62: An Intro to Hadoop

Map

at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four

Page 63: An Intro to Hadoop

Grouping

@FE24E65

!

@=5 J62CD

7@FC2E

@=5

J62CD

7@FC?@E

2>

!8=25

7@FC

D@>6@?6=:<624EDE:==

!

J6E

Page 64: An Intro to Hadoop

Reduce

out → 1
acted → 1
years → 2
four → 3
at → 1
old → 2
not → 1
am → 1
I → 3
glad → 1
someone → 1
like → 1
act → 1
still → 1
yet → 1
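The three slides above (Map, Grouping, Reduce) can be sketched in a few lines. This is a single-process simulation of the word-count flow, not Hadoop itself: map emits (word, 1) pairs from each sentence, the shuffle groups pairs by key, and reduce sums each group independently:

```python
from itertools import groupby

sentences = [
    "at four years old I acted out",
    "glad I am not four years old",
    "yet I still act like someone four",
]

# Map: each sentence is an independent candidate; emit (word, 1) pairs.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Shuffle/group: collect pairs from all map outputs and group them by key.
pairs = sorted(p for line in sentences for p in map_fn(line))
groups = {k: [v for _, v in grp] for k, grp in groupby(pairs, key=lambda p: p[0])}

# Reduce: runs in parallel on each group in real Hadoop; here, just sum the 1s.
counts = {k: sum(vs) for k, vs in groups.items()}
print(counts["four"], counts["I"], counts["years"], counts["out"])  # 3 3 2 1
```

In a real Hadoop job the map and reduce functions are the only parts you write; the framework does the partitioning, shuffling and re-execution on failure.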

Page 65: An Intro to Hadoop

MapReduce Demo

Page 66: An Intro to Hadoop

HDFS

Page 67: An Intro to Hadoop
Page 68: An Intro to Hadoop

HDFS Basics

Page 69: An Intro to Hadoop

HDFS Basics

Based on the Google File System (GFS) design

Page 70: An Intro to Hadoop

HDFS Basics

Based on the Google File System (GFS) design
Replicated data store

Page 71: An Intro to Hadoop

HDFS Basics

Based on the Google File System (GFS) design
Replicated data store
Stored in 64 MB blocks
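The two ideas on this slide, fixed-size blocks and replication, can be sketched conceptually. This is not the real HDFS placement policy (which is rack-aware), just a toy round-robin illustration of splitting a file into 64 MB blocks and assigning each block to three distinct datanodes:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default at the time
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (start, end) byte range of each block of a file."""
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    return {block: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            for i, block in enumerate(blocks)}

blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
plan = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
print(len(blocks))  # 4 blocks: 64 + 64 + 64 + 8 MB
```

Because every block lives on several nodes, losing any single machine loses no data, which is what lets Hadoop "go cheap" on commodity hardware.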

Page 72: An Intro to Hadoop

Data Overload

Page 73: An Intro to Hadoop

Data Overload

Destination for overflow from traditional RDBMS or log files

Page 74: An Intro to Hadoop

Data Overload

56DE:?2E:@?�

@G6C7=@H�7C@>EC25:E:@?2=�)��$*�@C�=@8�

7:=6D

Page 75: An Intro to Hadoop

Data Overload

56DE:?2E:@?�

@G6C7=@H�7C@>EC25:E:@?2=�)��$*�@C�=@8�

7:=6D

Page 76: An Intro to Hadoop

Data Overload

56DE:?2E:@?�

@G6C7=@H�7C@>EC25:E:@?2=�)��$*�@C�=@8�

7:=6D

Page 77: An Intro to Hadoop
Page 78: An Intro to Hadoop

HDFS Demo

Page 79: An Intro to Hadoop

Pig

Page 80: An Intro to Hadoop

Pig Basics

Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs: Pig Latin

Page 81: An Intro to Hadoop

Pig Sample

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
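To see what the Pig script does, here is the same pipeline in plain Python: split a colon-delimited passwd-style file into fields, project the first field as the id, and output it (the sample lines are illustrative):

```python
passwd_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/bin/sh",
]

# load 'passwd' using PigStorage(':')  -> split each record on ':'
records = [line.split(":") for line in passwd_lines]

# foreach A generate $0 as id          -> keep only the first field
ids = [fields[0] for fields in records]

# dump B                               -> print the result
print(ids)  # ['root', 'daemon']
```

The difference is that Pig compiles this description into MapReduce jobs that run across the cluster, so the same four lines work unchanged on terabytes of input.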

Page 82: An Intro to Hadoop

Pig demo

Page 83: An Intro to Hadoop

Hive

Page 84: An Intro to Hadoop

Hive Basics

Authored by Facebook
SQL interface to Hadoop
Hive is low level
Hive-specific metadata

Page 85: An Intro to Hadoop

Sqoop

Sqoop, by Cloudera, is higher level

sqoop --connect jdbc:mysql://database.example.com/...

Page 86: An Intro to Hadoop

HBase

Page 87: An Intro to Hadoop

HBase Basics

Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
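The BigTable-style data model the slide refers to can be sketched in a few lines. This toy class is not the HBase API, just an illustration of its shape: rows are keyed, values live under family:qualifier column names, and every write is timestamped so reads return the newest version by default:

```python
from collections import defaultdict

class MiniColumnStore:
    """Toy sketch of an HBase/BigTable-style versioned column store."""

    def __init__(self):
        # row key -> column ("family:qualifier") -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        self.rows[row][column].append((ts, value))

    def get(self, row, column):
        """Return the newest value for row/column, or None if absent."""
        versions = self.rows[row][column]
        return max(versions)[1] if versions else None

store = MiniColumnStore()
store.put("row1", "info:name", "Hadoop", ts=1)
store.put("row1", "info:name", "Apache Hadoop", ts=2)
print(store.get("row1", "info:name"))  # Apache Hadoop
```

In real HBase the sorted row keys are range-partitioned into regions served by different machines, with HDFS providing the replicated storage and ZooKeeper the coordination.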

Page 88: An Intro to Hadoop

HBase Demo

Page 89: An Intro to Hadoop


Page 90: An Intro to Hadoop
Page 91: An Intro to Hadoop

Amazon Elastic MapReduce

Hosted Hadoop clusters

Page 92: An Intro to Hadoop

Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing

Page 93: An Intro to Hadoop

Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing
Easy to set up

Page 94: An Intro to Hadoop

Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use

Page 95: An Intro to Hadoop

EMR Languages


Page 96: An Intro to Hadoop

EMR Languages


Page 97: An Intro to Hadoop

EMR Pricing

Page 98: An Intro to Hadoop

EMR Pricing

Page 99: An Intro to Hadoop

EMR Functions

RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds additional steps to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances

Page 100: An Intro to Hadoop

EMR Functions


Page 101: An Intro to Hadoop
Page 102: An Intro to Hadoop

Final Thoughts

Page 103: An Intro to Hadoop

Ha! Your Hadoop is slower than my Hadoop!

Shut up! I’m reducing.

Page 104: An Intro to Hadoop

The RDBMS is not dead
Has new friends, helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data

Page 105: An Intro to Hadoop

Does this Hadoop make my list look ugly?

Page 106: An Intro to Hadoop

Failure IS acceptable

Page 107: An Intro to Hadoop

Failure IS acceptable
Failure is inevitable

Page 108: An Intro to Hadoop

Failure IS acceptable
Failure is inevitable
Go cheap!

Page 109: An Intro to Hadoop

Failure IS acceptable
Failure is inevitable
Go cheap!
Go distributed!

Page 110: An Intro to Hadoop

Use Hadoop!

Page 111: An Intro to Hadoop

Thanks

Page 112: An Intro to Hadoop

Hadoop: Divide and conquer gigantic data

Page 113: An Intro to Hadoop

Contact

Twitter: @matthewmccull #HadoopIntro

matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

Page 114: An Intro to Hadoop

Credits

Page 115: An Intro to Hadoop

http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch
http://www.robinmajumdar.com/…/google-dalles-data-centre-has-serious-cooling-needs
http://www.greenm3.com/…/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/…/CERN…Tunnel….jpg
http://www.flickr.com/photos/mandj98/…
http://www.flickr.com/photos/…
http://www.flickr.com/photos/joits/…
http://www.flickr.com/photos/streetfly_jz/…
http://www.flickr.com/photos/sybrenstuvel/…
http://www.flickr.com/photos/lacklusters/…
http://www.flickr.com/photos/sybrenstuvel/…
http://www.flickr.com/photos/robryb/…
http://www.flickr.com/photos/mckaysavage/…
http://www.flickr.com/photos/robryb/…
All others: iStockPhoto.com

Page 116: An Intro to Hadoop

Friday, January 15, 2010

Page 119: An Intro to Hadoop

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay [email protected], [email protected]

Google, Inc.

AbstractMapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.

1 Introduction


Seminal paper on MapReduce
http://labs.google.com/papers/mapreduce.html

Friday, January 15, 2010

Page 120: An Intro to Hadoop

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis…

To appear in OSDI 2004



Page 121: An Intro to Hadoop

MapReduce history

“A programming model and implementation for processing and generating large data sets”


Page 122: An Intro to Hadoop

http://www.cern.ch/

Page 123: An Intro to Hadoop

MapReduce implementation

Founded by …

Open Source at …

Origins

http://en.wikipedia.org/wiki/Mapreduce

Page 124: An Intro to Hadoop

Today

Version …
Many companies contributing
…s of companies using

http://hadoop.apache.org

Page 125: An Intro to Hadoop

Today

http://wiki.apache.org/hadoop/PoweredBy

Page 126: An Intro to Hadoop

Today


Page 127: An Intro to Hadoop

Why Hadoop?


Page 128: An Intro to Hadoop

$74.85

1 TB

4 GB


Page 129: An Intro to Hadoop

$10,000

vs

$1,000


Page 130: An Intro to Hadoop

Buy our way out of Failure

vs

Failure is inevitable: Go Cheap

This concept doesn’t work well at weddings or dinner parties

Page 131: An Intro to Hadoop

Bzzzt!

Crrrkt!

Sproinnnng!

http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html


Page 132: An Intro to Hadoop

Structured: RDBMS, relations spanning DBs

Unstructured: log files, images, documents


Page 133: An Intro to Hadoop

NOSQL

“Death of the RDBMS” is a lie
Big data tools are solving different issues than RDBMSes

http://nosql.mypopescu.com/

Page 134: An Intro to Hadoop

Applications

Protein folding (pharmaceuticals)

Search engines
Sorting
Classification (government intelligence)


Page 135: An Intro to Hadoop

Applications

Price search
Steganography
Analytics
Primes
Code breaking


Page 136: An Intro to Hadoop

Particle Physics

Large Hadron Collider
… petabytes of data per year

http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg


Page 137: An Intro to Hadoop

Financial Trends

Daily trade performance analysis
Market trending

Uses employee desktops during off hours
Fiscally responsible, economical


Page 138: An Intro to Hadoop

Contextual Ads

Shopper: Customer A, looking at Product X, is told that … of other customers who bought X also bought Y

Customer B also bought Y
Customer C also bought Y
Customer D did not buy


Page 139: An Intro to Hadoop

Hadoop Family


Page 140: An Intro to Hadoop

Pig, Hive, Core (Common)
Chukwa, HBase, HDFS, ZooKeeper

Hadoop Components


Page 141: An Intro to Hadoop

ZooKeeper, Hive
Chukwa, Common, HBase
HDFS

the Players / the PlayAs

http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/


Page 142: An Intro to Hadoop

MapReduce

map, then… um… reduce


Page 143: An Intro to Hadoop

The process

Every item in dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
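The steps above can be sketched in plain Python with no Hadoop involved. This is only an illustration of the model: the function names and the in-memory sort-based shuffle are my own, not Hadoop's API. The sample sentences are the ones used on the Map, Grouping, and Reduce slides that follow.

```python
# Word count via the MapReduce model, in plain Python (illustrative only).
from itertools import groupby

def map_phase(record):
    # Map(k1, v1) -> list(k2, v2): emit a (word, 1) pair for every word.
    return [(word.lower(), 1) for word in record.split()]

def reduce_phase(key, values):
    # Reduce(k2, list(v2)) -> v3: here, sum the counts for one word.
    return key, sum(values)

def map_reduce(records):
    # Every record is an independent (parallelizable) candidate for Map.
    pairs = [pair for record in records for pair in map_phase(record)]
    # Shuffle: collect and group pairs from all lists by key.
    pairs.sort(key=lambda kv: kv[0])
    groups = groupby(pairs, key=lambda kv: kv[0])
    # Reduce runs independently on each group.
    return dict(reduce_phase(key, [v for _, v in group]) for key, group in groups)

counts = map_reduce([
    "At four years old I acted out",
    "Glad I am not four years old",
    "Yet I still act like someone four",
])
print(counts["four"], counts["years"], counts["i"])  # prints: 3 2 3
```

On a real cluster the map calls run on many machines, the shuffle moves pairs across the network, and the reduces run in parallel, but the shape of the computation is the same.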


Page 144: An Intro to Hadoop

Map

Start:
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four

Map output (one token per word):
at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four


Page 145: An Intro to Hadoop

Grouping

I → I, I, I
four → four, four, four
years → years, years
old → old, old
out, acted, at, not, am, glad, someone, like, act, still, yet → one each


Page 146: An Intro to Hadoop

Reduce

out: 1
acted: 1
years: 2
four: 3
at: 1
old: 2
not: 1
am: 1
I: 3
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1


Page 147: An Intro to Hadoop

MapReduce Demo


Page 148: An Intro to Hadoop

HDFS


Page 149: An Intro to Hadoop


Page 150: An Intro to Hadoop

HDFS Basics

Based on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
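A quick back-of-the-envelope sketch of what those defaults mean in practice, assuming the classic 64 MB block size and a replication factor of 3. This is arithmetic about HDFS-style storage, not an HDFS API:

```python
# How a file is carved into HDFS-style blocks, assuming classic defaults.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks
REPLICATION = 3                # each block is stored on 3 different nodes

def blocks_needed(file_size_bytes):
    # Ceiling division: the last block may be only partially full.
    return -(-file_size_bytes // BLOCK_SIZE)

def raw_bytes_on_cluster(file_size_bytes):
    # Replication multiplies the raw disk a file consumes cluster-wide.
    return file_size_bytes * REPLICATION

one_gb = 1024 ** 3
print(blocks_needed(one_gb))                   # prints: 16
print(raw_bytes_on_cluster(one_gb) // one_gb)  # prints: 3
```

The large block size is deliberate: fewer, bigger blocks keep metadata small and favor the long sequential reads that MapReduce jobs perform.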


Page 151: An Intro to Hadoop

Data Overload

Destination for overflow from traditional RDBMS or log files


Page 152: An Intro to Hadoop


Page 153: An Intro to Hadoop

HDFS Demo


Page 154: An Intro to Hadoop

Pig


Page 155: An Intro to Hadoop

Pig Basics

Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs: Pig Latin

http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code


Page 156: An Intro to Hadoop

Pig Sample

A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
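For readers who don't speak Pig Latin yet, here is roughly what that script does, sketched in plain Python against two hypothetical passwd-style records (the Pig statements appear as comments; the final `store` is omitted):

```python
# What the Pig script above does, sketched in plain Python (illustrative).
passwd_lines = [
    "root:x:0:0:root:/root:/bin/bash",       # hypothetical sample records
    "daemon:x:1:1:daemon:/usr/sbin:/bin/sh",
]

# A = load 'passwd' using PigStorage(':');  -- split each line on ':'
records = [line.split(":") for line in passwd_lines]

# B = foreach A generate $0 as id;          -- project the first field
ids = [fields[0] for fields in records]

# dump B;                                   -- print the relation
print(ids)  # prints: ['root', 'daemon']
```

The difference, of course, is that Pig compiles these statements into MapReduce jobs that run across the cluster rather than in one process.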


Page 157: An Intro to Hadoop

Pig demo


Page 158: An Intro to Hadoop

Hive


Page 159: An Intro to Hadoop

Hive Basics

Authored by Facebook
SQL interface to Hadoop
Hive is low level
Hive-specific metadata


Page 160: An Intro to Hadoop

Sqoop

Sqoop (by Cloudera) … is higher level
sqoop --connect jdbc:mysql://database.example.com

http://www.cloudera.com/hadoop-sqoop

Page 161: An Intro to Hadoop

HBase

46Friday, January 15, 2010

Page 162: An Intro to Hadoop

HBase Basics

Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
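The BigTable-style data model is often described as a sparse, sorted, multidimensional map: (row key, column, timestamp) → value. A toy in-memory sketch of that idea follows; this is not the HBase client API, and all names here are illustrative:

```python
# Toy sketch of a BigTable/HBase-style versioned map (not the HBase API).
from collections import defaultdict

table = defaultdict(dict)  # row key -> {(column, timestamp): value}

def put(row, column, value, timestamp):
    table[row][(column, timestamp)] = value

def get(row, column):
    # Reads return the newest version of a cell by default.
    versions = [(ts, val) for (col, ts), val in table[row].items() if col == column]
    return max(versions)[1] if versions else None

put("row1", "info:name", "hadoop", timestamp=1)
put("row1", "info:name", "hbase", timestamp=2)  # newer version of the same cell
print(get("row1", "info:name"))  # prints: hbase
```

Keeping multiple timestamped versions per cell, rather than updating in place, is what lets the real system serve consistent reads while writes stream in.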


Page 163: An Intro to Hadoop

HBase Demo


Page 164: An Intro to Hadoop

on


Page 165: An Intro to Hadoop

Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use

Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality


Page 166: An Intro to Hadoop

EMR Languages

Java, Ruby, Perl, Python, PHP, R, C++


Page 167: An Intro to Hadoop

EMR Pricing

Pay for both columns additively (the EC2 instance charge plus the Elastic MapReduce surcharge)

Page 168: An Intro to Hadoop

EMR Functions

RunJobFlow: creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: provides status of your job flow request(s)
AddJobFlowSteps: adds additional steps to an already-running job flow
TerminateJobFlows: terminates a running job flow and shuts down all instances


Page 169: An Intro to Hadoop


Page 170: An Intro to Hadoop

Final Thoughts


Page 171: An Intro to Hadoop

“Ha! Your Hadoop is slower than my Hadoop!”

“Shut up! I’m reducing.”

http://www.flickr.com/photos/robryb/14826486/sizes/l/

Page 172: An Intro to Hadoop

The RDBMS is not dead
Has new friends (helpers)
NoSQL is taking the world by storm
No more throwing away perfectly good historical data


Page 173: An Intro to Hadoop

Does this Hadoop make my list look ugly?

http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/

Page 174: An Intro to Hadoop

Failure IS acceptable
Failure is inevitable
Go cheap!
Go distributed!


Page 175: An Intro to Hadoop

Use Hadoop!

http://www.flickr.com/photos/robryb/14826417/sizes/l/

Page 176: An Intro to Hadoop

Thanks


Page 177: An Intro to Hadoop

Hadoop: Divide and conquer gigantic data


Page 178: An Intro to Hadoop

Contact

@matthewmccull #HadoopIntro

matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough


Page 179: An Intro to Hadoop

Credits


Page 180: An Intro to Hadoop

http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com
