An Intro to Hadoop

Posted on 01-Nov-2014

Category: Technology

DESCRIPTION

Moore's Law has finally hit the wall, and CPU clock speeds have actually decreased in the last few years. The industry is reacting with hardware that carries an ever-growing number of cores and software that can leverage "grids" of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to easily take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is currently being leveraged in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook and Hulu, to name a few.

In this session, you'll start by learning the vocabulary unique to the distributed computing space. Next, we'll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We'll then examine the auto-replicating, redundant and self-healing HDFS filesystem. Finally, we'll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk's conclusion, you'll feel equipped to take on any massive data set and processing task your employer can throw at you.

TRANSCRIPT

Hadoop: Divide and conquer gigantic data

© Matthew McCullough, Ambient Ideas, LLC

Talk Metadata

Twitter: @matthewmccull / #HadoopIntro

Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com)

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis...

To appear in OSDI 2004

Seminal paper on MapReduce: http://labs.google.com/papers/mapreduce.html
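For reference, the paper goes on to write the two user-supplied functions as types, in its own notation:

map     (k1, v1)        → list(k2, v2)
reduce  (k2, list(v2))  → list(v2)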


MapReduce history

"A programming model and implementation for processing and generating large data sets"

Origins

A MapReduce implementation
Founded by Doug Cutting
Open source at Apache
http://en.wikipedia.org/wiki/Mapreduce

Today

Version 0.20
Many companies contributing; hundreds of companies using
http://hadoop.apache.org
http://wiki.apache.org/hadoop/PoweredBy

Why Hadoop?

$74.85: 1 TB of disk today vs 4 GB not long ago

$10,000 vs $1,000 hardware

Buy our way out of failure?

Failure is inevitable

Go cheap

(This concept doesn't work well at weddings or dinner parties.)

Bzzzt! Crrrkt! Sproinnnng!

Structured: RDBMS, relations spanning DBs

Unstructured: log files, images, documents

NOSQL

"Death of the RDBMS" is a lie. Big data tools are solving different issues than RDBMSes.

http://nosql.mypopescu.com/

Applications

Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Price search
Steganography
Analytics
Primes (code breaking)

Particle Physics

Large Hadron Collider: 15 petabytes of data per year

Financial Trends

Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical

Contextual Ads

A shopper (a customer) is looking at Product X
Other customers also bought Product Y; some did not buy
The shopper is told that …% of other customers who bought X also bought Y

Hadoop Family

Pig, Hive, Core/Common, Chukwa, HBase, HDFS, ZooKeeper

Hadoop Components

the Players (a.k.a. "the PlayAs"): ZooKeeper, Hive, Chukwa, Common, HBase, HDFS

MapReduce

map, then… um… reduce.

The process

Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collect and group pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
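Concretely, here is what that process looks like as a minimal word-count job against the Hadoop Java MapReduce API of this era (org.apache.hadoop.mapreduce). This is a sketch, not code from the talk; the class name and input/output paths are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per token in the line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit (word, 1)
        }
      }
    }
  }

  // Reduce(k2, list(v2)) -> list(v3): sum the grouped counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The grouping step between the two classes is the framework's shuffle: every (word, 1) pair with the same key lands in a single reduce call, exactly as in the worked example that follows.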

Start

"At four years old I acted out"
"Glad I am not four years old"
"Yet I still act like someone four"

Map

at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four

Grouping

out
acted
I, I, I
old, old
years, years
four, four, four
at
not
am
glad
someone
like
act
still
yet

Reduce

out (1)
acted (1)
years (2)
four (3)
at (1)
old (2)
not (1)
am (1)
I (3)
glad (1)
someone (1)
like (1)
act (1)
still (1)
yet (1)

MapReduce Demo

HDFS

HDFS Basics

Modeled on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
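To see those blocks and replicas from a program, something like this sketch against Hadoop's org.apache.hadoop.fs API would do; the NameNode URI and file paths here are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into the distributed store.
    Path src = new Path("access.log");        // local file (illustrative)
    Path dst = new Path("/data/access.log");  // HDFS destination
    fs.copyFromLocalFile(src, dst);

    // Ask the NameNode how the file was split into blocks and where replicas live.
    FileStatus status = fs.getFileStatus(dst);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}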

Data Overload

A destination for overflow from traditional RDBMSes or log files

HDFS Demo

Pig

Pig Basics

Yahoo-authored add-on tool
Analyzes large data sets
A high-level language for expressing data analysis programs: Pig Latin

Pig Sample

-- Load the Unix password file, splitting each line on ':'
A = load 'passwd' using PigStorage(':');
-- Keep only the first field (the user name) as id
B = foreach A generate $0 as id;
dump B;                  -- print to the console
store B into 'id.out';   -- write out to storage

(Sample from http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code)

Pig Demo

Hive

Hive Basics

Authored by Facebook
A SQL interface to Hadoop
Hive is low level
Hive-specific metadata

Sqoop

Sqoop, by Cloudera, is higher level

sqoop --connect jdbc:mysql://database.example.com/...

http://www.cloudera.com/hadoop-sqoop

HBase

HBase Basics

Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
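For a feel of the HBase client, here is a small sketch using the classic HTable API; the table name, column family, and row key are invented, and it assumes ZooKeeper and HDFS are already running, per the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate ZooKeeper.
    Configuration conf = HBaseConfiguration.create();

    // Assumes a table 'pages' with column family 'content' already exists.
    HTable table = new HTable(conf, "pages");

    // Write one cell: row key, column family:qualifier, value.
    Put put = new Put(Bytes.toBytes("com.example/index.html"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
            Bytes.toBytes("<html>hello</html>"));
    table.put(put);

    // Read the cell back by row key.
    Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
    byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}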

HBase Demo


Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
Launched in April 2009; save results to S3 buckets

http://aws.amazon.com/elasticmapreduce/#functionality

EMR Languages

EMR Pricing

Pay for both columns additively (the EC2 instance price plus the Elastic MapReduce surcharge)

EMR Functions

RunJobFlow: creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: provides status of your job flow request(s)
AddJobFlowSteps: adds an additional step to an already running job flow
TerminateJobFlows: terminates a running job flow and shuts down all instances
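For flavor, here is roughly what the first of those calls, RunJobFlow, looks like through the AWS SDK for Java's Elastic MapReduce client. The SDK arrived after this talk (the era shown here used the raw web service API); the credentials, bucket names, jar path, and instance settings below are all made up.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrWordCount {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));  // placeholder keys

    // One step: run a word-count jar sitting in S3 (illustrative paths).
    StepConfig step = new StepConfig()
        .withName("word count")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/wordcount.jar")
            .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));

    // RunJobFlow: create the job flow and start the EC2 instances.
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("hadoop intro demo")
        .withLogUri("s3://my-bucket/logs")
        .withSteps(step)
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)
            .withMasterInstanceType("m1.small")
            .withSlaveInstanceType("m1.small")
            .withTerminationProtected(false));

    RunJobFlowResult result = emr.runJobFlow(request);
    // DescribeJobFlows / AddJobFlowSteps / TerminateJobFlows take this id.
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}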

Final Thoughts

"Ha! Your Hadoop is slower than my Hadoop!"

"Shut up! I'm reducing."

The RDBMS is not dead; it has new friends and helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data

Does this Hadoop make my list look ugly?

Failure IS acceptable
Failure is inevitable
Go cheap!
Go distributed!

Use Hadoop!

Thanks

Hadoop: Divide and conquer gigantic data

Contact

Twitter: @matthewmccull / #HadoopIntro

matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

Credits

http://www.fontspace.com/ (David Rakowski, Tribeca font)
http://www.cern.ch/
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com

