An Intro to Hadoop
DESCRIPTION
Moore’s law has finally hit the wall, and CPU clock speeds have actually decreased in the last few years. The industry is reacting with hardware that sports an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is in production use at prominent names including Yahoo, IBM, Amazon, Adobe, AOL, Facebook, and Hulu, to name a few.

In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the auto-replicating, redundant, and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk’s conclusion, you’ll feel equipped to take on any massive data set and processing your employer can throw at you.

TRANSCRIPT
Hadoop: Divide and Conquer Gigantic Data
© Matthew McCullough, Ambient Ideas, LLC
Talk Metadata
Twitter: @matthewmccull, #HadoopIntro
Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004.
Seminal paper on MapReduce: http://labs.google.com/papers/mapreduce.html
MapReduce history
“A programming model and implementation for processing and generating large data sets”
Origins
MapReduce implementation
Founded by …
Open source at …
http://en.wikipedia.org/wiki/Mapreduce

Today
Version …; many companies contributing; …s of companies using
http://hadoop.apache.org
http://wiki.apache.org/hadoop/PoweredBy
Why Hadoop?
$74.85: 1 TB (vs 4 GB)
$10,000 vs $1,000
Buy our way out of failure
Failure is inevitable
Go cheap
(This concept doesn’t work well at weddings or dinner parties)
Bzzzt! Crrrkt! Sproinnnng!
Structured: RDBMS, O…
Unstructured: log files, images, relations spanning DBs, documents
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
http://nosql.mypopescu.com/
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Applications
Price search
Steganography
Analytics
Primes, code breaking
Particle Physics
Large Hadron Collider: … petabytes of data per year
http://www.cern.ch/
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical
Contextual Ads
(Diagram) Customer A, looking at Product W, is told that …% of other customers who bought W also bought X; some customers bought W but did not buy X.
Hadoop Family: the Players (“the PlayAs”)
Pig, Hive, Core, Common, Chukwa, HBase, HDFS, ZooKeeper
MapReduce
map, then… um… reduce
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
Start
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
Map
at | four | years | old | I | acted | out
glad | I | am | not | four | years | old
yet | I | still | act | like | someone | four
Grouping
out: [out]
acted: [acted]
years: [years, years]
four: [four, four, four]
at: [at]
old: [old, old]
not: [not]
am: [am]
I: [I, I, I]
glad: [glad]
someone: [someone]
like: [like]
act: [act]
still: [still]
yet: [yet]
Reduce
@FE����24E65���� J62CD����
7@FC���
2E����
@=5����
?@E����
2>����
!���
8=25����
D@>6@?6����=:<6����24E����DE:==����
J6E���
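The walkthrough above can be sketched in plain Java. To be clear, this is not the Hadoop API — just a self-contained simulation of the three phases (map, group/shuffle, reduce) run over the three demo sentences:

```java
import java.util.*;

public class WordCount {
    // Map phase: emit a (word, 1) pair for every word of every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Grouping ("shuffle"): collect all values that share the same key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce phase: sum each group's values; each key can be reduced independently.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "at four years old I acted out",
            "glad I am not four years old",
            "yet I still act like someone four");
        System.out.println(reduce(group(map(lines)))); // four=3, I=3, old=2, years=2, rest 1
    }
}
```

In real Hadoop the same shape appears as a `Mapper` and a `Reducer` class, and the framework performs the grouping step across machines.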
MapReduce Demo
HDFS
HDFS Basics
Modeled on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
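The two ideas above — fixed-size blocks and replicated placement — can be sketched as a toy model. The round-robin placement below is an illustrative assumption only; real HDFS placement is rack-aware and failure-driven:

```java
import java.util.*;

public class HdfsSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // HDFS-style 64 MB blocks
    static final int REPLICATION = 3;                  // default replication factor

    // Split a file of fileSize bytes into {offset, length} block descriptors.
    static List<long[]> blocks(long fileSize) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileSize; off += BLOCK_SIZE)
            out.add(new long[] { off, Math.min(BLOCK_SIZE, fileSize - off) });
        return out;
    }

    // Place each block on REPLICATION distinct datanodes (naive round-robin).
    static List<List<Integer>> place(int numBlocks, int numNodes) {
        List<List<Integer>> placements = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add((b + r) % numNodes);
            placements.add(replicas);
        }
        return placements;
    }

    public static void main(String[] args) {
        List<long[]> bs = blocks(1024L * 1024 * 1024);  // a 1 GB file
        System.out.println(bs.size() + " blocks");      // 16 blocks
        System.out.println(place(bs.size(), 5).get(0)); // replicas of block 0: [0, 1, 2]
    }
}
```

Because every block lives on several nodes, losing a cheap machine costs nothing: the namenode simply re-replicates the affected blocks elsewhere.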
Data Overload
Destination for overflow from traditional RDBMS or log files
HDFS Demo
Pig
Pig Basics
Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs: Pig Latin
Pig Sample
-- load the ':'-delimited passwd file
A = load 'passwd' using PigStorage(':');
-- keep only the first field, as id
B = foreach A generate $0 as id;
dump B;                 -- print to console
store B into 'id.out';  -- write results out
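For comparison, here is what that Pig script computes, expressed as a few lines of plain Java (a hypothetical `PasswdIds` helper; Pig, of course, also parallelizes the work across the cluster):

```java
import java.util.*;
import java.util.stream.*;

public class PasswdIds {
    // Split each ':'-delimited line and keep field $0, like the Pig script.
    static List<String> ids(List<String> lines) {
        return lines.stream()
                    .map(line -> line.split(":", -1)[0])
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> passwd = List.of(
            "root:x:0:0:root:/root:/bin/bash",
            "daemon:x:1:1:daemon:/usr/sbin:/bin/sh");
        System.out.println(ids(passwd)); // [root, daemon]
    }
}
```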
Pig demo
Hive
Hive Basics
Authored by …
SQL interface to Hadoop
Hive is low level
Hive-specific metadata
Sqoop
Sqoop, by …, is higher level
sqoop --connect jdbc:mysql://database.example.com/…
HBase
HBase Basics
Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
HBase Demo
Amazon Elastic MapReduce
Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
EMR Languages
…
EMR Pricing
EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds additional steps to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances
Final Thoughts
Ha! Your Hadoop is slower than my Hadoop!
Shut up! I’m reducing.
The RDBMS is not dead
Has new friends, helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data
Does this Hadoop make my list look ugly?
Failure IS acceptable
Failure is inevitable
Go cheap
Go distributed
Use Hadoop!
Thanks
Hadoop: Divide and Conquer Gigantic Data
Contact
Twitter: @matthewmccull, #HadoopIntro
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
Credits
http://www.fontspace.com (David Rakowski, Tribeca)
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Flickr photos by mandj98, joits, streetfly_jz, sybrenstuvel, lacklusters, robryb, mckaysavage
All others: iStockPhoto.com
Hadoop�:G:56�2?5�4@?BF6C�8:82?E:4�52E2
L�$2EE96H�$4�F==@F89���>3:6?E�!562D��##�
1Friday, January 15, 2010
Talk Metadata
+H:EE6C�>2EE96H>44F==� 25@@A!?EC@
$2EE96H�$4�F==@F89���>3:6?E�!562D��##�>2EE96H>�2>3:6?E:562D�4@>9EEA�2>3:6?E:562D�4@>3=@89EEA�DA62<6CC2E6�4@>>2EE96H�>44F==@F89
2Friday, January 15, 2010
3Friday, January 15, 2010
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay [email protected], [email protected]
Google, Inc.
AbstractMapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.
1 Introduction
Over the past five years, the authors and many others atGoogle have implemented hundreds of special-purposecomputations that process large amounts of raw data,such as crawled documents, web request logs, etc., tocompute various kinds of derived data, such as invertedindices, various representations of the graph structureof web documents, summaries of the number of pagescrawled per host, the set of most frequent queries in a
given day, etc. Most such computations are conceptu-ally straightforward. However, the input data is usuallylarge and the computations have to be distributed acrosshundreds or thousands of machines in order to finish ina reasonable amount of time. The issues of how to par-allelize the computation, distribute the data, and handlefailures conspire to obscure the original simple compu-tation with large amounts of complex code to deal withthese issues.As a reaction to this complexity, we designed a newabstraction that allows us to express the simple computa-tions we were trying to perform but hides the messy de-tails of parallelization, fault-tolerance, data distributionand load balancing in a library. Our abstraction is in-spired by the map and reduce primitives present in Lispand many other functional languages. We realized thatmost of our computations involved applying a map op-eration to each logical “record” in our input in order tocompute a set of intermediate key/value pairs, and thenapplying a reduce operation to all the values that sharedthe same key, in order to combine the derived data ap-propriately. Our use of a functional model with user-specified map and reduce operations allows us to paral-lelize large computations easily and to use re-executionas the primary mechanism for fault tolerance.The major contributions of this work are a simple andpowerful interface that enables automatic parallelizationand distribution of large-scale computations, combinedwith an implementation of this interface that achieveshigh performance on large clusters of commodity PCs.Section 2 describes the basic programming model andgives several examples. Section 3 describes an imple-mentation of the MapReduce interface tailored towardsour cluster-based computing environment. Section 4 de-scribes several refinements of the programming modelthat we have found useful. Section 5 has performancemeasurements of our implementation for a variety oftasks. 
Section 6 explores the use of MapReduce withinGoogle including our experiences in using it as the basis
To appear in OSDI 20041
Seminal paper on MapReducehttp://labs.google.com/papers/mapreduce.html
4Friday, January 15, 2010
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay [email protected], [email protected]
Google, Inc.
AbstractMapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis…

To appear in OSDI 2004
Friday, January 15, 2010
MapReduce history
“A programming model and implementation for processing and generating large data sets”
http://www.cern.ch/
Origins
MapReduce implementation
Founded by …
Open Source at …
http://en.wikipedia.org/wiki/Mapreduce
Today
Version …
Many companies contributing
…s of companies using
http://hadoop.apache.org
Today
http://wiki.apache.org/hadoop/PoweredBy
Today
Why Hadoop?
$74.85
1 TB
4 GB
vs $10,000
$1,000
vs
Buy our way out of failure
Failure is inevitable
Go cheap
This concept doesn’t work well at weddings or dinner parties
Bzzzt!
Crrrkt!
Sproinnnng!
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
Structured / Unstructured
Log files
RDBMS
…
Images
Relations spanning DBs
Documents
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
http://nosql.mypopescu.com/
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Applications
Price search
Steganography
Analytics
Primes
Code breaking
Particle Physics
Large Hadron Collider: … petabytes of data per year
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible / economical
Contextual Ads
Shopper (Customer A) is looking at Product X
Customer B, who bought X, also bought Y
Customer C also bought Y
Customer D did not buy
Customer A is told that … of other customers who bought X also bought Y
Hadoop Family
Hadoop Components
Pig
Hive
Core / Common
Chukwa
HBase
HDFS
ZooKeeper
the Players (the PlayAs)
ZooKeeper
Hive
Chukwa
Common
HBase
HDFS
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
MapReduce
map, then… um… reduce
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
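The signatures above can be sketched in plain Java. This is a toy, single-JVM illustration of the map → group → reduce flow; no Hadoop APIs are used, and the class and method names are invented for illustration:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: collapse all intermediate values sharing one key into a count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] dataset = {
            "At four years old I acted out",
            "Glad I am not four years old",
            "Yet I still act like someone four"
        };

        // Grouping: the "shuffle" Hadoop performs for you between the two phases.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : dataset) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }

        // Reduce each group independently -- in Hadoop this runs in parallel.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        System.out.println(counts);
    }
}
```

Because every map call is independent, and every reduce call touches only one key's group, both phases parallelize trivially — which is exactly what lets Hadoop scatter them across a grid.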
Map
Start:
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
Mapped:
at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four
Grouping
out: 1
acted: 1
I: 1, 1, 1
old: 1, 1
years: 1, 1
four: 1, 1, 1
at: 1
not: 1
am: 1
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
Reduce
out: 1
acted: 1
years: 2
four: 3
at: 1
old: 2
not: 1
am: 1
I: 3
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
MapReduce Demo
HDFS
HDFS Basics
Based on the Google File System (GFS) paper
Replicated data store
Stored in 64 MB blocks
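A toy sketch of those two ideas — fixed-size blocks and replication — assuming the classic defaults of 64 MB blocks and 3 replicas. The class, method names, and round-robin placement are invented for illustration; the real NameNode uses a rack-aware placement policy:

```java
import java.util.*;

public class HdfsSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the classic HDFS default
    static final int REPLICATION = 3;                  // each block stored on 3 nodes

    // How many blocks a file of the given size occupies (last block may be partial).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Place each block's replicas on distinct nodes, round-robin style --
    // a stand-in for the NameNode's real placement decisions.
    static List<List<Integer>> place(long fileSize, int nodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSize); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long oneGig = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGig) + " blocks"); // 16 blocks
        System.out.println(place(oneGig, 5).get(0));        // [0, 1, 2]
    }
}
```

Losing any single node leaves every block with two surviving replicas, which the filesystem can re-copy elsewhere — this is the "self-healing" property that makes commodity (cheap, failure-prone) hardware viable.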
Data Overload
A destination for overflow from traditional RDBMS or log files
HDFS Demo
Pig
Pig Basics
Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs
Pig Latin
http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code
Pig Sample
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
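To see what those four Pig Latin lines are doing, the same pipeline can be mirrored in plain Java — split each line of a passwd-style file on ':' and keep field $0 (the user id). The class name and sample rows below are invented for illustration:

```java
import java.util.*;

public class PigSampleEquivalent {
    // Equivalent of: B = foreach A generate $0 as id;
    // PigStorage(':') splits each record on colons; $0 is the first field.
    static List<String> firstField(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            ids.add(line.split(":")[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // A = load 'passwd' using PigStorage(':');
        // -- two sample rows stand in for the file here
        List<String> a = Arrays.asList(
            "root:x:0:0:root:/root:/bin/bash",
            "daemon:x:1:1:daemon:/usr/sbin:/bin/sh");
        System.out.println(firstField(a)); // dump B;  -> [root, daemon]
    }
}
```

The point of Pig is that the four-line script compiles down to MapReduce jobs that run this same logic in parallel across the cluster, without you writing any Java at all.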
Pig Demo
Hive
Hive Basics
Authored by …
SQL interface to Hadoop
Hive is low level
Hive-specific metadata
Sqoop
Sqoop (by Cloudera)
… is higher level
sqoop --connect jdbc:mysql://database.example.com
http://www.cloudera.com/hadoop-sqoop
HBase
HBase Basics
Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
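HBase's BigTable-style data model — a sorted map of row key → column → value, scanned by row-key range — can be approximated with nested sorted maps. The class, row, and column names below are invented for illustration and are not the HBase API:

```java
import java.util.*;

public class HBaseModelSketch {
    // row key -> (column -> value): rows kept sorted by key, as in BigTable/HBase
    private final NavigableMap<String, NavigableMap<String, String>> table = new TreeMap<>();

    void put(String row, String column, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    String get(String row, String column) {
        NavigableMap<String, String> cols = table.get(row);
        return cols == null ? null : cols.get(column);
    }

    // Range scan over sorted row keys -- the access pattern HBase is built around.
    SortedMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        HBaseModelSketch t = new HBaseModelSketch();
        t.put("row-001", "info:name", "alpha");
        t.put("row-002", "info:name", "beta");
        t.put("row-009", "info:name", "gamma");
        System.out.println(t.get("row-002", "info:name"));         // beta
        System.out.println(t.scan("row-001", "row-005").keySet()); // [row-001, row-002]
    }
}
```

In the real system this sorted map is sharded into regions served by many machines, with HDFS providing the replicated storage underneath and ZooKeeper coordinating which server owns which region.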
HBase Demo
Amazon Elastic MapReduce
Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality
EMR Languages
…
EMR Pricing
Pay for both columns additively
EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds an additional step to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances
Final Thoughts
Ha! Your Hadoop is slower than my Hadoop!
Shut up! I’m reducing.
http://www.flickr.com/photos/robryb/14826486/sizes/l/
The RDBMS is not dead
Has new friends (helpers)
NoSQL is taking the world by storm
No more throwing away perfectly good historical data
Does this Hadoop make my list look ugly?
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
Failure IS acceptable
Failure is inevitable
Go cheap
Go distributed
Use Hadoop!
http://www.flickr.com/photos/robryb/14826417/sizes/l/
Thanks
Hadoop: Divide and conquer gigantic data
Contact
Twitter: @matthewmccull / #HadoopIntro
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
Credits
http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com