An Intro to Hadoop

Posted on 01-Nov-2014

Category: Technology

DESCRIPTION

Moore's Law has finally hit the wall, and CPU clock speeds have actually decreased in the last few years. The industry is reacting with hardware that carries an ever-growing number of cores and software that can leverage "grids" of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to easily take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is currently being leveraged in production by prominent names such as Yahoo, IBM, Amazon, Adobe, AOL, Facebook and Hulu, to name a few.

In this session, you'll start by learning the vocabulary unique to the distributed computing space. Next, we'll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We'll then examine the auto-replicating, redundant and self-healing HDFS filesystem. Finally, we'll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk's conclusion, you'll feel equipped to take on any massive data set and processing task your employer can throw at you.

TRANSCRIPT

Hadoop: Divide and conquer gigantic data

© Matthew McCullough, Ambient Ideas, LLC

Talk Metadata

Twitter: @matthewmccull / #HadoopIntro

Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com)

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis...

To appear in OSDI 2004

Seminal paper on MapReduce: http://labs.google.com/papers/mapreduce.html
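For reference, the paper goes on to write the two user-supplied functions as types, in its own notation:

map     (k1, v1)        → list(k2, v2)
reduce  (k2, list(v2))  → list(v2)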


MapReduce history

"A programming model and implementation for processing and generating large data sets"

Origins

A MapReduce implementation
Founded by Doug Cutting
Open source at Apache
http://en.wikipedia.org/wiki/Mapreduce

Today

Version 0.20
Many companies contributing; hundreds of companies using
http://hadoop.apache.org
http://wiki.apache.org/hadoop/PoweredBy

Why Hadoop?

$74.85: 1 TB of disk today vs 4 GB not long ago

$10,000 vs $1,000 hardware

Buy our way out of failure?

Failure is inevitable

Go cheap

(This concept doesn't work well at weddings or dinner parties.)

Bzzzt! Crrrkt! Sproinnnng!

Structured: RDBMS, relations spanning DBs

Unstructured: log files, images, documents

NOSQL

"Death of the RDBMS" is a lie. Big data tools are solving different issues than RDBMSes.

http://nosql.mypopescu.com/

Applications

Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Price search
Steganography
Analytics
Primes (code breaking)

Particle Physics

Large Hadron Collider: 15 petabytes of data per year

Financial Trends

Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical

Contextual Ads

A shopper (a customer) is looking at Product X
Other customers also bought Product Y; some did not buy
The shopper is told that …% of other customers who bought X also bought Y

Hadoop Family

Pig, Hive, Core/Common, Chukwa, HBase, HDFS, ZooKeeper

Hadoop Components

the Players (a.k.a. "the PlayAs"): ZooKeeper, Hive, Chukwa, Common, HBase, HDFS

MapReduce

map, then… um… reduce.

The process

Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collect and group pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
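Concretely, here is what that process looks like as a minimal word-count job against the Hadoop Java MapReduce API of this era (org.apache.hadoop.mapreduce). This is a sketch, not code from the talk; the class name and input/output paths are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per token in the line.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit (word, 1)
        }
      }
    }
  }

  // Reduce(k2, list(v2)) -> list(v3): sum the grouped counts for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The grouping step between the two classes is the framework's shuffle: every (word, 1) pair with the same key lands in a single reduce call, exactly as in the worked example that follows.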

Start

"At four years old I acted out"
"Glad I am not four years old"
"Yet I still act like someone four"

Map

at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four

Grouping

out
acted
I, I, I
old, old
years, years
four, four, four
at
not
am
glad
someone
like
act
still
yet

Reduce

out (1)
acted (1)
years (2)
four (3)
at (1)
old (2)
not (1)
am (1)
I (3)
glad (1)
someone (1)
like (1)
act (1)
still (1)
yet (1)

MapReduce Demo

HDFS

HDFS Basics

Modeled on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
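To see those blocks and replicas from a program, something like this sketch against Hadoop's org.apache.hadoop.fs API would do; the NameNode URI and file paths here are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTour {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally picked up from core-site.xml.
    conf.set("fs.default.name", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into the distributed store.
    Path src = new Path("access.log");        // local file (illustrative)
    Path dst = new Path("/data/access.log");  // HDFS destination
    fs.copyFromLocalFile(src, dst);

    // Ask the NameNode how the file was split into blocks and where replicas live.
    FileStatus status = fs.getFileStatus(dst);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}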

Data Overload

A destination for overflow from traditional RDBMSes or log files

HDFS Demo

Pig

Pig Basics

Yahoo-authored add-on tool
Analyzes large data sets
A high-level language for expressing data analysis programs: Pig Latin

Pig Sample

-- Load the Unix password file, splitting each line on ':'
A = load 'passwd' using PigStorage(':');
-- Keep only the first field (the user name) as id
B = foreach A generate $0 as id;
dump B;                  -- print to the console
store B into 'id.out';   -- write out to storage

(Sample from http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code)

Pig Demo

Hive

Hive Basics

Authored by Facebook
A SQL interface to Hadoop
Hive is low level
Hive-specific metadata

Sqoop

Sqoop, by Cloudera, is higher level

sqoop --connect jdbc:mysql://database.example.com/...

http://www.cloudera.com/hadoop-sqoop

HBase

HBase Basics

Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
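For a feel of the HBase client, here is a small sketch using the classic HTable API; the table name, column family, and row key are invented, and it assumes ZooKeeper and HDFS are already running, per the slide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate ZooKeeper.
    Configuration conf = HBaseConfiguration.create();

    // Assumes a table 'pages' with column family 'content' already exists.
    HTable table = new HTable(conf, "pages");

    // Write one cell: row key, column family:qualifier, value.
    Put put = new Put(Bytes.toBytes("com.example/index.html"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
            Bytes.toBytes("<html>hello</html>"));
    table.put(put);

    // Read the cell back by row key.
    Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
    byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}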

HBase Demo


Amazon Elastic MapReduce

Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
Launched in April 2009; save results to S3 buckets

http://aws.amazon.com/elasticmapreduce/#functionality

EMR Languages

EMR Pricing

Pay for both columns additively (the EC2 instance price plus the Elastic MapReduce surcharge)

EMR Functions

RunJobFlow: creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: provides status of your job flow request(s)
AddJobFlowSteps: adds an additional step to an already running job flow
TerminateJobFlows: terminates a running job flow and shuts down all instances
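For flavor, here is roughly what the first of those calls, RunJobFlow, looks like through the AWS SDK for Java's Elastic MapReduce client. The SDK arrived after this talk (the era shown here used the raw web service API); the credentials, bucket names, jar path, and instance settings below are all made up.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrWordCount {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));  // placeholder keys

    // One step: run a word-count jar sitting in S3 (illustrative paths).
    StepConfig step = new StepConfig()
        .withName("word count")
        .withActionOnFailure("TERMINATE_JOB_FLOW")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/wordcount.jar")
            .withArgs("s3://my-bucket/input", "s3://my-bucket/output"));

    // RunJobFlow: create the job flow and start the EC2 instances.
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("hadoop intro demo")
        .withLogUri("s3://my-bucket/logs")
        .withSteps(step)
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(3)
            .withMasterInstanceType("m1.small")
            .withSlaveInstanceType("m1.small")
            .withTerminationProtected(false));

    RunJobFlowResult result = emr.runJobFlow(request);
    // DescribeJobFlows / AddJobFlowSteps / TerminateJobFlows take this id.
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}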

Final Thoughts

"Ha! Your Hadoop is slower than my Hadoop!"

"Shut up! I'm reducing."

The RDBMS is not dead; it has new friends and helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data

Does this Hadoop make my list look ugly?

Failure IS acceptable
Failure is inevitable
Go cheap!
Go distributed!

Use Hadoop!

Thanks

Hadoop: Divide and conquer gigantic data

Contact

Twitter: @matthewmccull / #HadoopIntro

matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough

Credits

http://www.fontspace.com/ (David Rakowski, Tribeca font)
http://www.cern.ch/
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com

