An Intro to Hadoop
DESCRIPTION
Moore’s law has finally hit the wall, and CPU clock speeds have actually decreased in the last few years. The industry is reacting with hardware that sports an ever-growing number of cores and with software that can leverage “grids” of distributed, often commodity, computing resources. But how is a traditional Java developer supposed to take advantage of this revolution? The answer is the Apache Hadoop family of projects. Hadoop is a suite of open source APIs at the forefront of this grid computing revolution and is considered the gold standard for the divide-and-conquer model of distributed problem crunching. The well-travelled Apache Hadoop framework is in production use at prominent names including Yahoo, IBM, Amazon, Adobe, AOL, Facebook, and Hulu, to name a few.

In this session, you’ll start by learning the vocabulary unique to the distributed computing space. Next, we’ll discover how to shape a problem and its processing to fit the Hadoop MapReduce framework. We’ll then examine the auto-replicating, redundant, and self-healing HDFS filesystem. Finally, we’ll fire up several Hadoop nodes and watch our calculation get devoured live by our Hadoop grid. At this talk’s conclusion, you’ll feel equipped to take on any massive data set and processing your employer can throw at you.

TRANSCRIPT
Hadoop: Divide and Conquer Gigantic Data
© Matthew McCullough, Ambient Ideas, LLC
Talk Metadata
Twitter: @matthewmccull, #HadoopIntro
Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004.
Seminal paper on MapReduce: http://labs.google.com/papers/mapreduce.html
MapReduce history
“A programming model and implementation for processing and generating large data sets”
Origins
MapReduce implementation
Founded by …
Open source at …
http://en.wikipedia.org/wiki/Mapreduce

Today
Version …; many companies contributing; …s of companies using
http://hadoop.apache.org
http://wiki.apache.org/hadoop/PoweredBy
Why Hadoop?
$74.85: 1 TB (vs 4 GB)
$10,000 vs $1,000
Buy our way out of failure
Failure is inevitable
Go cheap
(This concept doesn’t work well at weddings or dinner parties)
Bzzzt! Crrrkt! Sproinnnng!
Structured: RDBMS, O…
Unstructured: log files, images, relations spanning DBs, documents
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
http://nosql.mypopescu.com/
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Applications
Price search
Steganography
Analytics
Primes, code breaking
Particle Physics
Large Hadron Collider: … petabytes of data per year
http://www.cern.ch/
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical
Contextual Ads
(Diagram) Customer A, looking at Product W, is told that …% of other customers who bought W also bought X; some customers bought W but did not buy X.
Hadoop Family: the Players (“the PlayAs”)
Pig, Hive, Core, Common, Chukwa, HBase, HDFS, ZooKeeper
MapReduce
map, then… um… reduce
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
Start
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
Map
at | four | years | old | I | acted | out
glad | I | am | not | four | years | old
yet | I | still | act | like | someone | four
Grouping
out: [out]
acted: [acted]
years: [years, years]
four: [four, four, four]
at: [at]
old: [old, old]
not: [not]
am: [am]
I: [I, I, I]
glad: [glad]
someone: [someone]
like: [like]
act: [act]
still: [still]
yet: [yet]
Reduce
@FE����24E65���� J62CD����
7@FC���
2E����
@=5����
?@E����
2>����
!���
8=25����
D@>6@?6����=:<6����24E����DE:==����
J6E���
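The walkthrough above can be sketched in plain Java. To be clear, this is not the Hadoop API — just a self-contained simulation of the three phases (map, group/shuffle, reduce) run over the three demo sentences:

```java
import java.util.*;

public class WordCount {
    // Map phase: emit a (word, 1) pair for every word of every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Grouping ("shuffle"): collect all values that share the same key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce phase: sum each group's values; each key can be reduced independently.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        groups.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "at four years old I acted out",
            "glad I am not four years old",
            "yet I still act like someone four");
        System.out.println(reduce(group(map(lines)))); // four=3, I=3, old=2, years=2, rest 1
    }
}
```

In real Hadoop the same shape appears as a `Mapper` and a `Reducer` class, and the framework performs the grouping step across machines.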
MapReduce Demo
HDFS
HDFS Basics
Modeled on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
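The two ideas above — fixed-size blocks and replicated placement — can be sketched as a toy model. The round-robin placement below is an illustrative assumption only; real HDFS placement is rack-aware and failure-driven:

```java
import java.util.*;

public class HdfsSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // HDFS-style 64 MB blocks
    static final int REPLICATION = 3;                  // default replication factor

    // Split a file of fileSize bytes into {offset, length} block descriptors.
    static List<long[]> blocks(long fileSize) {
        List<long[]> out = new ArrayList<>();
        for (long off = 0; off < fileSize; off += BLOCK_SIZE)
            out.add(new long[] { off, Math.min(BLOCK_SIZE, fileSize - off) });
        return out;
    }

    // Place each block on REPLICATION distinct datanodes (naive round-robin).
    static List<List<Integer>> place(int numBlocks, int numNodes) {
        List<List<Integer>> placements = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add((b + r) % numNodes);
            placements.add(replicas);
        }
        return placements;
    }

    public static void main(String[] args) {
        List<long[]> bs = blocks(1024L * 1024 * 1024);  // a 1 GB file
        System.out.println(bs.size() + " blocks");      // 16 blocks
        System.out.println(place(bs.size(), 5).get(0)); // replicas of block 0: [0, 1, 2]
    }
}
```

Because every block lives on several nodes, losing a cheap machine costs nothing: the namenode simply re-replicates the affected blocks elsewhere.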
Data Overload
Destination for overflow from traditional RDBMS or log files
HDFS Demo
Pig
Pig Basics
Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs: Pig Latin
Pig Sample
-- load the ':'-delimited passwd file
A = load 'passwd' using PigStorage(':');
-- keep only the first field, as id
B = foreach A generate $0 as id;
dump B;                 -- print to console
store B into 'id.out';  -- write results out
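For comparison, here is what that Pig script computes, expressed as a few lines of plain Java (a hypothetical `PasswdIds` helper; Pig, of course, also parallelizes the work across the cluster):

```java
import java.util.*;
import java.util.stream.*;

public class PasswdIds {
    // Split each ':'-delimited line and keep field $0, like the Pig script.
    static List<String> ids(List<String> lines) {
        return lines.stream()
                    .map(line -> line.split(":", -1)[0])
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> passwd = List.of(
            "root:x:0:0:root:/root:/bin/bash",
            "daemon:x:1:1:daemon:/usr/sbin:/bin/sh");
        System.out.println(ids(passwd)); // [root, daemon]
    }
}
```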
Pig demo
Hive
Hive Basics
Authored by …
SQL interface to Hadoop
Hive is low level
Hive-specific metadata
Sqoop
Sqoop, by …, is higher level
sqoop --connect jdbc:mysql://database.example.com/…
HBase
HBase Basics
Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
HBase Demo
Amazon Elastic MapReduce
Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
EMR Languages
…
EMR Pricing
EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds additional steps to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances
Final Thoughts
Ha! Your Hadoop is slower than my Hadoop!
Shut up! I’m reducing.
The RDBMS is not dead
Has new friends, helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data
Does this Hadoop make my list look ugly?
Failure IS acceptable
Failure is inevitable
Go cheap
Go distributed
Use Hadoop!
Thanks
Hadoop: Divide and Conquer Gigantic Data
Contact
Twitter: @matthewmccull, #HadoopIntro
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
Credits
http://www.fontspace.com (David Rakowski, Tribeca)
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Flickr photos by mandj98, joits, streetfly_jz, sybrenstuvel, lacklusters, robryb, mckaysavage
All others: iStockPhoto.com
Hadoop�:G:56�2?5�4@?BF6C�8:82?E:4�52E2
L�$2EE96H�$4�F==@F89���>3:6?E�!562D��##�
1Friday, January 15, 2010
Talk Metadata
+H:EE6C�>2EE96H>44F==� 25@@A!?EC@
$2EE96H�$4�F==@F89���>3:6?E�!562D��##�>2EE96H>�2>3:6?E:562D�4@>9EEA�2>3:6?E:562D�4@>3=@89EEA�DA62<6CC2E6�4@>>2EE96H�>44F==@F89
2Friday, January 15, 2010
3Friday, January 15, 2010
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay [email protected], [email protected]
Google, Inc.
AbstractMapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.
1 Introduction
Over the past five years, the authors and many others atGoogle have implemented hundreds of special-purposecomputations that process large amounts of raw data,such as crawled documents, web request logs, etc., tocompute various kinds of derived data, such as invertedindices, various representations of the graph structureof web documents, summaries of the number of pagescrawled per host, the set of most frequent queries in a
given day, etc. Most such computations are conceptu-ally straightforward. However, the input data is usuallylarge and the computations have to be distributed acrosshundreds or thousands of machines in order to finish ina reasonable amount of time. The issues of how to par-allelize the computation, distribute the data, and handlefailures conspire to obscure the original simple compu-tation with large amounts of complex code to deal withthese issues.As a reaction to this complexity, we designed a newabstraction that allows us to express the simple computa-tions we were trying to perform but hides the messy de-tails of parallelization, fault-tolerance, data distributionand load balancing in a library. Our abstraction is in-spired by the map and reduce primitives present in Lispand many other functional languages. We realized thatmost of our computations involved applying a map op-eration to each logical “record” in our input in order tocompute a set of intermediate key/value pairs, and thenapplying a reduce operation to all the values that sharedthe same key, in order to combine the derived data ap-propriately. Our use of a functional model with user-specified map and reduce operations allows us to paral-lelize large computations easily and to use re-executionas the primary mechanism for fault tolerance.The major contributions of this work are a simple andpowerful interface that enables automatic parallelizationand distribution of large-scale computations, combinedwith an implementation of this interface that achieveshigh performance on large clusters of commodity PCs.Section 2 describes the basic programming model andgives several examples. Section 3 describes an imple-mentation of the MapReduce interface tailored towardsour cluster-based computing environment. Section 4 de-scribes several refinements of the programming modelthat we have found useful. Section 5 has performancemeasurements of our implementation for a variety oftasks. 
Section 6 explores the use of MapReduce withinGoogle including our experiences in using it as the basis
To appear in OSDI 20041
Seminal paper on MapReducehttp://labs.google.com/papers/mapreduce.html
4Friday, January 15, 2010
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay [email protected], [email protected]
Google, Inc.
AbstractMapReduce is a programming model and an associ-ated implementation for processing and generating largedata sets. Users specify a map function that processes akey/value pair to generate a set of intermediate key/valuepairs, and a reduce function that merges all intermediatevalues associated with the same intermediate key. Manyreal world tasks are expressible in this model, as shownin the paper.Programs written in this functional style are automati-cally parallelized and executed on a large cluster of com-modity machines. The run-time system takes care of thedetails of partitioning the input data, scheduling the pro-gram’s execution across a set of machines, handling ma-chine failures, and managing the required inter-machinecommunication. This allows programmers without anyexperience with parallel and distributed systems to eas-ily utilize the resources of a large distributed system.Our implementation of MapReduce runs on a largecluster of commodity machines and is highly scalable:a typical MapReduce computation processes many ter-abytes of data on thousands of machines. Programmersfind the system easy to use: hundreds ofMapReduce pro-grams have been implemented and upwards of one thou-sand MapReduce jobs are executed on Google’s clustersevery day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis…

To appear in OSDI 2004
Friday, January 15, 2010
MapReduce history
“A programming model and implementation for processing and generating large data sets”
http://www.cern.ch/
Origins
MapReduce implementation
Founded by …
Open Source at …
http://en.wikipedia.org/wiki/Mapreduce
Today
Version …
Many companies contributing
…s of companies using
http://hadoop.apache.org
Today
http://wiki.apache.org/hadoop/PoweredBy
Today
Why Hadoop?
$74.85
1 TB
4 GB
vs $10,000
$1,000
vs
Buy our way out of failure
Failure is inevitable
Go cheap
This concept doesn’t work well at weddings or dinner parties
Bzzzt!
Crrrkt!
Sproinnnng!
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
Structured / Unstructured
Log files
RDBMS
…
Images
Relations spanning DBs
Documents
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
http://nosql.mypopescu.com/
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
Applications
Price search
Steganography
Analytics
Primes
Code breaking
Particle Physics
Large Hadron Collider: … petabytes of data per year
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible / economical
Contextual Ads
Shopper (Customer A) is looking at Product X
Customer B, who bought X, also bought Y
Customer C also bought Y
Customer D did not buy
Customer A is told that … of other customers who bought X also bought Y
Hadoop Family
Hadoop Components
Pig
Hive
Core / Common
Chukwa
HBase
HDFS
ZooKeeper
the Players (the PlayAs)
ZooKeeper
Hive
Chukwa
Common
HBase
HDFS
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
MapReduce
map, then… um… reduce
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collects and groups pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
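The signatures above can be sketched in plain Java. This is a toy, single-JVM illustration of the map → group → reduce flow; no Hadoop APIs are used, and the class and method names are invented for illustration:

```java
import java.util.*;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input record.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: collapse all intermediate values sharing one key into a count.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] dataset = {
            "At four years old I acted out",
            "Glad I am not four years old",
            "Yet I still act like someone four"
        };

        // Grouping: the "shuffle" Hadoop performs for you between the two phases.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : dataset) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
            }
        }

        // Reduce each group independently -- in Hadoop this runs in parallel.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        System.out.println(counts);
    }
}
```

Because every map call is independent, and every reduce call touches only one key's group, both phases parallelize trivially — which is exactly what lets Hadoop scatter them across a grid.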
Map
Start:
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
Mapped:
at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four
Grouping
out: 1
acted: 1
I: 1, 1, 1
old: 1, 1
years: 1, 1
four: 1, 1, 1
at: 1
not: 1
am: 1
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
Reduce
out: 1
acted: 1
years: 2
four: 3
at: 1
old: 2
not: 1
am: 1
I: 3
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
MapReduce Demo
HDFS
HDFS Basics
Based on the Google File System (GFS) paper
Replicated data store
Stored in 64 MB blocks
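A toy sketch of those two ideas — fixed-size blocks and replication — assuming the classic defaults of 64 MB blocks and 3 replicas. The class, method names, and round-robin placement are invented for illustration; the real NameNode uses a rack-aware placement policy:

```java
import java.util.*;

public class HdfsSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the classic HDFS default
    static final int REPLICATION = 3;                  // each block stored on 3 nodes

    // How many blocks a file of the given size occupies (last block may be partial).
    static long blockCount(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Place each block's replicas on distinct nodes, round-robin style --
    // a stand-in for the NameNode's real placement decisions.
    static List<List<Integer>> place(long fileSize, int nodes) {
        List<List<Integer>> placement = new ArrayList<>();
        for (long b = 0; b < blockCount(fileSize); b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                replicas.add((int) ((b + r) % nodes));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long oneGig = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGig) + " blocks"); // 16 blocks
        System.out.println(place(oneGig, 5).get(0));        // [0, 1, 2]
    }
}
```

Losing any single node leaves every block with two surviving replicas, which the filesystem can re-copy elsewhere — this is the "self-healing" property that makes commodity (cheap, failure-prone) hardware viable.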
Data Overload
A destination for overflow from traditional RDBMS or log files
HDFS Demo
Pig
Pig Basics
Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs
Pig Latin
http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code
Pig Sample
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
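To see what those four Pig Latin lines are doing, the same pipeline can be mirrored in plain Java — split each line of a passwd-style file on ':' and keep field $0 (the user id). The class name and sample rows below are invented for illustration:

```java
import java.util.*;

public class PigSampleEquivalent {
    // Equivalent of: B = foreach A generate $0 as id;
    // PigStorage(':') splits each record on colons; $0 is the first field.
    static List<String> firstField(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            ids.add(line.split(":")[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        // A = load 'passwd' using PigStorage(':');
        // -- two sample rows stand in for the file here
        List<String> a = Arrays.asList(
            "root:x:0:0:root:/root:/bin/bash",
            "daemon:x:1:1:daemon:/usr/sbin:/bin/sh");
        System.out.println(firstField(a)); // dump B;  -> [root, daemon]
    }
}
```

The point of Pig is that the four-line script compiles down to MapReduce jobs that run this same logic in parallel across the cluster, without you writing any Java at all.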
Pig Demo
Hive
Hive Basics
Authored by …
SQL interface to Hadoop
Hive is low level
Hive-specific metadata
Sqoop
Sqoop (by Cloudera)
… is higher level
sqoop --connect jdbc:mysql://database.example.com
http://www.cloudera.com/hadoop-sqoop
HBase
HBase Basics
Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
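HBase's BigTable-style data model — a sorted map of row key → column → value, scanned by row-key range — can be approximated with nested sorted maps. The class, row, and column names below are invented for illustration and are not the HBase API:

```java
import java.util.*;

public class HBaseModelSketch {
    // row key -> (column -> value): rows kept sorted by key, as in BigTable/HBase
    private final NavigableMap<String, NavigableMap<String, String>> table = new TreeMap<>();

    void put(String row, String column, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    String get(String row, String column) {
        NavigableMap<String, String> cols = table.get(row);
        return cols == null ? null : cols.get(column);
    }

    // Range scan over sorted row keys -- the access pattern HBase is built around.
    SortedMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return table.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        HBaseModelSketch t = new HBaseModelSketch();
        t.put("row-001", "info:name", "alpha");
        t.put("row-002", "info:name", "beta");
        t.put("row-009", "info:name", "gamma");
        System.out.println(t.get("row-002", "info:name"));         // beta
        System.out.println(t.scan("row-001", "row-005").keySet()); // [row-001, row-002]
    }
}
```

In the real system this sorted map is sharded into regions served by many machines, with HDFS providing the replicated storage underneath and ZooKeeper coordinating which server owns which region.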
HBase Demo
Amazon Elastic MapReduce
Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality
EMR Languages
…
EMR Pricing
Pay for both columns additively
EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds an additional step to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances
Final Thoughts
Ha! Your Hadoop is slower than my Hadoop!
Shut up! I’m reducing.
http://www.flickr.com/photos/robryb/14826486/sizes/l/
The RDBMS is not dead
Has new friends (helpers)
NoSQL is taking the world by storm
No more throwing away perfectly good historical data
Does this Hadoop make my list look ugly?
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
Failure IS acceptable
Failure is inevitable
Go cheap
Go distributed
Use Hadoop!
http://www.flickr.com/photos/robryb/14826417/sizes/l/
Thanks
Hadoop: Divide and conquer gigantic data
Contact
Twitter: @matthewmccull / #HadoopIntro
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
Credits
http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com