TRANSCRIPT
Big Data and Apache Hadoop’s MapReduce
Michael Hahsler
Computer Science and Engineering, Southern Methodist University
January 23, 2012
Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Table of Contents
1 Introduction
2 Hadoop Distributed File System
3 MapReduce
4 More Examples
Single Machine
[Diagram: a single machine with its own CPU, memory, and disk]
Typical setup for data processing/mining!
What are the problems with big data?
Big Data Challenge
Data Sources
Internet and social networks
Sensors and science
Needed Infrastructure
Scale to thousands of CPUs
Run on cheap commodity hardware (fault-tolerant hardware is expensive!)
Automatically handle data replication and node failure
Data distribution and load balancing
Easy to implement solutions (thinking in terms of parallel computing is hard!)
Hardware: Cluster Architecture
[Diagram: several racks, each containing 16-64 nodes, each node with its own CPU, memory, and disk. Nodes in a rack connect to a rack switch (Rack 1 ... Rack n); the rack switches connect to a backbone switch. One node is marked with a node failure.]
Software: Apache Hadoop
What is Apache Hadoop?
A software framework that supports data-intensive (petabytes) distributed applications under a free license.
...inspired by Google’s MapReduce and Google File System (GFS).
Hadoop provides:
Distributed file system HDFS
API to work with MapReduce
Job configuration and scheduling
Track progress and utilization
Written in the Java programming language.
What Problems can be solved with Hadoop?
Characteristics
Processing can easily be done in parallel (simple computations)
Process large amounts of unstructured data
Running batch jobs is acceptable
Examples
Creating statistics (word counting)
Searching (distributed grep)
Sorting
Indexing (postings list)
Document clustering
Graph algorithms (e.g., PageRank)
Who uses Hadoop?
Adobe
Amazon.com
AOL
IBM
Microsoft
NY Times
Yahoo!
...
Source: http://wiki.apache.org/hadoop/PoweredBy
Table of Contents
1 Introduction
2 Hadoop Distributed File System
3 MapReduce
4 More Examples
The Hadoop Distributed File System (HDFS)
Source: http://hadoop.apache.org/common/docs/current/hdfs_design.html
Files are split into large blocks: 64 MB (a typical file system uses 4 kB blocks)
Replication (2-3x, on different racks) → fault-tolerance
Master node stores the metadata
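As a rough illustration of the block sizes above, here is a back-of-the-envelope Python sketch (the function name and defaults are mine, not part of Hadoop):

```python
# How many 64 MB blocks does a file occupy, and how much raw
# storage does it consume with 3x replication? (Illustrative only.)

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS block size

def hdfs_blocks(file_size, block_size=BLOCK_SIZE, replication=3):
    # Ceiling division: a partial last block still counts as one block.
    n_blocks = -(-file_size // block_size)
    # HDFS does not pad the last block, so raw usage is just size x replicas.
    raw_bytes = file_size * replication
    return n_blocks, raw_bytes

n, raw = hdfs_blocks(1 * 1024**3)   # a 1 GB file -> 16 blocks, 3 GB raw
```

A 1 GB file thus occupies 16 blocks and, with the default replication of 3, consumes 3 GB of raw cluster storage.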
Table of Contents
1 Introduction
2 Hadoop Distributed File System
3 MapReduce
4 More Examples
MapReduce
Basic Idea
1 Apply a map function to each input element and emit key/value pairs.
map(k1, v1)→ list(k2, v2)
2 Summarize the results for each key using a reduce function.
reduce(k2, list(v2))→ (k2, v3)
The user only has to specify the map and reduce functions and the framework takes care of the rest!
The user does not have to think about concurrency, load balancing, data distribution, or fault-tolerance!
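The contract above can be simulated in a few lines of Python (an illustrative in-memory sketch only; real Hadoop runs this distributed and is written in Java, and all names here are mine):

```python
# Minimal simulation of the MapReduce contract:
#   map(k1, v1) -> list(k2, v2)
#   reduce(k2, list(v2)) -> (k2, v3)
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input pair, collect (k2, v2) pairs.
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)       # shuffle: group values by key
    # Reduce phase: summarize the value list for each key.
    return {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}

# Example: total sales per city from (record-id, (city, amount)) pairs.
sales = [("r1", ("Dallas", 10)), ("r2", ("Austin", 5)), ("r3", ("Dallas", 7))]
result = run_mapreduce(sales,
                       map_fn=lambda k, v: [v],          # emit (city, amount)
                       reduce_fn=lambda k, vs: sum(vs))  # sum amounts per city
```

Here `result` is `{"Dallas": 17, "Austin": 5}`: the framework skeleton handles the grouping, and the user supplies only the two functions.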
Example: Word Counting
Count how often each word appears in a large number of documents.
void map(String name, String document) {
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1")
}

void reduce(String word, Iterator partialCounts) {
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int sum = 0;
  for each pc in partialCounts:
    sum += ParseInt(pc);
  Emit(word, AsString(sum));
}
Example: Word Counting
Input, split across three map tasks:
  the quick brown fox
  the lazy old dog
  the quick brown dog

map emits one (word, 1) pair per word:
  (the, 1) (quick, 1) (brown, 1) (fox, 1)
  (the, 1) (lazy, 1) (old, 1) (dog, 1)
  (the, 1) (quick, 1) (brown, 1) (dog, 1)

shuffle/sort groups the pairs by key:
  the → [1, 1, 1]; quick → [1, 1]; brown → [1, 1]; dog → [1, 1]; fox → [1]; lazy → [1]; old → [1]

reduce sums the list for each key:
  (the, 3) (quick, 2) (brown, 2) (dog, 2) (fox, 1) (lazy, 1) (old, 1)
Execution of a MapReduce Job
Source: Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004.
Fault-tolerance
[Diagram: the cluster from before (racks of 16-64 nodes behind rack and backbone switches). Block 1 is replicated on two nodes on different racks. The node running Task 1 fails; the master, which pings the workers, reschedules Task 1 on the other node holding Block 1.]
Master re-executes tasks for failed workers.
Locality
[Diagram: the cluster from before. Block 1 and Block 2 are stored on different nodes. The master schedules the task that needs Block 2 on the node holding that block, or at least on the same rack.]
Schedule map tasks near to the data to preserve network bandwidth.
More Properties
Load Balancing: Subdivide the work into many small tasks (many more tasks than workers). Since the master dynamically assigns tasks to idle nodes, this automatically provides load balancing.
Chaining: MapReduce operations can be chained (the output of one operation is the input of another) to solve more complicated computations.
Backup tasks: The reduce phase can only start after all map tasks are finished (use backup tasks to avoid “stragglers”).
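Chaining, as described above, can be sketched with the same in-memory simulation used earlier (illustrative only, not the Hadoop API; all names are mine): the first job counts words, and the second job consumes the first job's output to find the most frequent word.

```python
# Two chained MapReduce passes (in-memory simulation, not Hadoop's API).
from collections import defaultdict

def mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    return [(k2, reduce_fn(k2, vs)) for k2, vs in groups.items()]

docs = [("d1", "the quick brown fox"), ("d2", "the lazy old dog")]

# Job 1: word count -> [(word, count), ...]
counts = mapreduce(docs,
                   lambda name, text: [(w, 1) for w in text.split()],
                   lambda w, ones: sum(ones))

# Job 2: takes job 1's output as input. A single shared key funnels
# every (count, word) pair to one reducer, which keeps the maximum.
top = mapreduce(counts,
                lambda w, n: [("max", (n, w))],
                lambda _, pairs: max(pairs))
```

Here `top` is `[("max", (2, "the"))]`: "the" appears twice, more than any other word.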
Table of Contents
1 Introduction
2 Hadoop Distributed File System
3 MapReduce
4 More Examples
Example: Distributed Grep
Map task?
Reduce task?
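One possible answer, sketched as an in-memory Python simulation (not Hadoop's actual API; all names are mine): map emits a line whenever it matches the pattern, and reduce is the identity, simply passing the matched lines through.

```python
# Distributed grep as map/reduce (illustrative simulation).
import re

def grep_map(filename, lines, pattern):
    # Emit ((file, line number), line) for every matching line.
    for lineno, line in enumerate(lines, 1):
        if re.search(pattern, line):
            yield (filename, lineno), line

def grep_reduce(key, values):
    # Identity: each key has exactly one matched line to pass through.
    return values[0]

files = {"a.txt": ["foo bar", "baz"], "b.txt": ["barista"]}
matches = {}
for name, lines in files.items():          # each file is one map task
    for key, line in grep_map(name, lines, r"bar"):
        matches[key] = grep_reduce(key, [line])
```

The real win is that the map tasks run in parallel, one per file split, while the trivial reduce just collects the matches.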
Example: Creating an Inverted Index
Map task?
Reduce task?
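One possible answer, as an in-memory sketch (not Hadoop's API; all names are mine): map emits a (word, docID) pair for every word occurrence, and reduce sorts and deduplicates each word's document IDs into its postings list.

```python
# Inverted index as map/reduce (illustrative simulation).
from collections import defaultdict

def index_map(doc_id, text):
    # Emit (word, docID) for every word in the document.
    for word in text.split():
        yield word, doc_id

def index_reduce(word, doc_ids):
    # The postings list: sorted, deduplicated document IDs.
    return sorted(set(doc_ids))

docs = {"d1": "the quick fox", "d2": "the dog"}
groups = defaultdict(list)                  # shuffle: group docIDs by word
for doc_id, text in docs.items():
    for word, d in index_map(doc_id, text):
        groups[word].append(d)
index = {w: index_reduce(w, ds) for w, ds in groups.items()}
```

Here `index["the"]` is `["d1", "d2"]`: every word maps to the list of documents containing it, which is exactly the postings-list structure mentioned under Indexing earlier.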
Related Projects
Apache HBase: an open-source, non-relational, distributed database modeled after Google’s BigTable
Apache Hive: a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Provides a SQL-like query language called HiveQL.
Apache Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
Apache Lucene: a high-performance, full-featured text search engine library written entirely in Java.
Reading
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html
The Apache Hadoop Project
http://hadoop.apache.org/